[PATCH] sched/fair: Fix pelt lost idle time detection

Posted by Vincent Guittot 4 months ago
The check for some lost idle pelt time should always be done when
pick_next_task_fair() fails to pick a task, not only when it is called
from the fair fast path.

This case happens when the last running task on the rq is an RT or DL task.
When the latter goes to sleep and the \Sum of util_sum of the rq is at its
max value, we don't account for the lost idle time whereas we should.

Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

I noticed this while reviewing [1]

[1] https://lore.kernel.org/all/20251006105453.648473106@infradead.org/

 kernel/sched/fair.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3be1e2749ce..dd0ea01af730 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8920,21 +8920,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	return p;
 
 idle:
-	if (!rf)
-		return NULL;
-
-	new_tasks = sched_balance_newidle(rq, rf);
+	if (rf) {
+		new_tasks = sched_balance_newidle(rq, rf);
 
-	/*
-	 * Because sched_balance_newidle() releases (and re-acquires) rq->lock, it is
-	 * possible for any higher priority task to appear. In that case we
-	 * must re-start the pick_next_entity() loop.
-	 */
-	if (new_tasks < 0)
-		return RETRY_TASK;
+		/*
+		 * Because sched_balance_newidle() releases (and re-acquires)
+		 * rq->lock, it is possible for any higher priority task to
+		 * appear. In that case we must re-start the pick_next_entity()
+		 * loop.
+		 */
+		if (new_tasks < 0)
+			return RETRY_TASK;
 
-	if (new_tasks > 0)
-		goto again;
+		if (new_tasks > 0)
+			goto again;
+	}
 
 	/*
 	 * rq is about to be idle, check if we need to update the
-- 
2.43.0
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Samuel Wu 2 months, 1 week ago
On Wed, Oct 8, 2025 at 6:12 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> The check for some lost idle pelt time should be always done when
> pick_next_task_fair() fails to pick a task and not only when we call it
> from the fair fast-path.
>
> The case happens when the last running task on rq is a RT or DL task. When
> the latter goes to sleep and the /Sum of util_sum of the rq is at the max
> value, we don't account the lost of idle time whereas we should.
>
> Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>
> I Noticed this while reviewing [1]
>
> [1] https://lore.kernel.org/all/20251006105453.648473106@infradead.org/
>
>  kernel/sched/fair.c | 26 +++++++++++++-------------
>  1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3be1e2749ce..dd0ea01af730 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8920,21 +8920,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>         return p;
>
>  idle:
> -       if (!rf)
> -               return NULL;
> -
> -       new_tasks = sched_balance_newidle(rq, rf);
> +       if (rf) {
> +               new_tasks = sched_balance_newidle(rq, rf);
>
> -       /*
> -        * Because sched_balance_newidle() releases (and re-acquires) rq->lock, it is
> -        * possible for any higher priority task to appear. In that case we
> -        * must re-start the pick_next_entity() loop.
> -        */
> -       if (new_tasks < 0)
> -               return RETRY_TASK;
> +               /*
> +                * Because sched_balance_newidle() releases (and re-acquires)
> +                * rq->lock, it is possible for any higher priority task to
> +                * appear. In that case we must re-start the pick_next_entity()
> +                * loop.
> +                */
> +               if (new_tasks < 0)
> +                       return RETRY_TASK;
>
> -       if (new_tasks > 0)
> -               goto again;
> +               if (new_tasks > 0)
> +                       goto again;
> +       }
>
>         /*
>          * rq is about to be idle, check if we need to update the
> --
> 2.43.0
>

Hi all,

I am seeing a power regression with this patch. The test was performed
on a Pixel 6 running android-mainline (6.18.0-rc7 based); all scheduling
vendor hooks are disabled, and I'm not seeing any obvious sched code
differences compared to the vanilla upstream kernel. I am still actively
working to find a simpler sequence that reproduces this on mainline
Linux.

The Wattson tool is reporting an increased average power (~30-40%)
with the patch vs baseline (patch reverted). This regression
correlates with two other metrics:
1. Increased residency at higher CPU frequencies
2. A significant increase in sugov invocations (at least 10x)

Data in the tables below are collected from a 10s run of a bouncing
ball animation, with and without the patch.
+----------------------------+------------+---------------+
|                            | with patch | without patch |
+----------------------------+------------+---------------+
| sugov invocation rate (Hz) |      133.5 |           3.7 |
+----------------------------+------------+---------------+

+------------+-----------------------------+--------------------------------+
| Freq (kHz) | time spent, with patch (ms) | time spent, without patch (ms) |
+------------+-----------------------------+--------------------------------+
|     738000 |                        4869 |                           9869 |
|    1803000 |                        2936 |                             68 |
|    1598000 |                        1072 |                              0 |
|    1704000 |                         674 |                              0 |
|        ... |                         ... |                            ... |
+------------+-----------------------------+--------------------------------+

Thanks!
Sam
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Vincent Guittot 2 months ago
On Tue, 2 Dec 2025 at 01:24, Samuel Wu <wusamuel@google.com> wrote:
>
> On Wed, Oct 8, 2025 at 6:12 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > The check for some lost idle pelt time should be always done when
> > pick_next_task_fair() fails to pick a task and not only when we call it
> > from the fair fast-path.
> >
> > The case happens when the last running task on rq is a RT or DL task. When
> > the latter goes to sleep and the /Sum of util_sum of the rq is at the max
> > value, we don't account the lost of idle time whereas we should.
> >
> > Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >
> > I Noticed this while reviewing [1]
> >
> > [1] https://lore.kernel.org/all/20251006105453.648473106@infradead.org/
> >
> >  kernel/sched/fair.c | 26 +++++++++++++-------------
> >  1 file changed, 13 insertions(+), 13 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b3be1e2749ce..dd0ea01af730 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8920,21 +8920,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >         return p;
> >
> >  idle:
> > -       if (!rf)
> > -               return NULL;
> > -
> > -       new_tasks = sched_balance_newidle(rq, rf);
> > +       if (rf) {
> > +               new_tasks = sched_balance_newidle(rq, rf);
> >
> > -       /*
> > -        * Because sched_balance_newidle() releases (and re-acquires) rq->lock, it is
> > -        * possible for any higher priority task to appear. In that case we
> > -        * must re-start the pick_next_entity() loop.
> > -        */
> > -       if (new_tasks < 0)
> > -               return RETRY_TASK;
> > +               /*
> > +                * Because sched_balance_newidle() releases (and re-acquires)
> > +                * rq->lock, it is possible for any higher priority task to
> > +                * appear. In that case we must re-start the pick_next_entity()
> > +                * loop.
> > +                */
> > +               if (new_tasks < 0)
> > +                       return RETRY_TASK;
> >
> > -       if (new_tasks > 0)
> > -               goto again;
> > +               if (new_tasks > 0)
> > +                       goto again;
> > +       }
> >
> >         /*
> >          * rq is about to be idle, check if we need to update the
> > --
> > 2.43.0
> >
>
> Hi all,
>
> I am seeing a power regression I've observed with this patch. This

The problem is that this patch fixes wrong load tracking, which could be
underestimated on systems that become loaded.

> test was performed on Pixel 6 running android-mainline (6.18.0-rc7
> based); all scheduling vendor hooks are disabled, and I'm not seeing
> any obvious sched code differences compared to the vanilla upstream
> kernel. I am still actively working to see if I can find a simpler
> sequence to reproduce this on mainline Linux.
>
> The Wattson tool is reporting an increased average power (~30-40%)
> with the patch vs baseline (patch reverted). This regression

For a use case in particular ?

> correlates with two other metrics:
> 1. Increased residency at higher CPU frequencies
> 2. A significant increase in sugov invocations (at least 10x)
>
> Data in the tables below are collected from a 10s run of a bouncing
> ball animation, with and without the patch.
> +-----------------------------------+--------------+-------------------+
> |                                           | with patch |  without patch |
> +-----------------------------------+-------------+--------------------+
> | sugov invocation rate (Hz) |       133.5 |                   3.7 |
> +-----------------------------------+-------------+--------------------+
>
> +--------------+----------------------+----------------------+
> |                   |         with patch: |    without patch: |
> | Freq (kHz) | time spent (ms) |  time spent (ms) |
> +--------------+----------------------+----------------------+
> |     738000 |                   4869 |                  9869 |
> |   1803000 |                   2936 |                      68 |
> |   1598000 |                   1072 |                        0 |
> |   1704000 |                     674 |                        0 |
> |              ... |                        ... |                       ... |
> +--------------+----------------------+---------------------+
>
> Thanks!
> Sam
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Samuel Wu 2 months ago
On Fri, Dec 5, 2025 at 7:08 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Tue, 2 Dec 2025 at 01:24, Samuel Wu <wusamuel@google.com> wrote:
> >
> > On Wed, Oct 8, 2025 at 6:12 AM Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > The check for some lost idle pelt time should be always done when
> > > pick_next_task_fair() fails to pick a task and not only when we call it
> > > from the fair fast-path.
> > >
> > > The case happens when the last running task on rq is a RT or DL task. When
> > > the latter goes to sleep and the /Sum of util_sum of the rq is at the max
> > > value, we don't account the lost of idle time whereas we should.
> > >
> > > Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > >
> > > I Noticed this while reviewing [1]
> > >
> > > [1] https://lore.kernel.org/all/20251006105453.648473106@infradead.org/
> > >
> > >  kernel/sched/fair.c | 26 +++++++++++++-------------
> > >  1 file changed, 13 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index b3be1e2749ce..dd0ea01af730 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8920,21 +8920,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > >         return p;
> > >
> > >  idle:
> > > -       if (!rf)
> > > -               return NULL;
> > > -
> > > -       new_tasks = sched_balance_newidle(rq, rf);
> > > +       if (rf) {
> > > +               new_tasks = sched_balance_newidle(rq, rf);
> > >
> > > -       /*
> > > -        * Because sched_balance_newidle() releases (and re-acquires) rq->lock, it is
> > > -        * possible for any higher priority task to appear. In that case we
> > > -        * must re-start the pick_next_entity() loop.
> > > -        */
> > > -       if (new_tasks < 0)
> > > -               return RETRY_TASK;
> > > +               /*
> > > +                * Because sched_balance_newidle() releases (and re-acquires)
> > > +                * rq->lock, it is possible for any higher priority task to
> > > +                * appear. In that case we must re-start the pick_next_entity()
> > > +                * loop.
> > > +                */
> > > +               if (new_tasks < 0)
> > > +                       return RETRY_TASK;
> > >
> > > -       if (new_tasks > 0)
> > > -               goto again;
> > > +               if (new_tasks > 0)
> > > +                       goto again;
> > > +       }
> > >
> > >         /*
> > >          * rq is about to be idle, check if we need to update the
> > > --
> > > 2.43.0
> > >
> >
> > Hi all,
> >
> > I am seeing a power regression I've observed with this patch. This
>
> The problem is that this patch is about fixing a wrong load tracking
> which can be underestimated on systems that become loaded.
>

I feel the patch is doing the proper thing, which is the appropriate
bookkeeping when idle is the next task. I just wasn't 100% sure whether
there was some other unintentional indirect impact, so I thought it
would be good to send a report out and have another set of eyes look
over it.

> > test was performed on Pixel 6 running android-mainline (6.18.0-rc7
> > based); all scheduling vendor hooks are disabled, and I'm not seeing
> > any obvious sched code differences compared to the vanilla upstream
> > kernel. I am still actively working to see if I can find a simpler
> > sequence to reproduce this on mainline Linux.
> >
> > The Wattson tool is reporting an increased average power (~30-40%)
> > with the patch vs baseline (patch reverted). This regression
>
> For a use case in particular ?

This was for the BouncyBall apk, which is a bouncing ball animation. I'm
still trying to find a way to reproduce this on mainline Linux, but
haven't been able to yet. We've also been checking internally to root
cause this, but nothing definitive so far.

>
> > correlates with two other metrics:
> > 1. Increased residency at higher CPU frequencies
> > 2. A significant increase in sugov invocations (at least 10x)
> >
> > Data in the tables below are collected from a 10s run of a bouncing
> > ball animation, with and without the patch.
> > +-----------------------------------+--------------+-------------------+
> > |                                           | with patch |  without patch |
> > +-----------------------------------+-------------+--------------------+
> > | sugov invocation rate (Hz) |       133.5 |                   3.7 |
> > +-----------------------------------+-------------+--------------------+
> >
> > +--------------+----------------------+----------------------+
> > |                   |         with patch: |    without patch: |
> > | Freq (kHz) | time spent (ms) |  time spent (ms) |
> > +--------------+----------------------+----------------------+
> > |     738000 |                   4869 |                  9869 |
> > |   1803000 |                   2936 |                      68 |
> > |   1598000 |                   1072 |                        0 |
> > |   1704000 |                     674 |                        0 |
> > |              ... |                        ... |                       ... |
> > +--------------+----------------------+---------------------+
> >
> > Thanks!
> > Sam
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Samuel Wu 2 months ago
On Fri, Dec 5, 2025 at 4:54 PM Samuel Wu <wusamuel@google.com> wrote:
>
> On Fri, Dec 5, 2025 at 7:08 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Tue, 2 Dec 2025 at 01:24, Samuel Wu <wusamuel@google.com> wrote:
> > >
> > > On Wed, Oct 8, 2025 at 6:12 AM Vincent Guittot
> > > <vincent.guittot@linaro.org> wrote:
> > > >
> > > > The check for some lost idle pelt time should be always done when
> > > > pick_next_task_fair() fails to pick a task and not only when we call it
> > > > from the fair fast-path.
> > > >
> > > > The case happens when the last running task on rq is a RT or DL task. When
> > > > the latter goes to sleep and the /Sum of util_sum of the rq is at the max
> > > > value, we don't account the lost of idle time whereas we should.
> > > >
> > > > Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
> > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > > ---
> > > >
> > > > I Noticed this while reviewing [1]
> > > >
> > > > [1] https://lore.kernel.org/all/20251006105453.648473106@infradead.org/
> > > >
> > > >  kernel/sched/fair.c | 26 +++++++++++++-------------
> > > >  1 file changed, 13 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index b3be1e2749ce..dd0ea01af730 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8920,21 +8920,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > > >         return p;
> > > >
> > > >  idle:
> > > > -       if (!rf)
> > > > -               return NULL;
> > > > -
> > > > -       new_tasks = sched_balance_newidle(rq, rf);
> > > > +       if (rf) {
> > > > +               new_tasks = sched_balance_newidle(rq, rf);
> > > >
> > > > -       /*
> > > > -        * Because sched_balance_newidle() releases (and re-acquires) rq->lock, it is
> > > > -        * possible for any higher priority task to appear. In that case we
> > > > -        * must re-start the pick_next_entity() loop.
> > > > -        */
> > > > -       if (new_tasks < 0)
> > > > -               return RETRY_TASK;
> > > > +               /*
> > > > +                * Because sched_balance_newidle() releases (and re-acquires)
> > > > +                * rq->lock, it is possible for any higher priority task to
> > > > +                * appear. In that case we must re-start the pick_next_entity()
> > > > +                * loop.
> > > > +                */
> > > > +               if (new_tasks < 0)
> > > > +                       return RETRY_TASK;
> > > >
> > > > -       if (new_tasks > 0)
> > > > -               goto again;
> > > > +               if (new_tasks > 0)
> > > > +                       goto again;
> > > > +       }
> > > >
> > > >         /*
> > > >          * rq is about to be idle, check if we need to update the
> > > > --
> > > > 2.43.0
> > > >
> > >
> > > Hi all,
> > >
> > > I am seeing a power regression I've observed with this patch. This
> >
> > The problem is that this patch is about fixing a wrong load tracking
> > which can be underestimated on systems that become loaded.
> >
>
> I feel the patch is doing the proper thing, which is the appropriate
> bookkeeping when idle is the next task. I just wasn't 100% sure if
> there were some other indirect impact that was unintentional, so
> thought it would be good to send a report out and have another set of
> eyes look over it.
>
> > > test was performed on Pixel 6 running android-mainline (6.18.0-rc7
> > > based); all scheduling vendor hooks are disabled, and I'm not seeing
> > > any obvious sched code differences compared to the vanilla upstream
> > > kernel. I am still actively working to see if I can find a simpler
> > > sequence to reproduce this on mainline Linux.
> > >
> > > The Wattson tool is reporting an increased average power (~30-40%)
> > > with the patch vs baseline (patch reverted). This regression
> >
> > For a use case in particular ?
>
> This was for BouncyBall apk, which is a bouncing ball animation. I'm
> still trying to find a method to reproduce this on Linux, but still
> haven't been able to. Also we've been checking internally to root
> cause this, but nothing definitive yet.
>
> >
> > > correlates with two other metrics:
> > > 1. Increased residency at higher CPU frequencies
> > > 2. A significant increase in sugov invocations (at least 10x)
> > >
> > > Data in the tables below are collected from a 10s run of a bouncing
> > > ball animation, with and without the patch.
> > > +-----------------------------------+--------------+-------------------+
> > > |                                           | with patch |  without patch |
> > > +-----------------------------------+-------------+--------------------+
> > > | sugov invocation rate (Hz) |       133.5 |                   3.7 |
> > > +-----------------------------------+-------------+--------------------+
> > >
> > > +--------------+----------------------+----------------------+
> > > |                   |         with patch: |    without patch: |
> > > | Freq (kHz) | time spent (ms) |  time spent (ms) |
> > > +--------------+----------------------+----------------------+
> > > |     738000 |                   4869 |                  9869 |
> > > |   1803000 |                   2936 |                      68 |
> > > |   1598000 |                   1072 |                        0 |
> > > |   1704000 |                     674 |                        0 |
> > > |              ... |                        ... |                       ... |
> > > +--------------+----------------------+---------------------+
> > >
> > > Thanks!
> > > Sam

For completeness, here are some Perfetto traces that show threads
running, CPU frequency, and PELT-related stats. I've pinned the util_avg
track for a CPU on the little cluster, as the util_avg metric shows an
obvious increase (~66 with the patch vs ~3 without).

- with patch: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
- without patch:
https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Vincent Guittot 1 month, 3 weeks ago
Hi Samuel,


On Sat, 6 Dec 2025 at 02:20, Samuel Wu <wusamuel@google.com> wrote:
>
> On Fri, Dec 5, 2025 at 4:54 PM Samuel Wu <wusamuel@google.com> wrote:
> >
> > On Fri, Dec 5, 2025 at 7:08 AM Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Tue, 2 Dec 2025 at 01:24, Samuel Wu <wusamuel@google.com> wrote:
> > > >
> > > > On Wed, Oct 8, 2025 at 6:12 AM Vincent Guittot
> > > > <vincent.guittot@linaro.org> wrote:

[...]

> > > > >
> > > >
> > > > Hi all,
> > > >
> > > > I am seeing a power regression I've observed with this patch. This
> > >
> > > The problem is that this patch is about fixing a wrong load tracking
> > > which can be underestimated on systems that become loaded.
> > >
> >
> > I feel the patch is doing the proper thing, which is the appropriate
> > bookkeeping when idle is the next task. I just wasn't 100% sure if
> > there were some other indirect impact that was unintentional, so
> > thought it would be good to send a report out and have another set of
> > eyes look over it.
> >
> > > > test was performed on Pixel 6 running android-mainline (6.18.0-rc7
> > > > based); all scheduling vendor hooks are disabled, and I'm not seeing
> > > > any obvious sched code differences compared to the vanilla upstream
> > > > kernel. I am still actively working to see if I can find a simpler
> > > > sequence to reproduce this on mainline Linux.
> > > >
> > > > The Wattson tool is reporting an increased average power (~30-40%)
> > > > with the patch vs baseline (patch reverted). This regression
> > >
> > > For a use case in particular ?
> >
> > This was for BouncyBall apk, which is a bouncing ball animation. I'm
> > still trying to find a method to reproduce this on Linux, but still
> > haven't been able to. Also we've been checking internally to root
> > cause this, but nothing definitive yet.
> >
> > >
> > > > correlates with two other metrics:
> > > > 1. Increased residency at higher CPU frequencies
> > > > 2. A significant increase in sugov invocations (at least 10x)
> > > >
> > > > Data in the tables below are collected from a 10s run of a bouncing
> > > > ball animation, with and without the patch.
> > > > +-----------------------------------+--------------+-------------------+
> > > > |                                           | with patch |  without patch |
> > > > +-----------------------------------+-------------+--------------------+
> > > > | sugov invocation rate (Hz) |       133.5 |                   3.7 |
> > > > +-----------------------------------+-------------+--------------------+
> > > >
> > > > +--------------+----------------------+----------------------+
> > > > |                   |         with patch: |    without patch: |
> > > > | Freq (kHz) | time spent (ms) |  time spent (ms) |
> > > > +--------------+----------------------+----------------------+
> > > > |     738000 |                   4869 |                  9869 |
> > > > |   1803000 |                   2936 |                      68 |
> > > > |   1598000 |                   1072 |                        0 |
> > > > |   1704000 |                     674 |                        0 |
> > > > |              ... |                        ... |                       ... |
> > > > +--------------+----------------------+---------------------+
> > > >
> > > > Thanks!
> > > > Sam
>
> For completeness, here are some Perfetto traces that show threads
> running, CPU frequency, and PELT related stats. I've pinned the
> util_avg track for a CPU on the little cluster, as the util_avg metric
> shows an obvious increase (~66 vs ~3 for with patch and without patch
> respectively).

I was focusing on the update of rq->lost_idle_time, but it can't be
related because the CPUs are often idle in your trace. However, the
patch also updates rq->clock_idle and rq->clock_pelt_idle, which are
used to sync a cfs task's util_avg at wakeup when it is about to migrate
and the prev CPU is idle.

Before the patch, we could have stale clock_pelt_idle and clock_idle
values that were used to decay the util_avg of cfs tasks before
migrating them, which would end up decaying util_avg too much.

But I noticed that you plotted util_avg_rt, which doesn't use the two
fields above in mainline. Does the Android kernel make any changes to rt
util_avg tracking?

>
> - with patch: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> - without patch:
> https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Alex Hoh (賀振坤) 3 weeks, 2 days ago
On Sat, 2025-12-13 at 04:54 +0100, Vincent Guittot wrote:
> Hi Samuel,
> 
> 
> On Sat, 6 Dec 2025 at 02:20, Samuel Wu <wusamuel@google.com> wrote:
> > 
> > On Fri, Dec 5, 2025 at 4:54 PM Samuel Wu <wusamuel@google.com>
> > wrote:
> > > 
> > > On Fri, Dec 5, 2025 at 7:08 AM Vincent Guittot
> > > <vincent.guittot@linaro.org> wrote:
> > > > 
> > > > On Tue, 2 Dec 2025 at 01:24, Samuel Wu <wusamuel@google.com>
> > > > wrote:
> > > > > 
> > > > > On Wed, Oct 8, 2025 at 6:12 AM Vincent Guittot
> > > > > <vincent.guittot@linaro.org> wrote:
> 
> [...]
> 
> > > > > > 
> > > > > 
> > > > > Hi all,
> > > > > 
> > > > > I am seeing a power regression I've observed with this patch.
> > > > > This
> > > > 
> > > > The problem is that this patch is about fixing a wrong load
> > > > tracking
> > > > which can be underestimated on systems that become loaded.
> > > > 
> > > 
> > > I feel the patch is doing the proper thing, which is the
> > > appropriate
> > > bookkeeping when idle is the next task. I just wasn't 100% sure
> > > if
> > > there were some other indirect impact that was unintentional, so
> > > thought it would be good to send a report out and have another
> > > set of
> > > eyes look over it.
> > > 
> > > > > test was performed on Pixel 6 running android-mainline
> > > > > (6.18.0-rc7
> > > > > based); all scheduling vendor hooks are disabled, and I'm not
> > > > > seeing
> > > > > any obvious sched code differences compared to the vanilla
> > > > > upstream
> > > > > kernel. I am still actively working to see if I can find a
> > > > > simpler
> > > > > sequence to reproduce this on mainline Linux.
> > > > > 
> > > > > The Wattson tool is reporting an increased average power
> > > > > (~30-40%)
> > > > > with the patch vs baseline (patch reverted). This regression
> > > > 
> > > > For a use case in particular ?
> > > 
> > > This was for BouncyBall apk, which is a bouncing ball animation.
> > > I'm
> > > still trying to find a method to reproduce this on Linux, but
> > > still
> > > haven't been able to. Also we've been checking internally to root
> > > cause this, but nothing definitive yet.
> > > 
> > > > 
> > > > > correlates with two other metrics:
> > > > > 1. Increased residency at higher CPU frequencies
> > > > > 2. A significant increase in sugov invocations (at least 10x)
> > > > > 
> > > > > Data in the tables below are collected from a 10s run of a
> > > > > bouncing
> > > > > ball animation, with and without the patch.
> > > > > +-----------------------------------+--------------+---------
> > > > > ----------+
> > > > > >                                           | with patch | 
> > > > > > without patch |
> > > > > +-----------------------------------+-------------+----------
> > > > > ----------+
> > > > > > sugov invocation rate (Hz) |       133.5
> > > > > > |                   3.7 |
> > > > > +-----------------------------------+-------------+----------
> > > > > ----------+
> > > > > 
> > > > > +--------------+----------------------+----------------------
> > > > > +
> > > > > >                   |         with patch: |    without patch:
> > > > > > |
> > > > > > Freq (kHz) | time spent (ms) |  time spent (ms) |
> > > > > +--------------+----------------------+----------------------
> > > > > +
> > > > > >     738000 |                   4869 |                  9869
> > > > > > |
> > > > > >   1803000 |                   2936 |                     
> > > > > > 68 |
> > > > > >   1598000 |                   1072 |                       
> > > > > > 0 |
> > > > > >   1704000 |                     674
> > > > > > |                        0 |
> > > > > >              ... |                        ...
> > > > > > |                       ... |
> > > > > +--------------+----------------------+---------------------+
> > > > > 
> > > > > Thanks!
> > > > > Sam
> > 
> > For completeness, here are some Perfetto traces that show threads
> > running, CPU frequency, and PELT related stats. I've pinned the
> > util_avg track for a CPU on the little cluster, as the util_avg
> > metric
> > shows an obvious increase (~66 vs ~3 for with patch and without
> > patch
> > respectively).
> 
> I was focusing on the update of rq->lost_idle_time but it can't be
> related because the CPUs are often idle in your trace. But it also
> updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> sync cfs task util_avg at wakeup when it is about to migrate and the
> prev cpu is idle.
> 
> Before the patch we could have a stale clock_pelt_idle and clock_idle
> that were used to decay the util_avg of a cfs task before migrating it,
> which would end up decaying util_avg too much.
> 
> But I noticed that you pinned the util_avg_rt, which doesn't use the 2
> fields above in mainline. Does the android kernel make some changes to
> rt util_avg tracking?


I believe this change can indeed account for the observed increase in
RT util.

When prev is the last RT task on the rq, the scheduler proceeds through
the CFS pick-next flow. With this patch, that path advances
rq_clock_pelt to the current time. However, updating rq_clock_pelt at
this stage does not seem correct, as RT util has not yet been updated.

The RT util update actually occurs later in put_prev_set_next_task(),
and it relies on the original value of rq_clock_pelt as input. Since
rq_clock_pelt has already been overwritten by the time the RT util
update takes place, the original timestamp is lost.

As a result, the intended CPU/frequency capacity scaling behavior is
disrupted, causing RT util to increase more rapidly than expected. This
appears to be an unintended consequence introduced by the patch.

> 
> > 
> > - with patch:
> > https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > - without patch:
> > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Vincent Guittot 3 weeks, 2 days ago
Hi Alex,

On Friday 16 Jan 2026 at 06:51:03 (+0000), Alex Hoh (賀振坤) wrote:
> On Sat, 2025-12-13 at 04:54 +0100, Vincent Guittot wrote:
> > Hi Samuel,
> > 
> > 
> > On Sat, 6 Dec 2025 at 02:20, Samuel Wu <wusamuel@google.com> wrote:
> > > 
> > > On Fri, Dec 5, 2025 at 4:54 PM Samuel Wu <wusamuel@google.com>
> > > wrote:
> > > > 
> > > > On Fri, Dec 5, 2025 at 7:08 AM Vincent Guittot
> > > > <vincent.guittot@linaro.org> wrote:
> > > > > 
> > > > > On Tue, 2 Dec 2025 at 01:24, Samuel Wu <wusamuel@google.com>
> > > > > wrote:
> > > > > > 
> > > > > > On Wed, Oct 8, 2025 at 6:12 AM Vincent Guittot
> > > > > > <vincent.guittot@linaro.org> wrote:
> > 
> > [...]
> > 
> > > > > > > 
> > > > > > 
> > > > > > Hi all,
> > > > > > 
> > > > > > I am seeing a power regression with this patch. This
> > > > > 
> > > > > The problem is that this patch is about fixing a wrong load
> > > > > tracking
> > > > > which can be underestimated on systems that become loaded.
> > > > > 
> > > > 
> > > > I feel the patch is doing the proper thing, which is the
> > > > appropriate
> > > > bookkeeping when idle is the next task. I just wasn't 100% sure
> > > > if
> > > > there were some other indirect impact that was unintentional, so
> > > > thought it would be good to send a report out and have another
> > > > set of
> > > > eyes look over it.
> > > > 
> > > > > > test was performed on Pixel 6 running android-mainline
> > > > > > (6.18.0-rc7
> > > > > > based); all scheduling vendor hooks are disabled, and I'm not
> > > > > > seeing
> > > > > > any obvious sched code differences compared to the vanilla
> > > > > > upstream
> > > > > > kernel. I am still actively working to see if I can find a
> > > > > > simpler
> > > > > > sequence to reproduce this on mainline Linux.
> > > > > > 
> > > > > > The Wattson tool is reporting an increased average power
> > > > > > (~30-40%)
> > > > > > with the patch vs baseline (patch reverted). This regression
> > > > > 
> > > > > For a use case in particular ?
> > > > 
> > > > This was for BouncyBall apk, which is a bouncing ball animation.
> > > > I'm
> > > > still trying to find a method to reproduce this on Linux, but
> > > > still
> > > > haven't been able to. Also we've been checking internally to root
> > > > cause this, but nothing definitive yet.
> > > > 
> > > > > 
> > > > > > correlates with two other metrics:
> > > > > > 1. Increased residency at higher CPU frequencies
> > > > > > 2. A significant increase in sugov invocations (at least 10x)
> > > > > > 
> > > > > > Data in the tables below are collected from a 10s run of a
> > > > > > bouncing
> > > > > > ball animation, with and without the patch.
> > > > > > +----------------------------+------------+---------------+
> > > > > > |                            | with patch | without patch |
> > > > > > +----------------------------+------------+---------------+
> > > > > > | sugov invocation rate (Hz) |      133.5 |           3.7 |
> > > > > > +----------------------------+------------+---------------+
> > > > > >
> > > > > > +------------+-----------------+------------------+
> > > > > > |            |     with patch: |  without patch:  |
> > > > > > | Freq (kHz) | time spent (ms) | time spent (ms)  |
> > > > > > +------------+-----------------+------------------+
> > > > > > |     738000 |            4869 |             9869 |
> > > > > > |    1803000 |            2936 |               68 |
> > > > > > |    1598000 |            1072 |                0 |
> > > > > > |    1704000 |             674 |                0 |
> > > > > > |        ... |             ... |              ... |
> > > > > > +------------+-----------------+------------------+
> > > > > > 
> > > > > > Thanks!
> > > > > > Sam
> > > 
> > > For completeness, here are some Perfetto traces that show threads
> > > running, CPU frequency, and PELT related stats. I've pinned the
> > > util_avg track for a CPU on the little cluster, as the util_avg
> > > metric
> > > shows an obvious increase (~66 vs ~3 for with patch and without
> > > patch
> > > respectively).
> > 
> > I was focusing on the update of rq->lost_idle_time but it can't be
> > related because the CPUs are often idle in your trace. But it also
> > updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> > sync cfs task util_avg at wakeup when it is about to migrate and the
> > prev cpu is idle.
> > 
> > Before the patch we could have a stale clock_pelt_idle and clock_idle
> > that were used to decay the util_avg of a cfs task before migrating it,
> > which would end up decaying util_avg too much.
> > 
> > But I noticed that you pinned the util_avg_rt, which doesn't use the 2
> > fields above in mainline. Does the android kernel make some changes to
> > rt util_avg tracking?
> 
> 
> I believe this change can indeed account for the observed increase in
> RT util.
> 
> When prev is the last RT task on the rq, the scheduler proceeds through
> the CFS pick-next flow. With this patch, that path advances
> rq_clock_pelt to the current time. However, updating rq_clock_pelt at
> this stage does not seem correct, as RT util has not yet been updated.

You're right, put_prev happens after we update the pelt clock.

Samuel,

Could you try the fix below ?

---
 kernel/sched/fair.c | 6 ------
 kernel/sched/idle.c | 3 +++
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 108213e94158..bea71564d3da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8989,12 +8989,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 			goto again;
 	}

-	/*
-	 * rq is about to be idle, check if we need to update the
-	 * lost_idle_time of clock_pelt
-	 */
-	update_idle_rq_clock_pelt(rq);
-
 	return NULL;
 }

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 65eb8f8c1a5d..91af6acafe44 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -260,6 +260,9 @@ static void do_idle(void)
 {
 	int cpu = smp_processor_id();

+	/* Sync pelt clock with rq's one */
+	update_idle_rq_clock_pelt(cpu_rq(cpu));
+
 	/*
 	 * Check if we need to update blocked load
 	 */
--
2.43.0

> 
> The RT util update actually occurs later in put_prev_set_next_task(),
> and it relies on the original value of rq_clock_pelt as input. Since
> rq_clock_pelt has already been overwritten by the time the RT util
> update takes place, the original timestamp is lost.
> 
> As a result, the intended CPU/frequency capacity scaling behavior is
> disrupted, causing RT util to increase more rapidly than expected. This
> appears to be an unintended consequence introduced by the patch.
> 
> > 
> > > 
> > > - with patch:
> > > https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > > - without patch:
> > > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Vincent Guittot 3 weeks, 1 day ago
On Friday 16 Jan 2026 at 15:21:35 (+0100), Vincent Guittot wrote:
> Hi Alex,
> 
> On Friday 16 Jan 2026 at 06:51:03 (+0000), Alex Hoh (賀振坤) wrote:
> > On Sat, 2025-12-13 at 04:54 +0100, Vincent Guittot wrote:
> > > Hi Samuel,
> > > 
> > > 
> > > On Sat, 6 Dec 2025 at 02:20, Samuel Wu <wusamuel@google.com> wrote:
> > > > 
> > > > On Fri, Dec 5, 2025 at 4:54 PM Samuel Wu <wusamuel@google.com>
> > > > wrote:
> > > > > 
> > > > > On Fri, Dec 5, 2025 at 7:08 AM Vincent Guittot
> > > > > <vincent.guittot@linaro.org> wrote:
> > > > > > 
> > > > > > On Tue, 2 Dec 2025 at 01:24, Samuel Wu <wusamuel@google.com>
> > > > > > wrote:
> > > > > > > 
> > > > > > > On Wed, Oct 8, 2025 at 6:12 AM Vincent Guittot
> > > > > > > <vincent.guittot@linaro.org> wrote:
> > > 
> > > [...]
> > > 

[..]

> > > that
> > > were used to decay the util_avg of cfs task before migrating them
> > > which would end up decaying too much util_avg
> > > 
> > > But I noticed that you put the util_avg_rt which doesn't use the 2
> > > fields above in mainline. Does android kernel make some changes for
> > > rt
> > > util_avg tracking ?
> > 
> > 
> > I believe this change can indeed account for the observed increase in
> > RT util.
> > 
> > When prev is the last RT task on the rq, the scheduler proceeds through
> > the CFS pick-next flow. With this patch, that path advances
> > rq_clock_pelt to the current time. However, updating rq_clock_pelt at
> > this stage does not seem correct, as RT util has not yet been updated.
> 
> You're right, put prev happens after we updated pelt clock
> 
> Samuel,
> 
> Could you try the fix below ?

I was too quick; the correct fix should be:

---
 kernel/sched/fair.c | 6 ------
 kernel/sched/idle.c | 2 ++
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 108213e94158..bea71564d3da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8989,12 +8989,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 			goto again;
 	}

-	/*
-	 * rq is about to be idle, check if we need to update the
-	 * lost_idle_time of clock_pelt
-	 */
-	update_idle_rq_clock_pelt(rq);
-
 	return NULL;
 }

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 65eb8f8c1a5d..9df6654ef84f 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -468,6 +468,8 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 	scx_update_idle(rq, true, true);
 	schedstat_inc(rq->sched_goidle);
 	next->se.exec_start = rq_clock_task(rq);
+
+	update_idle_rq_clock_pelt(rq);
 }

 struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf)
--
2.43.0

> > 
> > The RT util update actually occurs later in put_prev_set_next_task(),
> > and it relies on the original value of rq_clock_pelt as input. Since
> > rq_clock_pelt has already been overwritten by the time the RT util
> > update takes place, the original timestamp is lost.
> > 
> > As a result, the intended CPU/frequency capacity scaling behavior is
> > disrupted, causing RT util to increase more rapidly than expected. This
> > appears to be an unintended consequence introduced by the patch.
> > 
> > > 
> > > > 
> > > > - with patch:
> > > > https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > > > - without patch:
> > > > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Alex Hoh (賀振坤) 2 weeks, 5 days ago
On Fri, 2026-01-16 at 17:13 +0100, Vincent Guittot wrote:
> On Friday 16 Jan 2026 at 15:21:35 (+0100), Vincent Guittot wrote:
> > Hi Alex,
> > 
> > On Friday 16 Jan 2026 at 06:51:03 (+0000), Alex Hoh (賀振坤) wrote:
> > > On Sat, 2025-12-13 at 04:54 +0100, Vincent Guittot wrote:
> > > > Hi Samuel,
> > > > 
> > > > 
> > > > On Sat, 6 Dec 2025 at 02:20, Samuel Wu <wusamuel@google.com>
> > > > wrote:
> > > > > 
> > > > > On Fri, Dec 5, 2025 at 4:54 PM Samuel Wu
> > > > > <wusamuel@google.com>
> > > > > wrote:
> > > > > > 
> > > > > > On Fri, Dec 5, 2025 at 7:08 AM Vincent Guittot
> > > > > > <vincent.guittot@linaro.org> wrote:
> > > > > > > 
> > > > > > > On Tue, 2 Dec 2025 at 01:24, Samuel Wu
> > > > > > > <wusamuel@google.com>
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > On Wed, Oct 8, 2025 at 6:12 AM Vincent Guittot
> > > > > > > > <vincent.guittot@linaro.org> wrote:
> > > > 
> > > > [...]
> > > > 
> 
> [..]
> 
> > > > that
> > > > were used to decay the util_avg of cfs task before migrating
> > > > them
> > > > which would end up decaying too much util_avg
> > > > 
> > > > But I noticed that you put the util_avg_rt which doesn't use
> > > > the 2
> > > > fields above in mainline. Does android kernel make some changes
> > > > for
> > > > rt
> > > > util_avg tracking ?
> > > 
> > > 
> > > I believe this change can indeed account for the observed
> > > increase in
> > > RT util.
> > > 
> > > When prev is the last RT task on the rq, the scheduler proceeds
> > > through
> > > the CFS pick-next flow. With this patch, that path advances
> > > rq_clock_pelt to the current time. However, updating
> > > rq_clock_pelt at
> > > this stage does not seem correct, as RT util has not yet been
> > > updated.
> > 
> > You're right, put prev happens after we updated pelt clock
> > 
> > Samuel,
> > 
> > Could you try the fix below ?
> 
> I have been too quick and the correct fix should be:
> 
> ---
>  kernel/sched/fair.c | 6 ------
>  kernel/sched/idle.c | 2 ++
>  2 files changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 108213e94158..bea71564d3da 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8989,12 +8989,6 @@ pick_next_task_fair(struct rq *rq, struct
> task_struct *prev, struct rq_flags *rf
>  			goto again;
>  	}
> 
> -	/*
> -	 * rq is about to be idle, check if we need to update the
> -	 * lost_idle_time of clock_pelt
> -	 */
> -	update_idle_rq_clock_pelt(rq);
> -
>  	return NULL;
>  }
> 
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 65eb8f8c1a5d..9df6654ef84f 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -468,6 +468,8 @@ static void set_next_task_idle(struct rq *rq,
> struct task_struct *next, bool fir
>  	scx_update_idle(rq, true, true);
>  	schedstat_inc(rq->sched_goidle);
>  	next->se.exec_start = rq_clock_task(rq);
> +
> +	update_idle_rq_clock_pelt(rq);
>  }
> 
>  struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags
> *rf)
> --
> 2.43.0
> 

This fix looks good to me, and in my local testing, 
RT utilization dropped significantly from 72 to 20.

> > > 
> > > The RT util update actually occurs later in
> > > put_prev_set_next_task(),
> > > and it relies on the original value of rq_clock_pelt as input.
> > > Since
> > > rq_clock_pelt has already been overwritten by the time the RT
> > > util
> > > update takes place, the original timestamp is lost.
> > > 
> > > As a result, the intended CPU/frequency capacity scaling behavior
> > > is
> > > disrupted, causing RT util to increase more rapidly than
> > > expected. This
> > > appears to be an unintended consequence introduced by the patch.
> > > 
> > > > 
> > > > > 
> > > > > - with patch:
> > > > > https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > > > > - without patch:
> > > > > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f

Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Vincent Guittot 2 weeks, 5 days ago
On Tue, 20 Jan 2026 at 06:46, Alex Hoh (賀振坤) <Alex.Hoh@mediatek.com> wrote:
>
> On Fri, 2026-01-16 at 17:13 +0100, Vincent Guittot wrote:
> > On Friday 16 Jan 2026 at 15:21:35 (+0100), Vincent Guittot wrote:
> > > Hi Alex,

[..]

>
> This fix looks good to me, and in my local testing,
> RT utilization dropped significantly from 72 to 20.

Thanks for your tests. Let's see if this fixes Samuel's regression too,
and then I will prepare a patch.

>
> > > >
> > > > The RT util update actually occurs later in
> > > > put_prev_set_next_task(),
> > > > and it relies on the original value of rq_clock_pelt as input.
> > > > Since
> > > > rq_clock_pelt has already been overwritten by the time the RT
> > > > util
> > > > update takes place, the original timestamp is lost.
> > > >
> > > > As a result, the intended CPU/frequency capacity scaling behavior
> > > > is
> > > > disrupted, causing RT util to increase more rapidly than
> > > > expected. This
> > > > appears to be an unintended consequence introduced by the patch.
> > > >
> > > > >
> > > > > >
> > > > > > - with patch:
> > > > > > https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > > > > >
> > > > > > - without patch:
> > > > > > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
> > > > > >
>
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Samuel Wu 2 weeks, 4 days ago
On Tue, Jan 20, 2026 at 12:36 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Tue, 20 Jan 2026 at 06:46, Alex Hoh (賀振坤) <Alex.Hoh@mediatek.com> wrote:
> >
> > On Fri, 2026-01-16 at 17:13 +0100, Vincent Guittot wrote:
> > > On Friday 16 Jan 2026 at 15:21:35 (+0100), Vincent Guittot wrote:
> > > > Hi Alex,
>
> [..]
>
> >
> > This fix looks good to me, and in my local testing,
> > RT utilization dropped significantly from 72 to 20.
>
> Thanks for your tests. Let see if this fixes Samuel's regression too
> and I will prepare a patch
>

Thanks Vincent - the fix also looks good to me, with RT util and power
both dropping significantly, back within baseline levels.

For completeness, here is Perfetto trace with this latest fix:
https://ui.perfetto.dev/#!/?s=f82243599e7a3e590ab91a1b3a7e15e3c17f481d

Feel free to add these tags as appropriate:
Reported-by: Samuel Wu <wusamuel@google.com>
Tested-by: Samuel Wu <wusamuel@google.com>
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Vincent Guittot 2 weeks, 4 days ago
On Wed, 21 Jan 2026 at 02:19, Samuel Wu <wusamuel@google.com> wrote:
>
> On Tue, Jan 20, 2026 at 12:36 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Tue, 20 Jan 2026 at 06:46, Alex Hoh (賀振坤) <Alex.Hoh@mediatek.com> wrote:
> > >
> > > On Fri, 2026-01-16 at 17:13 +0100, Vincent Guittot wrote:
> > > > On Friday 16 Jan 2026 at 15:21:35 (+0100), Vincent Guittot wrote:
> > > > > Hi Alex,
> >
> > [..]
> >
> > >
> > > This fix looks good to me, and in my local testing,
> > > RT utilization dropped significantly from 72 to 20.
> >
> > Thanks for your tests. Let see if this fixes Samuel's regression too
> > and I will prepare a patch
> >
>
> Thanks Vincent- fix also looks good to me, with RT util and power both
> dropped significantly and within baseline levels.
>
> For completeness, here is Perfetto trace with this latest fix:
> https://ui.perfetto.dev/#!/?s=f82243599e7a3e590ab91a1b3a7e15e3c17f481d
>
> Feel free to add these tags as appropriate:
> Reported-by: Samuel Wu <wusamuel@google.com>
> Tested-by: Samuel Wu <wusamuel@google.com>

Thanks.
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Qais Yousef 1 month, 2 weeks ago
On 12/13/25 04:54, Vincent Guittot wrote:

> > For completeness, here are some Perfetto traces that show threads
> > running, CPU frequency, and PELT related stats. I've pinned the
> > util_avg track for a CPU on the little cluster, as the util_avg metric
> > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > respectively).
> 
> I was focusing on the update of rq->lost_idle_time but It can't be
> related because the CPUs are often idle in your trace. But it also
> updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> sync cfs task util_avg at wakeup when it is about to migrate and prev
> cpu is idle.
> 
> before the patch we could have old clock_pelt_idle and clock_idle that
> were used to decay the util_avg of cfs task before migrating them
> which would ends up with decaying too much util_avg
> 
> But I noticed that you put the util_avg_rt which doesn't use the 2
> fields above in mainline. Does android kernel make some changes for rt
> util_avg tracking ?

We shouldn't be doing that. I think we were not updating RT pressure correctly
before the patch. The new values make more sense to me, as RT tasks are running
2ms every 10ms, and a util_avg_rt in the ~150 range makes more sense than the
previous 5-6 values. If we add the 20% headroom, that can easily saturate the
little core.

update_rt_rq_load_avg() uses rq_clock_pelt(), which takes into account the
lost_idle_time that we now ensure is updated in this corner case?

I guess the first question is which do you think is the right behavior for the
RT pressure?

And the 2nd question: does it make sense to take RT pressure into account in
schedutil if there are no fair tasks? It is supposed to compensate for the
time stolen by RT so that fair runs faster. But if there are no fair tasks,
the RT pressure is meaningless on its own, as RT tasks should run at max or
whatever value is specified by uclamp_min? I think in this test uclamp_min is
set to 0 by default for RT, so it is expected not to cause the frequency to
rise on its own.

> 
> >
> > - with patch: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > - without patch:
> > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Vincent Guittot 1 month ago
On Tue, 23 Dec 2025 at 18:27, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 12/13/25 04:54, Vincent Guittot wrote:
>
> > > For completeness, here are some Perfetto traces that show threads
> > > running, CPU frequency, and PELT related stats. I've pinned the
> > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > > respectively).
> >
> > I was focusing on the update of rq->lost_idle_time but It can't be
> > related because the CPUs are often idle in your trace. But it also
> > updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> > sync cfs task util_avg at wakeup when it is about to migrate and prev
> > cpu is idle.
> >
> > before the patch we could have old clock_pelt_idle and clock_idle that
> > were used to decay the util_avg of cfs task before migrating them
> > which would ends up with decaying too much util_avg
> >
> > But I noticed that you put the util_avg_rt which doesn't use the 2
> > fields above in mainline. Does android kernel make some changes for rt
> > util_avg tracking ?
>
> We shouldn't be doing that. I think we were not updating RT pressure correctly
> before the patch. The new values make more sense to me as RT tasks are running
> 2ms every 10ms and a util_avg_rt of ~150 range makes more sense than the
> previous 5-6 values? If we add the 20% headroom that can easily saturate the
> little core.
>
> update_rt_rq_load_avg() uses rq_clock_pelt() which takes into account the
> lost_idle_time which we now ensure is updated in this corner case?

But according to Samuel's trace, there is a lot of idle time and the
CPUs are far from being over-utilized, so we don't miss any lost idle
time, which only happens when util_avg reaches 1024 (util_sum >
47791490).

>
> I guess the first question is which do you think is the right behavior for the
> RT pressure?

RT pressure reflects the utilization of the CPU by RT tasks. The tracking
is simpler than for cfs (no per-task tracking, no migration tracking, no
util_est ...) but it is still the utilization of the CPU by RT and as a
result should be used when selecting the OPP. In reality this RT
utilization is almost always hidden, because we jump to a high OPP as
soon as an RT task is runnable, and max(rt util_avg, rt uclamp_min)
often/always returns uclamp_min.

>
> And 2nd question, does it make sense to take RT pressure into account in
> schedutil if there are no fair tasks? It is supposed to help compensate for the
> stolen time by RT so we make fair run faster. But if there are no fair tasks,
> the RT pressure is meaningless on its own as they should run at max or whatever
> value specified by uclamp_min? I think in this test uclamp_min is set to 0 by
> default for RT, so expected not to cause frequency to rise on their own.

Interesting that uclamp_min is set to 0 :-). And this is another
reason to keep rt util_avg, otherwise we might not have enough capacity
to run the RT task.

I'm worried that if we don't take rt util_avg into account, the
execution of the RT task will be delayed too much when a cfs task wakes
up and the OPP increase will not be enough.

Let's take the case of an RT task that runs 2ms every 10ms.
Now add a cfs task that also runs 2ms every 10ms but wakes up 8ms after
the RT task.

In order to run these 2 tasks (4ms every 10ms), we need a minimum
compute capacity which is rt util_avg + cfs util_avg.

>
> >
> > >
> > > - with patch: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > > - without patch:
> > > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Qais Yousef 3 weeks, 6 days ago
On 01/07/26 08:50, Vincent Guittot wrote:
> On Tue, 23 Dec 2025 at 18:27, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 12/13/25 04:54, Vincent Guittot wrote:
> >
> > > > For completeness, here are some Perfetto traces that show threads
> > > > running, CPU frequency, and PELT related stats. I've pinned the
> > > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > > > respectively).
> > >
> > > I was focusing on the update of rq->lost_idle_time but It can't be
> > > related because the CPUs are often idle in your trace. But it also
> > > updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> > > sync cfs task util_avg at wakeup when it is about to migrate and prev
> > > cpu is idle.
> > >
> > > before the patch we could have old clock_pelt_idle and clock_idle that
> > > were used to decay the util_avg of cfs task before migrating them
> > > which would ends up with decaying too much util_avg
> > >
> > > But I noticed that you put the util_avg_rt which doesn't use the 2
> > > fields above in mainline. Does android kernel make some changes for rt
> > > util_avg tracking ?
> >
> > We shouldn't be doing that. I think we were not updating RT pressure correctly
> > before the patch. The new values make more sense to me as RT tasks are running
> > 2ms every 10ms and a util_avg_rt of ~150 range makes more sense than the
> > previous 5-6 values? If we add the 20% headroom that can easily saturate the
> > little core.
> >
> > update_rt_rq_load_avg() uses rq_clock_pelt() which takes into account the
> > lost_idle_time which we now ensure is updated in this corner case?
> 
> But according to the Samuel's trace, there is a lot of idle time and
> the CPUs are far from being over utilized so we don't miss any lost
> idle time which happens when util_avg reaches 1024 (util_sum >
> 47791490)

Hmm, I didn't look deeply, but something is 'fixing' the values. The
clocks being updated?

> 
> >
> > I guess the first question is which do you think is the right behavior for the
> > RT pressure?
> 
> RT pressure reflects the utilization of CPU by RT tasks. The tracking
> is simpler than cfs (no per task tracking, no migration tracking, no
> util est ..) but it is the utilization of CPU by RT and as a result
> should be used when selecting OPP. The reality is that this RT
> utilization is almost always hidden because we jump to a high OPP as
> soon as a RT task is runnable and the max(rt util_avg, rt uclamp_min)
> often/always returns uclamp_min

Yes, but this value shouldn't be driving frequency selection for RT tasks.

> 
> >
> > And 2nd question, does it make sense to take RT pressure into account in
> > schedutil if there are no fair tasks? It is supposed to help compensate for the
> > stolen time by RT so we make fair run faster. But if there are no fair tasks,
> > the RT pressure is meaningless on its own as they should run at max or whatever
> > value specified by uclamp_min? I think in this test uclamp_min is set to 0 by
> > default for RT, so expected not to cause frequency to rise on their own.
> 
> Interesting that uclamp_min is set to 0 :-). And this is another
> reason to keep rt util_avg otherwise we might not have enough capacity
> to run RT task
> 
> I'm worried that if we don't take into account rt util_avg the
> execution of RT task will be too much delayed when cfs task will wake
> up and the increase of OPP will not be enough.

Oh, we should absolutely take it into account. I wasn't proposing not to
take it into account for fair, just to handle the corner case where there
are no fair tasks and RT is running at a higher frequency than what it has
requested. The design of RT is that it has no DVFS support but runs at a
constant frequency which can be controlled by uclamp_min. When there are no
fair tasks running, RT util_avg shouldn't cause RT to run higher, as the
util_avg is for fair OPP selection, not RT.

I think I now see better where the real problem lies. uclamp_max is set to
1024 by default for RT, which means RT tasks can run higher than what they
requested. I think RT tasks should ensure that uclamp_max is always equal to
uclamp_min to deliver on the constant-freq-request design, which I think is
a more graceful way to handle this corner case. If uclamp_max is set to 0,
then an RT task can't cause the frequency to rise on its own. As I wrote the
patch I realized we don't have explicit handling of tasks switching to/from
RT; I need to think about it. We should enforce the default rt uclamp_min
value when we switch to RT if the task is !user_defined. For uclamp_max, the
logic and the rules will become complicated with this proposal. Maybe we can
simplify the rules by forcing a policy switch to restore the task's
uclamp_min/max to the policy default values?

Only compile tested the patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be16911..0df03a419a53 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1575,6 +1575,11 @@ static void __uclamp_update_util_min_rt_default(struct task_struct *p)

        default_util_min = sysctl_sched_uclamp_util_min_rt_default;
        uclamp_se_set(uc_se, default_util_min, false);
+       /*
+        * UCLAMP_MAX should always follow UCLAMP_MIN for RT so that it
+        * requests a constant frequency which is the intended design
+        */
+       uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], default_util_min, false);
 }

 static void uclamp_update_util_min_rt_default(struct task_struct *p)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 0496dc29ed0f..94ec08a0693e 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -330,6 +330,13 @@ static int uclamp_validate(struct task_struct *p,
        if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
                util_max = attr->sched_util_max;

+               /*
+                * UCLAMP_MAX follows UCLAMP_MIN for RT to guarantee constant
+                * freq request.
+                */
+               if (unlikely(rt_task(p)))
+                       return -EINVAL;
+
                if (util_max + 1 > SCHED_CAPACITY_SCALE + 1)
                        return -EINVAL;
        }
@@ -388,9 +395,10 @@ static void __setscheduler_uclamp(struct task_struct *p,

                /*
                 * RT by default have a 100% boost value that could be modified
-                * at runtime.
+                * at runtime. RT should also always request a constant value,
+                * which means UCLAMP_MIN === UCLAMP_MAX.
                 */
-               if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
+               if (unlikely(rt_task(p)))
                        value = sysctl_sched_uclamp_util_min_rt_default;
                else
                        value = uclamp_none(clamp_id);
@@ -406,6 +414,12 @@ static void __setscheduler_uclamp(struct task_struct *p,
            attr->sched_util_min != -1) {
                uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
                              attr->sched_util_min, true);
+
+               /* UCLAMP_MAX follows UCLAMP_MIN for RT */
+               if (unlikely(rt_task(p))) {
+                       uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
+                                     attr->sched_util_min, true);
+               }
        }

        if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&

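To make the intent of the two hunks above concrete, here is a plain user-space
sketch of the proposed rule (names and the -1 "not requested" convention are
illustrative, borrowed from sched_attr; this carries none of the kernel
context):

```c
#include <assert.h>
#include <errno.h>

struct clamp_req { unsigned int min, max; };

/*
 * Model of the proposed RT uclamp rule: an explicit UCLAMP_MAX request
 * is rejected for RT tasks, and UCLAMP_MAX silently follows UCLAMP_MIN
 * so the task always requests one constant frequency.
 */
static int rt_uclamp_request(struct clamp_req *req, int util_min, int util_max)
{
	if (util_max != -1)
		return -EINVAL;	/* UCLAMP_MAX can't be set directly for RT */
	if (util_min < 0 || util_min > 1024)
		return -EINVAL;
	req->min = util_min;
	req->max = util_min;	/* max follows min: constant freq request */
	return 0;
}
```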
> 
> Let's take the case of a RT task that runs 2ms every 10ms.
> now we have a cfs task that runs 2ms every 10ms but wakes up 8ms after
> the RT task
> 
> In order to run these 2 tasks: 4ms every 10ms, we need to use a
> minimum compute capacity which is rt util_avg + cfs util_avg
> 
> >
> > >
> > > >
> > > > - with patch: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > > > - without patch:
> > > > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
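The 2ms-every-10ms figures discussed in this thread can be sanity-checked with
a tiny user-space PELT sketch. This is an assumption-laden model, not the
kernel's fixed-point implementation: a plain geometric per-ms moving average
with a 32ms half-life, scaled to 1024:

```c
#include <assert.h>
#include <math.h>

/*
 * Toy PELT model for a task running run_ms out of every period_ms.
 * Returns util_avg sampled at the end of the last period, i.e. after
 * the idle part of the cycle.
 */
static double pelt_util_avg(int run_ms, int period_ms, int nr_periods)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* per-ms decay factor */
	double util = 0.0;
	int p, ms;

	for (p = 0; p < nr_periods; p++)
		for (ms = 0; ms < period_ms; ms++)
			util = util * y + (ms < run_ms ? 1024.0 * (1.0 - y) : 0.0);

	return util;
}
```

For a 2ms/10ms task this converges to roughly 190 at the period boundary,
peaking a little above 220 right after the running phase, i.e. in the same
ballpark as the ~150 util_avg_rt discussed above and far from the ~5 values
seen before the fix.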
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Vincent Guittot 3 weeks, 2 days ago
On Mon, 12 Jan 2026 at 16:10, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/07/26 08:50, Vincent Guittot wrote:
> > On Tue, 23 Dec 2025 at 18:27, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 12/13/25 04:54, Vincent Guittot wrote:
> > >
> > > > > For completeness, here are some Perfetto traces that show threads
> > > > > running, CPU frequency, and PELT related stats. I've pinned the
> > > > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > > > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > > > > respectively).
> > > >
> > > > I was focusing on the update of rq->lost_idle_time but It can't be
> > > > related because the CPUs are often idle in your trace. But it also
> > > > updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> > > > sync cfs task util_avg at wakeup when it is about to migrate and prev
> > > > cpu is idle.
> > > >
> > > > before the patch we could have old clock_pelt_idle and clock_idle that
> > > > were used to decay the util_avg of cfs task before migrating them
> > > > which would ends up with decaying too much util_avg
> > > >
> > > > But I noticed that you put the util_avg_rt which doesn't use the 2
> > > > fields above in mainline. Does android kernel make some changes for rt
> > > > util_avg tracking ?
> > >
> > > We shouldn't be doing that. I think we were not updating RT pressure correctly
> > > before the patch. The new values make more sense to me as RT tasks are running
> > > 2ms every 10ms and a util_avg_rt of ~150 range makes more sense than the
> > > previous 5-6 values? If we add the 20% headroom that can easily saturate the
> > > little core.
> > >
> > > update_rt_rq_load_avg() uses rq_clock_pelt() which takes into account the
> > > lost_idle_time which we now ensure is updated in this corner case?
> >
> > But according to the Samuel's trace, there is a lot of idle time and
> > the CPUs are far from being over utilized so we don't miss any lost
> > idle time which happens when util_avg reaches 1024 (util_sum >
> > 47791490)
>
> Hmm I didn't look deeply but something is 'fixing' the values. clocks being
> updated?
>
> >
> > >
> > > I guess the first question is which do you think is the right behavior for the
> > > RT pressure?
> >
> > RT pressure reflects the utilization of CPU by RT tasks. The tracking
> > is simpler than cfs (no per task tracking, no migration tracking, no
> > util est ..) but it is the utilization of CPU by RT and as a result
> > should be used when selecting OPP. The reality is that this RT
> > utilization is almost always hidden because we jump to a high OPP as
> > soon as a RT task is runnable and the max(rt util_avg, rt uclamp_min)
> > often/always returns uclamp_min
>
> Yes, but this value shouldn't be driving RT tasks frequency selection.

I would add: as long as the min freq for RT is high enough.

>
> >
> > >
> > > And 2nd question, does it make sense to take RT pressure into account in
> > > schedutil if there are no fair tasks? It is supposed to help compensate for the
> > > stolen time by RT so we make fair run faster. But if there are no fair tasks,
> > > the RT pressure is meaningless on its own as they should run at max or whatever
> > > value specified by uclamp_min? I think in this test uclamp_min is set to 0 by
> > > default for RT, so expected not to cause frequency to rise on their own.
> >
> > Interesting that uclamp_min is set to 0 :-). And this is another
> > reason to keep rt util_avg otherwise we might not have enough capacity
> > to run RT task
> >
> > I'm worried that if we don't take into account rt util_avg the
> > execution of RT task will be too much delayed when cfs task will wake
> > up and the increase of OPP will not be enough.
>
> Oh, we should absolutely take it into account. I wasn't proposing not to take
> it into account for fair. Just to handle the corner case where there are no
> fair tasks and RT is running at a higher frequency than what it has requested.
> The design of RT is that it has no DVFS support but runs at a constant
> frequency which can be controlled by uclamp_min. When there are no fair tasks
> running, RT util_avg shouldn't cause RT to run higher, as util_avg is meant for
> fair OPP selection, not RT.

fair enough

>
> I think I see better where the real problem lies. uclamp_max is set to 1024 by
> default for RT, which means RT tasks can run higher than what they requested.
> I think RT tasks should ensure that uclamp_max is always equal to uclamp_min to
> deliver on the constant-freq-request design, which I think is a more graceful
> way to handle this corner case. If uclamp_max is set to 0, then these RT tasks
> can't cause the frequency to rise on their own. As I wrote the patch I realized
> we don't have explicit handling of tasks switching to/from RT. I need to
> think about it. We should enforce the default RT uclamp_min value when we
> switch to RT if the task is !user_defined. For uclamp_max, the logic and the
> rules will become complicated with this proposal. Maybe we can simplify the
> rules by forcing a policy switch to restore the task's uclamp_min/max to the
> policy default values?
>
> Only compile tested the patch
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 41ba0be16911..0df03a419a53 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1575,6 +1575,11 @@ static void __uclamp_update_util_min_rt_default(struct task_struct *p)
>
>         default_util_min = sysctl_sched_uclamp_util_min_rt_default;
>         uclamp_se_set(uc_se, default_util_min, false);
> +       /*
> +        * UCLAMP_MAX should always follow UCLAMP_MIN for RT so that it
> +        * requests a constant frequency which is the intended design
> +        */
> +       uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], default_util_min, false);
>  }
>
>  static void uclamp_update_util_min_rt_default(struct task_struct *p)
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index 0496dc29ed0f..94ec08a0693e 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -330,6 +330,13 @@ static int uclamp_validate(struct task_struct *p,
>         if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
>                 util_max = attr->sched_util_max;
>
> +               /*
> +                * UCLAMP_MAX follows UCLAMP_MIN for RT to guarantee constant
> +                * freq request.
> +                */
> +               if (unlikely(rt_task(p)))
> +                       return -EINVAL;
> +
>                 if (util_max + 1 > SCHED_CAPACITY_SCALE + 1)
>                         return -EINVAL;
>         }
> @@ -388,9 +395,10 @@ static void __setscheduler_uclamp(struct task_struct *p,
>
>                 /*
>                  * RT by default have a 100% boost value that could be modified
> -                * at runtime.
> +                * at runtime. RT should also always request a constant value,
> +                * which means UCLAMP_MIN === UCLAMP_MAX.
>                  */
> -               if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> +               if (unlikely(rt_task(p)))
>                         value = sysctl_sched_uclamp_util_min_rt_default;
>                 else
>                         value = uclamp_none(clamp_id);
> @@ -406,6 +414,12 @@ static void __setscheduler_uclamp(struct task_struct *p,
>             attr->sched_util_min != -1) {
>                 uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
>                               attr->sched_util_min, true);
> +
> +               /* UCLAMP_MAX follows UCLAMP_MIN for RT */
> +               if (unlikely(rt_task(p))) {
> +                       uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> +                                     attr->sched_util_min, true);
> +               }
>         }
>
>         if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
>
> >
> > Let's take the case of a RT task that runs 2ms every 10ms.
> > now we have a cfs task that runs 2ms every 10ms but wakes up 8ms after
> > the RT task
> >
> > In order to run these 2 tasks: 4ms every 10ms, we need to use a
> > minimum compute capacity which is rt util_avg + cfs util_avg
> >
> > >
> > > >
> > > > >
> > > > > - with patch: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > > > > - without patch:
> > > > > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Qais Yousef 1 month, 2 weeks ago
On 12/23/25 17:27, Qais Yousef wrote:
> On 12/13/25 04:54, Vincent Guittot wrote:
> 
> > > For completeness, here are some Perfetto traces that show threads
> > > running, CPU frequency, and PELT related stats. I've pinned the
> > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > > respectively).
> > 
> > I was focusing on the update of rq->lost_idle_time but It can't be
> > related because the CPUs are often idle in your trace. But it also
> > updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> > sync cfs task util_avg at wakeup when it is about to migrate and prev
> > cpu is idle.
> > 
> > before the patch we could have old clock_pelt_idle and clock_idle that
> > were used to decay the util_avg of cfs task before migrating them
> > which would ends up with decaying too much util_avg
> > 
> > But I noticed that you put the util_avg_rt which doesn't use the 2
> > fields above in mainline. Does android kernel make some changes for rt
> > util_avg tracking ?
> 
> We shouldn't be doing that. I think we were not updating RT pressure correctly
> before the patch. The new values make more sense to me as RT tasks are running
> 2ms every 10ms and a util_avg_rt of ~150 range makes more sense than the
> previous 5-6 values? If we add the 20% headroom that can easily saturate the
> little core.
> 
> update_rt_rq_load_avg() uses rq_clock_pelt() which takes into account the
> lost_idle_time which we now ensure is updated in this corner case?
> 
> I guess the first question is which do you think is the right behavior for the
> RT pressure?
> 
> And 2nd question, does it make sense to take RT pressure into account in
> schedutil if there are no fair tasks? It is supposed to help compensate for the
> stolen time by RT so we make fair run faster. But if there are no fair tasks,
> the RT pressure is meaningless on its own as they should run at max or whatever
> value specified by uclamp_min? I think in this test uclamp_min is set to 0 by
> default for RT, so expected not to cause frequency to rise on their own.

Something like this

--->8---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..80b526c40dab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8059,7 +8059,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
                                 unsigned long *min,
                                 unsigned long *max)
 {
-       unsigned long util, irq, scale;
+       unsigned long util = 0, irq, scale;
        struct rq *rq = cpu_rq(cpu);

        scale = arch_scale_cpu_capacity(cpu);
@@ -8100,9 +8100,14 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
         * CFS tasks and we use the same metric to track the effective
         * utilization (PELT windows are synchronized) we can directly add them
         * to obtain the CPU's actual utilization.
+        *
+        * Only applicable if there are fair tasks queued. When a new fair task
+        * wakes up it should trigger a freq update.
         */
-       util = util_cfs + cpu_util_rt(rq);
-       util += cpu_util_dl(rq);
+       if (rq->cfs.h_nr_queued) {
+               util = util_cfs + cpu_util_rt(rq);
+               util += cpu_util_dl(rq);
+       }

        /*
         * The maximum hint is a soft bandwidth requirement, which can be lower
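The effect of the hunk above can be modeled in plain C. This is a simplified
user-space sketch whose names mirror the patch but carry none of the kernel
context:

```c
#include <assert.h>

/*
 * Simplified model of the gated sum in the patch above: RT and DL
 * utilization only contribute to the effective utilization when fair
 * tasks are actually queued (cfs_h_nr_queued > 0); otherwise the
 * request falls back to 0 and uclamp_min alone drives RT frequency.
 */
static unsigned long effective_util(unsigned long util_cfs,
				    unsigned long util_rt,
				    unsigned long util_dl,
				    unsigned int cfs_h_nr_queued)
{
	unsigned long util = 0;

	if (cfs_h_nr_queued)
		util = util_cfs + util_rt + util_dl;

	return util;
}
```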
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Vincent Guittot 1 month ago
On Tue, 23 Dec 2025 at 19:49, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 12/23/25 17:27, Qais Yousef wrote:
> > On 12/13/25 04:54, Vincent Guittot wrote:
> >
> > > > For completeness, here are some Perfetto traces that show threads
> > > > running, CPU frequency, and PELT related stats. I've pinned the
> > > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > > > respectively).
> > >
> > > I was focusing on the update of rq->lost_idle_time but It can't be
> > > related because the CPUs are often idle in your trace. But it also
> > > updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> > > sync cfs task util_avg at wakeup when it is about to migrate and prev
> > > cpu is idle.
> > >
> > > before the patch we could have old clock_pelt_idle and clock_idle that
> > > were used to decay the util_avg of cfs task before migrating them
> > > which would ends up with decaying too much util_avg
> > >
> > > But I noticed that you put the util_avg_rt which doesn't use the 2
> > > fields above in mainline. Does android kernel make some changes for rt
> > > util_avg tracking ?
> >
> > We shouldn't be doing that. I think we were not updating RT pressure correctly
> > before the patch. The new values make more sense to me as RT tasks are running
> > 2ms every 10ms and a util_avg_rt of ~150 range makes more sense than the
> > previous 5-6 values? If we add the 20% headroom that can easily saturate the
> > little core.
> >
> > update_rt_rq_load_avg() uses rq_clock_pelt() which takes into account the
> > lost_idle_time which we now ensure is updated in this corner case?
> >
> > I guess the first question is which do you think is the right behavior for the
> > RT pressure?
> >
> > And 2nd question, does it make sense to take RT pressure into account in
> > schedutil if there are no fair tasks? It is supposed to help compensate for the
> > stolen time by RT so we make fair run faster. But if there are no fair tasks,
> > the RT pressure is meaningless on its own as they should run at max or whatever
> > value specified by uclamp_min? I think in this test uclamp_min is set to 0 by
> > default for RT, so expected not to cause frequency to rise on their own.
>
> Something like this
>
> --->8---
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da46c3164537..80b526c40dab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8059,7 +8059,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>                                  unsigned long *min,
>                                  unsigned long *max)
>  {
> -       unsigned long util, irq, scale;
> +       unsigned long util = 0, irq, scale;
>         struct rq *rq = cpu_rq(cpu);
>
>         scale = arch_scale_cpu_capacity(cpu);
> @@ -8100,9 +8100,14 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>          * CFS tasks and we use the same metric to track the effective
>          * utilization (PELT windows are synchronized) we can directly add them
>          * to obtain the CPU's actual utilization.
> +        *
> +        * Only applicable if there are fair tasks queued. When a new fair task
> +        * wakes up it should trigger a freq update.
>          */

As mentioned in my other reply, it might be too late to handle
everything and we don't speed up enough rt and dl tasks to ensure
margin to cfs

> -       util = util_cfs + cpu_util_rt(rq);
> -       util += cpu_util_dl(rq);
> +       if (rq->cfs.h_nr_queued) {
> +               util = util_cfs + cpu_util_rt(rq);
> +               util += cpu_util_dl(rq);
> +       }
>
>         /*
>          * The maximum hint is a soft bandwidth requirement, which can be lower
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Samuel Wu 1 month ago
On Tue, Dec 23, 2025 at 10:49 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 12/23/25 17:27, Qais Yousef wrote:
> > On 12/13/25 04:54, Vincent Guittot wrote:
> >
> > > > For completeness, here are some Perfetto traces that show threads
> > > > running, CPU frequency, and PELT related stats. I've pinned the
> > > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > > > respectively).
> > >
> > > I was focusing on the update of rq->lost_idle_time but It can't be
> > > related because the CPUs are often idle in your trace. But it also
> > > updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> > > sync cfs task util_avg at wakeup when it is about to migrate and prev
> > > cpu is idle.
> > >
> > > before the patch we could have old clock_pelt_idle and clock_idle that
> > > were used to decay the util_avg of cfs task before migrating them
> > > which would ends up with decaying too much util_avg
> > >
> > > But I noticed that you put the util_avg_rt which doesn't use the 2
> > > fields above in mainline. Does android kernel make some changes for rt
> > > util_avg tracking ?
> >
> > We shouldn't be doing that. I think we were not updating RT pressure correctly
> > before the patch. The new values make more sense to me as RT tasks are running
> > 2ms every 10ms and a util_avg_rt of ~150 range makes more sense than the
> > previous 5-6 values? If we add the 20% headroom that can easily saturate the
> > little core.
> >
> > update_rt_rq_load_avg() uses rq_clock_pelt() which takes into account the
> > lost_idle_time which we now ensure is updated in this corner case?
> >
> > I guess the first question is which do you think is the right behavior for the
> > RT pressure?
> >
> > And 2nd question, does it make sense to take RT pressure into account in
> > schedutil if there are no fair tasks? It is supposed to help compensate for the
> > stolen time by RT so we make fair run faster. But if there are no fair tasks,
> > the RT pressure is meaningless on its own as they should run at max or whatever
> > value specified by uclamp_min? I think in this test uclamp_min is set to 0 by
> > default for RT, so expected not to cause frequency to rise on their own.
>
> Something like this
>
> --->8---
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da46c3164537..80b526c40dab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8059,7 +8059,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>                                  unsigned long *min,
>                                  unsigned long *max)
>  {
> -       unsigned long util, irq, scale;
> +       unsigned long util = 0, irq, scale;
>         struct rq *rq = cpu_rq(cpu);
>
>         scale = arch_scale_cpu_capacity(cpu);
> @@ -8100,9 +8100,14 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>          * CFS tasks and we use the same metric to track the effective
>          * utilization (PELT windows are synchronized) we can directly add them
>          * to obtain the CPU's actual utilization.
> +        *
> +        * Only applicable if there are fair tasks queued. When a new fair task
> +        * wakes up it should trigger a freq update.
>          */
> -       util = util_cfs + cpu_util_rt(rq);
> -       util += cpu_util_dl(rq);
> +       if (rq->cfs.h_nr_queued) {
> +               util = util_cfs + cpu_util_rt(rq);
> +               util += cpu_util_dl(rq);
> +       }
>
>         /*
>          * The maximum hint is a soft bandwidth requirement, which can be lower

I tested Qais's patch with the same use case. There are 3 builds to
reference now:
1. baseline
2. baseline + Vincent's patch
3. baseline + Vincent's patch + Qais's patch

Scheduling behavior seems more proper now. I agree with Qais that RT
values seemed a little off in build 1, and also CPU freq values in
build 2 seemed too high for the given workload. With build 3, the
Wattson power values are back to baseline build 1, while keeping RT
util_avg in build 3 similar to that of build 2.

- build 1: https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
- build 2: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
- build 3: https://ui.perfetto.dev/#!/?s=c0a1585c31e51ab3e6b38d948829a4c0196f338c
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Qais Yousef 3 weeks, 6 days ago
On 01/05/26 12:08, Samuel Wu wrote:
> On Tue, Dec 23, 2025 at 10:49 AM Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 12/23/25 17:27, Qais Yousef wrote:
> > > On 12/13/25 04:54, Vincent Guittot wrote:
> > >
> > > > > For completeness, here are some Perfetto traces that show threads
> > > > > running, CPU frequency, and PELT related stats. I've pinned the
> > > > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > > > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > > > > respectively).
> > > >
> > > > I was focusing on the update of rq->lost_idle_time but It can't be
> > > > related because the CPUs are often idle in your trace. But it also
> > > > updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> > > > sync cfs task util_avg at wakeup when it is about to migrate and prev
> > > > cpu is idle.
> > > >
> > > > before the patch we could have old clock_pelt_idle and clock_idle that
> > > > were used to decay the util_avg of cfs task before migrating them
> > > > which would ends up with decaying too much util_avg
> > > >
> > > > But I noticed that you put the util_avg_rt which doesn't use the 2
> > > > fields above in mainline. Does android kernel make some changes for rt
> > > > util_avg tracking ?
> > >
> > > We shouldn't be doing that. I think we were not updating RT pressure correctly
> > > before the patch. The new values make more sense to me as RT tasks are running
> > > 2ms every 10ms and a util_avg_rt of ~150 range makes more sense than the
> > > previous 5-6 values? If we add the 20% headroom that can easily saturate the
> > > little core.
> > >
> > > update_rt_rq_load_avg() uses rq_clock_pelt() which takes into account the
> > > lost_idle_time which we now ensure is updated in this corner case?
> > >
> > > I guess the first question is which do you think is the right behavior for the
> > > RT pressure?
> > >
> > > And 2nd question, does it make sense to take RT pressure into account in
> > > schedutil if there are no fair tasks? It is supposed to help compensate for the
> > > stolen time by RT so we make fair run faster. But if there are no fair tasks,
> > > the RT pressure is meaningless on its own as they should run at max or whatever
> > > value specified by uclamp_min? I think in this test uclamp_min is set to 0 by
> > > default for RT, so expected not to cause frequency to rise on their own.
> >
> > Something like this
> >
> > --->8---
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index da46c3164537..80b526c40dab 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8059,7 +8059,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> >                                  unsigned long *min,
> >                                  unsigned long *max)
> >  {
> > -       unsigned long util, irq, scale;
> > +       unsigned long util = 0, irq, scale;
> >         struct rq *rq = cpu_rq(cpu);
> >
> >         scale = arch_scale_cpu_capacity(cpu);
> > @@ -8100,9 +8100,14 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> >          * CFS tasks and we use the same metric to track the effective
> >          * utilization (PELT windows are synchronized) we can directly add them
> >          * to obtain the CPU's actual utilization.
> > +        *
> > +        * Only applicable if there are fair tasks queued. When a new fair task
> > +        * wakes up it should trigger a freq update.
> >          */
> > -       util = util_cfs + cpu_util_rt(rq);
> > -       util += cpu_util_dl(rq);
> > +       if (rq->cfs.h_nr_queued) {
> > +               util = util_cfs + cpu_util_rt(rq);
> > +               util += cpu_util_dl(rq);
> > +       }
> >
> >         /*
> >          * The maximum hint is a soft bandwidth requirement, which can be lower
> 
> I tested Qais's patch with the same use case. There are 3 builds to
> reference now:
> 1. baseline
> 2. baseline + Vincent's patch
> 3. baseline + Vincent's patch + Qais's patch
> 
> Scheduling behavior seems more proper now. I agree with Qais that RT
> values seemed a little off in build 1, and also CPU freq values in
> build 2 seemed too high for the given workload. With build 3, the
> Wattson power values are back to baseline build 1, while keeping RT
> util_avg in build 3 similar to that of build 2.
> 
> - build 1: https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
> - build 2: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> - build 3: https://ui.perfetto.dev/#!/?s=c0a1585c31e51ab3e6b38d948829a4c0196f338c

Thanks for verifying! I think we can handle it better with uclamp_max; could
you give that other approach a try if possible please? Thanks!
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Samuel Wu 3 weeks, 2 days ago
On Mon, Jan 12, 2026 at 7:11 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/05/26 12:08, Samuel Wu wrote:
> > On Tue, Dec 23, 2025 at 10:49 AM Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 12/23/25 17:27, Qais Yousef wrote:
> > > > On 12/13/25 04:54, Vincent Guittot wrote:
> > > >
> > > > > > For completeness, here are some Perfetto traces that show threads
> > > > > > running, CPU frequency, and PELT related stats. I've pinned the
> > > > > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > > > > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > > > > > respectively).
> > > > >
> > > > > I was focusing on the update of rq->lost_idle_time but It can't be
> > > > > related because the CPUs are often idle in your trace. But it also
> > > > > updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> > > > > sync cfs task util_avg at wakeup when it is about to migrate and prev
> > > > > cpu is idle.
> > > > >
> > > > > before the patch we could have old clock_pelt_idle and clock_idle that
> > > > > were used to decay the util_avg of cfs task before migrating them
> > > > > which would ends up with decaying too much util_avg
> > > > >
> > > > > But I noticed that you put the util_avg_rt which doesn't use the 2
> > > > > fields above in mainline. Does android kernel make some changes for rt
> > > > > util_avg tracking ?
> > > >
> > > > We shouldn't be doing that. I think we were not updating RT pressure correctly
> > > > before the patch. The new values make more sense to me as RT tasks are running
> > > > 2ms every 10ms and a util_avg_rt of ~150 range makes more sense than the
> > > > previous 5-6 values? If we add the 20% headroom that can easily saturate the
> > > > little core.
> > > >
> > > > update_rt_rq_load_avg() uses rq_clock_pelt() which takes into account the
> > > > lost_idle_time which we now ensure is updated in this corner case?
> > > >
> > > > I guess the first question is which do you think is the right behavior for the
> > > > RT pressure?
> > > >
> > > > And 2nd question, does it make sense to take RT pressure into account in
> > > > schedutil if there are no fair tasks? It is supposed to help compensate for the
> > > > stolen time by RT so we make fair run faster. But if there are no fair tasks,
> > > > the RT pressure is meaningless on its own as they should run at max or whatever
> > > > value specified by uclamp_min? I think in this test uclamp_min is set to 0 by
> > > > default for RT, so expected not to cause frequency to rise on their own.
> > >
> > > Something like this
> > >
> > > --->8---
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index da46c3164537..80b526c40dab 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8059,7 +8059,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > >                                  unsigned long *min,
> > >                                  unsigned long *max)
> > >  {
> > > -       unsigned long util, irq, scale;
> > > +       unsigned long util = 0, irq, scale;
> > >         struct rq *rq = cpu_rq(cpu);
> > >
> > >         scale = arch_scale_cpu_capacity(cpu);
> > > @@ -8100,9 +8100,14 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > >          * CFS tasks and we use the same metric to track the effective
> > >          * utilization (PELT windows are synchronized) we can directly add them
> > >          * to obtain the CPU's actual utilization.
> > > +        *
> > > +        * Only applicable if there are fair tasks queued. When a new fair task
> > > +        * wakes up it should trigger a freq update.
> > >          */
> > > -       util = util_cfs + cpu_util_rt(rq);
> > > -       util += cpu_util_dl(rq);
> > > +       if (rq->cfs.h_nr_queued) {
> > > +               util = util_cfs + cpu_util_rt(rq);
> > > +               util += cpu_util_dl(rq);
> > > +       }
> > >
> > >         /*
> > >          * The maximum hint is a soft bandwidth requirement, which can be lower
> >
> > I tested Qais's patch with the same use case. There are 3 builds to
> > reference now:
> > 1. baseline
> > 2. baseline + Vincent's patch
> > 3. baseline + Vincent's patch + Qais's patch
> >
> > Scheduling behavior looks more correct now. I agree with Qais that the RT
> > values seemed a little off in build 1, and the CPU freq values in
> > build 2 also seemed too high for the given workload. With build 3, the
> > Wattson power values are back to the build 1 baseline, while the RT
> > util_avg in build 3 stays similar to that of build 2.
> >
> > - build 1: https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
> > - build 2: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > - build 3: https://ui.perfetto.dev/#!/?s=c0a1585c31e51ab3e6b38d948829a4c0196f338c
>
> Thanks for verifying! I think we can handle it better with uclamp_max; could
> you give that other approach a try if possible please? Thanks!

Thanks for the analysis and the experimental patch, Qais. Here is a trace
with baseline + the uclamp min/max patch, labeled as build 4.
build 4: https://ui.perfetto.dev/#!/?s=79cc53a2a13886571c498dea07e35525869a0657

It's hard for me to say what is right or wrong now, but the rt
util_avg and Wattson power estimates are both high and comparable to
those of build 2, which is the baseline with just Vincent's patch. One
new element with this build is that there seems to be more variance in
both the rt util_avg and power estimate metrics, which doesn't seem
desirable, especially for a relatively constant workload.
Re: [PATCH] sched/fair: Fix pelt lost idle time detection
Posted by Peter Zijlstra 4 months ago
On Wed, Oct 08, 2025 at 03:12:14PM +0200, Vincent Guittot wrote:
> The check for some lost idle pelt time should always be done when
> pick_next_task_fair() fails to pick a task, and not only when we call it
> from the fair fast-path.
> 
> The case happens when the last running task on the rq is an RT or DL task.
> When the latter goes to sleep and the \Sum of util_sum of the rq is at the
> max value, we don't account for the lost idle time whereas we should.
> 
> Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Durr, sorry about that. Let me go queue this.
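
[Editor's note] The failure mode described in the quoted commit message can be made concrete with a toy model. This is plain Python, not kernel code: the ~32ms half-life matches PELT's decay, but the function and numbers are illustrative assumptions only.

```python
# Toy illustration (not kernel code) of PELT decay during idle.
# Utilization decays geometrically with a ~32ms half-life; if an idle
# window is never accounted ("lost idle time"), the signal does not
# decay at all and stays pinned at its saturated value.

HALF_LIFE_MS = 32
DECAY_PER_MS = 0.5 ** (1.0 / HALF_LIFE_MS)  # per-millisecond decay factor

def decay_util(util, accounted_idle_ms):
    """Decay a utilization value over the idle time PELT actually saw."""
    return util * (DECAY_PER_MS ** accounted_idle_ms)

saturated = 1024.0  # rq utilization saturated when the last RT/DL task sleeps
print(round(decay_util(saturated, 8)))  # 8ms of idle accounted: ~861
print(round(decay_util(saturated, 0)))  # same 8ms window lost: still 1024
```

This is why skipping the lost-idle-time check when pick_next_task_fair() is entered from the slow-path (last runnable task an RT/DL one) left the rq utilization inflated until something else decayed it.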

> ---
> 
> I noticed this while reviewing [1]
> 
> [1] https://lore.kernel.org/all/20251006105453.648473106@infradead.org/
> 
>  kernel/sched/fair.c | 26 +++++++++++++-------------
>  1 file changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3be1e2749ce..dd0ea01af730 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8920,21 +8920,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  	return p;
>  
>  idle:
> -	if (!rf)
> -		return NULL;
> -
> -	new_tasks = sched_balance_newidle(rq, rf);
> +	if (rf) {
> +		new_tasks = sched_balance_newidle(rq, rf);
>  
> -	/*
> -	 * Because sched_balance_newidle() releases (and re-acquires) rq->lock, it is
> -	 * possible for any higher priority task to appear. In that case we
> -	 * must re-start the pick_next_entity() loop.
> -	 */
> -	if (new_tasks < 0)
> -		return RETRY_TASK;
> +		/*
> +		 * Because sched_balance_newidle() releases (and re-acquires)
> +		 * rq->lock, it is possible for any higher priority task to
> +		 * appear. In that case we must re-start the pick_next_entity()
> +		 * loop.
> +		 */
> +		if (new_tasks < 0)
> +			return RETRY_TASK;
>  
> -	if (new_tasks > 0)
> -		goto again;
> +		if (new_tasks > 0)
> +			goto again;
> +	}
>  
>  	/*
>  	 * rq is about to be idle, check if we need to update the
> -- 
> 2.43.0
>
[tip: sched/urgent] sched/fair: Fix pelt lost idle time detection
Posted by tip-bot2 for Vincent Guittot 3 months, 3 weeks ago
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     17e3e88ed0b6318fde0d1c14df1a804711cab1b5
Gitweb:        https://git.kernel.org/tip/17e3e88ed0b6318fde0d1c14df1a804711cab1b5
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Wed, 08 Oct 2025 15:12:14 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 14 Oct 2025 13:43:08 +02:00

sched/fair: Fix pelt lost idle time detection

The check for some lost idle pelt time should always be done when
pick_next_task_fair() fails to pick a task, and not only when we call it
from the fair fast-path.

The case happens when the last running task on the rq is an RT or DL task.
When the latter goes to sleep and the \Sum of util_sum of the rq is at the
max value, we don't account for the lost idle time whereas we should.

Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc0b7ce..cee1793 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8920,21 +8920,21 @@ simple:
 	return p;
 
 idle:
-	if (!rf)
-		return NULL;
-
-	new_tasks = sched_balance_newidle(rq, rf);
+	if (rf) {
+		new_tasks = sched_balance_newidle(rq, rf);
 
-	/*
-	 * Because sched_balance_newidle() releases (and re-acquires) rq->lock, it is
-	 * possible for any higher priority task to appear. In that case we
-	 * must re-start the pick_next_entity() loop.
-	 */
-	if (new_tasks < 0)
-		return RETRY_TASK;
+		/*
+		 * Because sched_balance_newidle() releases (and re-acquires)
+		 * rq->lock, it is possible for any higher priority task to
+		 * appear. In that case we must re-start the pick_next_entity()
+		 * loop.
+		 */
+		if (new_tasks < 0)
+			return RETRY_TASK;
 
-	if (new_tasks > 0)
-		goto again;
+		if (new_tasks > 0)
+			goto again;
+	}
 
 	/*
 	 * rq is about to be idle, check if we need to update the