In the newidle balance, rq->idle_stamp may be set to a non-zero value
if the balance cannot pull any task.

In the wakeup path, the code checks rq->idle_stamp, updates
rq->avg_idle, and then ends the CPU's idle status by setting
rq->idle_stamp to zero.

Apart from the wakeup path, the current code does not end the CPU's
idle status when a task is moved to an idle CPU, such as in fork/clone,
execve, or other cases. In order to get a more accurate rq->avg_idle,
we need to update it in more places (not only on wakeup).
This patch introduces a helper, update_rq_avg_idle(), and uses it in
enqueue_task(), so that rq->avg_idle is updated whenever a task is
moved to an idle CPU via:
-- wakeup
-- fork/clone
-- execve
-- idle balance
-- other cases
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
kernel/sched/core.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f10cfbdc228..2e3c4043de51 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2078,6 +2078,21 @@ unsigned long get_wchan(struct task_struct *p)
return ip;
}
+static void update_rq_avg_idle(struct rq *rq)
+{
+ if (rq->idle_stamp) {
+ u64 delta = rq_clock(rq) - rq->idle_stamp;
+ u64 max = 2*rq->max_idle_balance_cost;
+
+ update_avg(&rq->avg_idle, delta);
+
+ if (rq->avg_idle > max)
+ rq->avg_idle = max;
+
+ rq->idle_stamp = 0;
+ }
+}
+
void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -2100,6 +2115,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (sched_core_enabled(rq))
sched_core_enqueue(rq, p);
+
+ update_rq_avg_idle(rq);
}
/*
@@ -3645,18 +3662,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
p->sched_class->task_woken(rq, p);
rq_repin_lock(rq, rf);
}
-
- if (rq->idle_stamp) {
- u64 delta = rq_clock(rq) - rq->idle_stamp;
- u64 max = 2*rq->max_idle_balance_cost;
-
- update_avg(&rq->avg_idle, delta);
-
- if (rq->avg_idle > max)
- rq->avg_idle = max;
-
- rq->idle_stamp = 0;
- }
}
/*
--
2.40.1
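For reference, the arithmetic the helper performs can be exercised outside the kernel. The following is a minimal user-space sketch: `struct fake_rq` is a stand-in, not the real `struct rq`, and `update_avg()` is modeled here as the divisor-8 exponential moving average used in kernel/sched/core.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space model of the rq->avg_idle bookkeeping performed by the
 * update_rq_avg_idle() helper in the patch above. The struct is a
 * stand-in for illustration, not the kernel's struct rq.
 */
struct fake_rq {
	uint64_t idle_stamp;            /* 0 means "not idle" */
	uint64_t avg_idle;
	uint64_t max_idle_balance_cost;
};

/* Divisor-8 EWMA, as in kernel/sched/core.c's update_avg(). */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff / 8;
}

/* Mirrors the helper: update avg_idle, clamp it, end the idle period. */
static void update_rq_avg_idle(struct fake_rq *rq, uint64_t now)
{
	if (rq->idle_stamp) {
		uint64_t delta = now - rq->idle_stamp;
		uint64_t max = 2 * rq->max_idle_balance_cost;

		update_avg(&rq->avg_idle, delta);

		if (rq->avg_idle > max)
			rq->avg_idle = max;

		rq->idle_stamp = 0;
	}
}
```

The clamp to `2 * max_idle_balance_cost` keeps one very long idle period from inflating avg_idle so much that newidle balancing never bails out early afterwards.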
On Tue, 9 Dec 2025 at 10:46, Huang Shijie <shijie@os.amperecomputing.com> wrote:
>
> In the newidle balance, the rq->idle_stamp may set to a non-zero value
> if it cannot pull any task.
>
> In the wakeup, it will detect the rq->idle_stamp, and updates
> the rq->avg_idle, then ends the CPU idle status by setting rq->idle_stamp
> to zero.
>
> Besides the wakeup, current code does not end the CPU idle status
> when a task is moved to the idle CPU, such as fork/clone, execve,
> or other cases. In order to get more accurate rq->avg_idle,
> we need to update it at more places(not only the wakeup).
>
> This patch introduces a helper: update_rq_avg_idle().
> And uses it in enqueue_task(), so it will update the rq->avg_idle
> when a task is moved to an idle CPU at:
> -- wakeup
> -- fork/clone
> -- execve
> -- idle balance
> -- other cases
>
> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
> kernel/sched/core.c | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9f10cfbdc228..2e3c4043de51 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2078,6 +2078,21 @@ unsigned long get_wchan(struct task_struct *p)
> return ip;
> }
>
> +static void update_rq_avg_idle(struct rq *rq)
> +{
> + if (rq->idle_stamp) {
> + u64 delta = rq_clock(rq) - rq->idle_stamp;
> + u64 max = 2*rq->max_idle_balance_cost;
> +
> + update_avg(&rq->avg_idle, delta);
> +
> + if (rq->avg_idle > max)
> + rq->avg_idle = max;
> +
> + rq->idle_stamp = 0;
> + }
> +}
> +
> void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> if (!(flags & ENQUEUE_NOCLOCK))
> @@ -2100,6 +2115,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>
> if (sched_core_enabled(rq))
> sched_core_enqueue(rq, p);
> +
> + update_rq_avg_idle(rq);
put_prev_task_idle() would be a better place to call
update_rq_avg_idle() because this is when we leave idle.
> }
>
> /*
> @@ -3645,18 +3662,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> p->sched_class->task_woken(rq, p);
> rq_repin_lock(rq, rf);
> }
> -
> - if (rq->idle_stamp) {
> - u64 delta = rq_clock(rq) - rq->idle_stamp;
> - u64 max = 2*rq->max_idle_balance_cost;
> -
> - update_avg(&rq->avg_idle, delta);
> -
> - if (rq->avg_idle > max)
> - rq->avg_idle = max;
> -
> - rq->idle_stamp = 0;
> - }
> }
>
> /*
> --
> 2.40.1
>
On 13/12/2025 09:36, Vincent Guittot wrote:
> put_prev_task_idle() would be a better place to call
> update_rq_avg_idle() because this is when we leave idle.

update_rq_avg_idle() is not only called by the current CPU, but also
by other CPUs. For example, in try_to_wake_up(), update_rq_avg_idle()
is called by other CPUs. So enqueue_task() is a good place.

Thanks
Huang Shijie
On Tue, 16 Dec 2025 at 07:22, Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:
>
> On 13/12/2025 09:36, Vincent Guittot wrote:
> > put_prev_task_idle() would be a better place to call
> > update_rq_avg_idle() because this is when we leave idle.
>
> The update_rq_avg_idle() is not only called by current CPU, but also
> called by other CPUs. For example, the try_to_wake_up(),
> update_rq_avg_idle() is called by the other CPUs. So enqueue_task()
> is a good place.

But put_prev_task_idle() is called by the local CPU whenever it leaves
idle, so instead of trying to catch all the places that could make the
CPU leave idle, it is better to use this single place. And as you
mentioned, put_prev_task_idle() is only called by the local CPU whereas
enqueue_task() can be called by all CPUs, creating useless pressure on
the variable.

So I disagree when you say enqueue_task() is a "good place".

Thanks

> Thanks
> Huang Shijie
On 16/12/2025 15:17, Vincent Guittot wrote:
> On Tue, 16 Dec 2025 at 07:22, Shijie Huang
> <shijie@amperemail.onmicrosoft.com> wrote:
>>
>> On 13/12/2025 09:36, Vincent Guittot wrote:
>>> put_prev_task_idle() would be a better place to call
>>> update_rq_avg_idle() because this is when we leave idle.
>> The update_rq_avg_idle() is not only called by current CPU, but also
>> called by other CPUs. For example, the try_to_wake_up(),
>> update_rq_avg_idle() is called by the other CPUs. So enqueue_task()
>> is a good place.
> But put_prev_task_idle() is called by local CPU whenever it leaves
> idle so instead of trying to catch all places that could make the CPU
> leave idle it's better to use this single place.
> And as you mentioned, put_prev_task_idle is only called by local CPU
> whereas enqueue_task can be called by all CPUs creating useless
> pressure in the variable.

rq->idle_stamp is set in sched_balance_newidle(), and right now we
would call update_rq_avg_idle() in put_prev_task_idle(). How can we
update rq->avg_idle?

Thanks
Huang Shijie
On Tue, 16 Dec 2025 at 08:39, Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:
>
> On 16/12/2025 15:17, Vincent Guittot wrote:
> > On Tue, 16 Dec 2025 at 07:22, Shijie Huang
> > <shijie@amperemail.onmicrosoft.com> wrote:

[...]

> > But put_prev_task_idle() is called by local CPU whenever it leaves
> > idle so instead of trying to catch all places that could make the CPU
> > leave idle it's better to use this single place.
> > And as you mentioned, put_prev_task_idle is only called by local CPU
> > whereas enqueue_task can be called by all CPUs creating useless
> > pressure in the variable.
>
> The rq->idle_stamp is set at sched_balance_newidle(). then we call
> update_rq_avg_idle() in put_prev_task_idle() right now. How can we
> update the rq->avg_idle?

I'm not sure I understand your point.

rq->avg_idle tracks idle time. The easiest way would be to use
- set_next_task_idle() when we enter idle
- put_prev_task_idle() when we exit idle

Except that sched_balance_newidle() can be long and that time should be
accounted as idle time too. So instead of using set_next_task_idle(),
we use sched_balance_newidle() to set rq->idle_stamp, which is okay
because sched_balance_newidle() is always called before going to idle.
On 16/12/2025 16:47, Vincent Guittot wrote:
> On Tue, 16 Dec 2025 at 08:39, Shijie Huang
> <shijie@amperemail.onmicrosoft.com> wrote:

[...]

> I'm not sure I understand your point.
>
> rq->avg_idle tracks idle time. The easiest way would be to use
> - set_next_task_idle() when we enter idle
> - put_prev_task_idle() when we exit idle
>
> Except that sched_balance_newidle() can be long and the time should be
> accounted as idle time too. So instead of using set_next_task_idle(),
> we use sched_balance_newidle() to set rq->idle_stamp. Which is okay
> because sched_balance_newidle() is always called before going to idle.

Thanks for the explanations. It seems that put_prev_task_idle() is
really a better place to call update_rq_avg_idle().

Let me think about it for a while :)

Thanks
Huang Shijie
On 09.12.25 10:45, Huang Shijie wrote:
> In the newidle balance, the rq->idle_stamp may set to a non-zero value
> if it cannot pull any task.
>
> In the wakeup, it will detect the rq->idle_stamp, and updates
> the rq->avg_idle, then ends the CPU idle status by setting rq->idle_stamp
> to zero.
>
> Besides the wakeup, current code does not end the CPU idle status
> when a task is moved to the idle CPU, such as fork/clone, execve,
> or other cases. In order to get more accurate rq->avg_idle,
> we need to update it at more places(not only the wakeup).
>
> This patch introduces a helper: update_rq_avg_idle().
> And uses it in enqueue_task(), so it will update the rq->avg_idle
> when a task is moved to an idle CPU at:
> -- wakeup
> -- fork/clone
> -- execve
> -- idle balance
> -- other cases

[...]

In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to
enqueue_task() to possibly handle delayed tasks.

In v3 you figured there can't be any delayed task on a CPU when it sets
rq->idle_stamp in sched_balance_newidle().

So you could move (1) back to (2), avoiding the 'if rq->idle_stamp' for
the sched_change pattern for instance?

I tried to understand whether this patch could help with one issue we
currently have with the 'DCPerf Mediawiki' benchmark, where on 2
comparable servers one has a 10% lower CPU utilization. I traced it
down to significantly different behaviour in sched_balance_newidle(),
and further down to:

    if (!get_rd_overloaded(this_rq->rd) ||
        this_rq->avg_idle < sd->max_newidle_lb_cost)

where the one with the lower CPU utilization bails out way less often
(because this_rq->avg_idle is very high while the system is
overloaded). But I have failed so far.

Anyway, we'll use this patch for another test run right now.
On 12/12/2025 00:15, Dietmar Eggemann wrote:
> In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to
> enqueue_task() to possibly handle delayed tasks.

Not for the delayed tasks, just for a more common place to handle more
cases.

> In v3 you figured there can't be any delayed task on a CPU when it sets
> rq->idle_stamp in sched_balance_newidle()

Yes.

> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
> the sched_change pattern for instance?

Could you please tell me what "avoiding the 'if rq->idle_stamp' for
the sched_change pattern" means? Sorry, I do not understand your
meaning.

IMHO, there is no need to move (1) back to (2).

Thanks
Huang Shijie
On 12.12.25 04:16, Shijie Huang wrote:
>
> On 12/12/2025 00:15, Dietmar Eggemann wrote:
>> In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to
>> enqueue_task() to possibly handle delayed tasks.
> Not for the delayed tasks, just for a more common place to handle more
> cases.
>> In v3 you figured there can't be any delayed task on a CPU when it sets
>> rq->idle_stamp in sched_balance_newidle()
> Yes.
>> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
>> the sched_change pattern for instance?
> Could you please tell me what is "avoiding the 'if rq->idle_stamp' for
> the sched_change pattern" ?
>
> Sorry, I do not understand your meaning.
sched_change uses dequeue_task()/enqueue_task() for a queued task to
change prio, policy, sched params, taskgroups, etc.
kernel/sched/sched.h:
DEFINE_CLASS(sched_change, struct sched_change_ctx *,
sched_change_end(_T),
sched_change_begin(p, flags),
struct task_struct *p, unsigned int flags)
kernel/sched/core.c:
struct sched_change_ctx *sched_change_begin(struct task_struct *p,
unsigned int flags)
void sched_change_end(struct sched_change_ctx *ctx)
> IMHO, there is no need to move (1) back to (2).
Btw, I'm using a BUG_ON(this_rq->idle_stamp) in sched_balance_newidle()
with update_rq_avg_idle() in activate_task() for testing.
On 12/12/2025 22:22, Dietmar Eggemann wrote:
>>> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
>>> the sched_change pattern for instance?
>> Could you please tell me what is "avoiding the 'if rq->idle_stamp' for
>> the sched_change pattern" ?
>>
>> Sorry, I do not understand your meaning.
> sched_change uses dequeue_task()/enqueue_task() for a queued task to
> change prio, policy, sched params, taskgroups, etc.

For sched_change, dequeue_task()/enqueue_task() only work when the
queued task has the TASK_ON_RQ_QUEUED flag. TASK_ON_RQ_QUEUED is set
in activate_task().

1.) For this active task, if sched_change makes it go through
dequeue_task()/enqueue_task() on the current CPU, it is okay. Since
the current CPU is not in newidle balance, rq->idle_stamp is 0 in this
case. This patch works fine.

2.) For this active task, if sched_change makes it go through
dequeue_task()/enqueue_task() on another CPU, it is okay too.

    2.1) If the other CPU's idle_stamp is 0, that CPU is busy now.
    sched_change works fine with this patch.

    2.2) If the other CPU's idle_stamp is not 0, sched_change also
    works fine with this patch, since it is breaking the idle state of
    that CPU by moving an active task to an idle CPU. It makes sense.

Thanks
Huang Shijie
On 15.12.25 10:35, Shijie Huang wrote:
>
> On 12/12/2025 22:22, Dietmar Eggemann wrote:
>>>> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
>>>> the sched_change pattern for instance?
>>> Could you please tell me what is "avoiding the 'if rq->idle_stamp' for
>>> the sched_change pattern" ?
>>>
>>> Sorry, I do not understand your meaning.
>> sched_change uses dequeue_task()/enqueue_task() for a queued task to
>> change prio, policy, sched params, taskgroups, etc.
>
> For sched_change, the dequeue_task()/enqueue_task() only work when
> the queued task has TASK_ON_RQ_QUEUED flags. The TASK_ON_RQ_QUEUED
> is set in activate_task().

I guess this was a misunderstanding. It works because of
'if (rq->idle_stamp)' and setting 'rq->idle_stamp = 0' within the
condition, but this condition isn't worth checking in certain places
where we actually call enqueue_task().

> 1.) For this active task, if the sched_change makes it
> dequeue_task()/enqueue_task() on current CPU, it's okay. Since
> current CPU is not in the newidle, the "rq->idle_stamp" is 0 at this
> case. This patch works fine.
>
> 2.) For this active task, if the sched_change makes it
> dequeue_task()/enqueue_task() on an another CPU, it's okay too.
>
> 2.1) If the another CPU's idle_stamp is 0, the another CPU is busy
> now. The sched_change works fine with this patch.
>
> 2.2) If the another CPU's idle_stamp is not 0, the sched_change also
> works fine with this patch. Since the sched_change is breaking the
> idle state of the another CPU by moving an active task to an idle
> CPU. It makes sense.

Not sure about this. I thought so far that the sched_change pattern is
doing a task dequeue + enqueue on the same CPU (this CPU or another)?
So you can't come out of idle here. We lock the rq before we call
scoped_guard (sched_change, ...).

I think Vincent is right in saying that update_rq_avg_idle() should be
put into put_prev_task_idle() instead.

Still waiting for the DCPerf Mediawiki test results to see if this
change fixes my 'rq->avg_idle being too big' issue.
On 17.12.25 17:15, Dietmar Eggemann wrote:
> On 15.12.25 10:35, Shijie Huang wrote:
>>
>> On 12/12/2025 22:22, Dietmar Eggemann wrote:

[...]

> I think Vincent is right in saying that update_rq_avg_idle() should be
> put into put_prev_task_idle() instead.
>
> Still waiting for the DCPerf Mediawiki test results to see if this
> change fixes my 'rq->avg_idle being too big' issue.

Turns out the patch didn't fix this issue. Still seeing a huge number
of sched_balance_newidle() calls in which the system is (1) overloaded
and (2) this_rq->avg_idle >= sd->max_newidle_lb_cost, so that there is
no early bailout and no task is pulled at the end. Must be something
else ...
On Wed, 7 Jan 2026 at 16:48, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 17.12.25 17:15, Dietmar Eggemann wrote:
> > On 15.12.25 10:35, Shijie Huang wrote:
> >>
> >> On 12/12/2025 22:22, Dietmar Eggemann wrote:
>
> [...]
>
> > I think Vincent is right in saying that update_rq_avg_idle() should be
> > put into put_prev_task_idle() instead.
> >
> > Still waiting for the DCPerf Mediawiki test results to see if this
> > change fixes my 'rq->avg_idle being too big' issue.
>
> Turns out the patch didn't fix this issue. Still seeing a huge number of
> sched_balance_newidle() calls in which the system is (1) overloaded and
> (2) this_rq->avg_idle >= sd->max_newidle_lb_cost so that there is no
> early bailout and no task is pulled at the end. Must be something else ...

Do you mean the v6 or v7 version?

sched_balance_newidle() will be called every time there is no CFS or
higher priority task to pick next, but we will not loop over the sched
domains and will bail out early. So having a huge number of calls to
sched_balance_newidle() is normal, but if you don't bail out early and
loop over the sched domains then it's a problem.
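The early-bailout condition Dietmar quotes can be modeled in isolation. The sketch below is a user-space illustration only; `newidle_should_bail()` and its boolean/integer inputs are stand-ins for kernel state (`get_rd_overloaded(this_rq->rd)`, `this_rq->avg_idle`, `sd->max_newidle_lb_cost`), not the real functions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the sched_balance_newidle() early-bailout check quoted in
 * the thread:
 *
 *     if (!get_rd_overloaded(this_rq->rd) ||
 *         this_rq->avg_idle < sd->max_newidle_lb_cost)
 *
 * Returns true when newidle balancing should give up before walking
 * the sched domains: either no runqueue in the root domain is
 * overloaded (nothing to pull), or the CPU is expected to stay idle
 * for less time than a balance attempt would cost.
 */
static bool newidle_should_bail(bool rd_overloaded, uint64_t avg_idle,
				uint64_t max_newidle_lb_cost)
{
	return !rd_overloaded || avg_idle < max_newidle_lb_cost;
}
```

This makes the reported symptom concrete: with the system overloaded and avg_idle very large, neither disjunct holds, so the (often fruitless) domain walk runs on every call.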