[PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU

Huang Shijie posted 2 patches 1 week ago
[PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Huang Shijie 1 week ago
In the newidle balance, the rq->idle_stamp may set to a non-zero value
if it cannot pull any task.

In the wakeup, it will detect the rq->idle_stamp, and updates
the rq->avg_idle, then ends the CPU idle status by setting rq->idle_stamp
to zero.

Besides the wakeup, current code does not end the CPU idle status
when a task is moved to the idle CPU, such as fork/clone, execve,
or other cases. In order to get more accurate rq->avg_idle,
we need to update it at more places(not only the wakeup).

This patch introduces a helper: update_rq_avg_idle().
And uses it in enqueue_task(), so it will update the rq->avg_idle
when a task is moved to an idle CPU at:
   -- wakeup
   -- fork/clone
   -- execve
   -- idle balance
   -- other cases

Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
 kernel/sched/core.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f10cfbdc228..2e3c4043de51 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2078,6 +2078,21 @@ unsigned long get_wchan(struct task_struct *p)
 	return ip;
 }
 
+static void update_rq_avg_idle(struct rq *rq)
+{
+	if (rq->idle_stamp) {
+		u64 delta = rq_clock(rq) - rq->idle_stamp;
+		u64 max = 2*rq->max_idle_balance_cost;
+
+		update_avg(&rq->avg_idle, delta);
+
+		if (rq->avg_idle > max)
+			rq->avg_idle = max;
+
+		rq->idle_stamp = 0;
+	}
+}
+
 void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (!(flags & ENQUEUE_NOCLOCK))
@@ -2100,6 +2115,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 
 	if (sched_core_enabled(rq))
 		sched_core_enqueue(rq, p);
+
+	update_rq_avg_idle(rq);
 }
 
 /*
@@ -3645,18 +3662,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		p->sched_class->task_woken(rq, p);
 		rq_repin_lock(rq, rf);
 	}
-
-	if (rq->idle_stamp) {
-		u64 delta = rq_clock(rq) - rq->idle_stamp;
-		u64 max = 2*rq->max_idle_balance_cost;
-
-		update_avg(&rq->avg_idle, delta);
-
-		if (rq->avg_idle > max)
-			rq->avg_idle = max;
-
-		rq->idle_stamp = 0;
-	}
 }
 
 /*
-- 
2.40.1
Re: [PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Vincent Guittot 3 days, 19 hours ago
On Tue, 9 Dec 2025 at 10:46, Huang Shijie <shijie@os.amperecomputing.com> wrote:
>
> In the newidle balance, the rq->idle_stamp may set to a non-zero value
> if it cannot pull any task.
>
> In the wakeup, it will detect the rq->idle_stamp, and updates
> the rq->avg_idle, then ends the CPU idle status by setting rq->idle_stamp
> to zero.
>
> Besides the wakeup, current code does not end the CPU idle status
> when a task is moved to the idle CPU, such as fork/clone, execve,
> or other cases. In order to get more accurate rq->avg_idle,
> we need to update it at more places(not only the wakeup).
>
> This patch introduces a helper: update_rq_avg_idle().
> And uses it in enqueue_task(), so it will update the rq->avg_idle
> when a task is moved to an idle CPU at:
>    -- wakeup
>    -- fork/clone
>    -- execve
>    -- idle balance
>    -- other cases
>
> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
>  kernel/sched/core.c | 29 +++++++++++++++++------------
>  1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9f10cfbdc228..2e3c4043de51 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2078,6 +2078,21 @@ unsigned long get_wchan(struct task_struct *p)
>         return ip;
>  }
>
> +static void update_rq_avg_idle(struct rq *rq)
> +{
> +       if (rq->idle_stamp) {
> +               u64 delta = rq_clock(rq) - rq->idle_stamp;
> +               u64 max = 2*rq->max_idle_balance_cost;
> +
> +               update_avg(&rq->avg_idle, delta);
> +
> +               if (rq->avg_idle > max)
> +                       rq->avg_idle = max;
> +
> +               rq->idle_stamp = 0;
> +       }
> +}
> +
>  void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>  {
>         if (!(flags & ENQUEUE_NOCLOCK))
> @@ -2100,6 +2115,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>
>         if (sched_core_enabled(rq))
>                 sched_core_enqueue(rq, p);
> +
> +       update_rq_avg_idle(rq);

put_prev_task_idle() would be a better place to call
update_rq_avg_idle() because this is when we leave idle.

>  }
>
>  /*
> @@ -3645,18 +3662,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>                 p->sched_class->task_woken(rq, p);
>                 rq_repin_lock(rq, rf);
>         }
> -
> -       if (rq->idle_stamp) {
> -               u64 delta = rq_clock(rq) - rq->idle_stamp;
> -               u64 max = 2*rq->max_idle_balance_cost;
> -
> -               update_avg(&rq->avg_idle, delta);
> -
> -               if (rq->avg_idle > max)
> -                       rq->avg_idle = max;
> -
> -               rq->idle_stamp = 0;
> -       }
>  }
>
>  /*
> --
> 2.40.1
>
Re: [PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Shijie Huang 14 hours ago
On 13/12/2025 09:36, Vincent Guittot wrote:
> put_prev_task_idle() would be a better place to call
> update_rq_avg_idle() because this is when we leave idle.

The update_rq_avg_idle() is not only called by current CPU, but also 
called by

other CPUs. For example, the try_to_wake_up(), update_rq_avg_idle() is 
called by

the other CPUs. So enqueue_task() is a good place.


Thanks

Huang Shijie
Re: [PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Vincent Guittot 13 hours ago
On Tue, 16 Dec 2025 at 07:22, Shijie Huang
<shijie@amperemail.onmicrosoft.com> wrote:
>
>
> On 13/12/2025 09:36, Vincent Guittot wrote:
> > put_prev_task_idle() would be a better place to call
> > update_rq_avg_idle() because this is when we leave idle.
>
> The update_rq_avg_idle() is not only called by current CPU, but also
> called by
>
> other CPUs. For example, the try_to_wake_up(), update_rq_avg_idle() is
> called by
>
> the other CPUs. So enqueue_task() is a good place.

But put_prev_task_idle() is called by local CPU whenever it leaves
idle so instead of trying to catch all places that could make the CPU
leave idle it's better to use this single place.
And as you mentioned, put_prev_task_idle is only called by local CPU
whereas enqueue_task can be called by all CPUs creating useless
pressure in the variable.

So I disagree when you say enqueue_task() is a "good place"

Thanks

>
>
> Thanks
>
> Huang Shijie
>
Re: [PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Shijie Huang 13 hours ago
On 16/12/2025 15:17, Vincent Guittot wrote:
> On Tue, 16 Dec 2025 at 07:22, Shijie Huang
> <shijie@amperemail.onmicrosoft.com> wrote:
>>
>> On 13/12/2025 09:36, Vincent Guittot wrote:
>>> put_prev_task_idle() would be a better place to call
>>> update_rq_avg_idle() because this is when we leave idle.
>> The update_rq_avg_idle() is not only called by current CPU, but also
>> called by
>>
>> other CPUs. For example, the try_to_wake_up(), update_rq_avg_idle() is
>> called by
>>
>> the other CPUs. So enqueue_task() is a good place.
> But put_prev_task_idle() is called by local CPU whenever it leaves
> idle so instead of trying to catch all places that could make the CPU
> leave idle it's better to use this single place.
> And as you mentioned, put_prev_task_idle is only called by local CPU
> whereas enqueue_task can be called by all CPUs creating useless
> pressure in the variable.

The rq->idle_stamp is set at sched_balance_newidle().  then we call 
update_rq_avg_idle()

in put_prev_task_idle() right now. How can we update the rq->avg_idle?

Thanks

Huang Shijie

>>   
Re: [PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Vincent Guittot 12 hours ago
On Tue, 16 Dec 2025 at 08:39, Shijie Huang
<shijie@amperemail.onmicrosoft.com> wrote:
>
>
> On 16/12/2025 15:17, Vincent Guittot wrote:
> > On Tue, 16 Dec 2025 at 07:22, Shijie Huang
> > <shijie@amperemail.onmicrosoft.com> wrote:
> >>
> >> On 13/12/2025 09:36, Vincent Guittot wrote:
> >>> put_prev_task_idle() would be a better place to call
> >>> update_rq_avg_idle() because this is when we leave idle.
> >> The update_rq_avg_idle() is not only called by current CPU, but also
> >> called by
> >>
> >> other CPUs. For example, the try_to_wake_up(), update_rq_avg_idle() is
> >> called by
> >>
> >> the other CPUs. So enqueue_task() is a good place.
> > But put_prev_task_idle() is called by local CPU whenever it leaves
> > idle so instead of trying to catch all places that could make the CPU
> > leave idle it's better to use this single place.
> > And as you mentioned, put_prev_task_idle is only called by local CPU
> > whereas enqueue_task can be called by all CPUs creating useless
> > pressure in the variable.
>
> The rq->idle_stamp is set at sched_balance_newidle().  then we call
> update_rq_avg_idle()
>
> in put_prev_task_idle() right now. How can we update the rq->avg_idle?

I'm not sure I understand your point.

rq->avg_idle tracks idle time. The easiest way would be to use
- set_next_task_idle() when we enter idle
- put_prev_task_idle() when we exit idle

Except that sched_balance_newidle() can be long and the time should be
accounted as idle time too. So instead of using set_next_task_idle(),
we use sched_balance_newidle() to set . Which is okay because
sched_balance_newidle() is always called before going to idle.

>
> Thanks
>
> Huang Shijie
>
> >>
Re: [PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Shijie Huang 11 hours ago
On 16/12/2025 16:47, Vincent Guittot wrote:
> On Tue, 16 Dec 2025 at 08:39, Shijie Huang
> <shijie@amperemail.onmicrosoft.com> wrote:
>>
>> On 16/12/2025 15:17, Vincent Guittot wrote:
>>> On Tue, 16 Dec 2025 at 07:22, Shijie Huang
>>> <shijie@amperemail.onmicrosoft.com> wrote:
>>>> On 13/12/2025 09:36, Vincent Guittot wrote:
>>>>> put_prev_task_idle() would be a better place to call
>>>>> update_rq_avg_idle() because this is when we leave idle.
>>>> The update_rq_avg_idle() is not only called by current CPU, but also
>>>> called by
>>>>
>>>> other CPUs. For example, the try_to_wake_up(), update_rq_avg_idle() is
>>>> called by
>>>>
>>>> the other CPUs. So enqueue_task() is a good place.
>>> But put_prev_task_idle() is called by local CPU whenever it leaves
>>> idle so instead of trying to catch all places that could make the CPU
>>> leave idle it's better to use this single place.
>>> And as you mentioned, put_prev_task_idle is only called by local CPU
>>> whereas enqueue_task can be called by all CPUs creating useless
>>> pressure in the variable.
>> The rq->idle_stamp is set at sched_balance_newidle().  then we call
>> update_rq_avg_idle()
>>
>> in put_prev_task_idle() right now. How can we update the rq->avg_idle?
> I'm not sure I understand your point.
>
> rq->avg_idle tracks idle time. The easiest way would be to use
> - set_next_task_idle() when we enter idle
> - put_prev_task_idle() when we exit idle
>
> Except that sched_balance_newidle() can be long and the time should be
> accounted as idle time too. So instead of using set_next_task_idle(),
> we use sched_balance_newidle() to set . Which is okay because
> sched_balance_newidle() is always called before going to idle.

Thanks for the explanations.

It seems that put_prev_task_idle() is really a better place to call
update_rq_avg_idle(). Let me think it for a while :)


Thanks
Huang Shijie



>
Re: [PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Dietmar Eggemann 5 days, 4 hours ago
On 09.12.25 10:45, Huang Shijie wrote:
> In the newidle balance, the rq->idle_stamp may set to a non-zero value
> if it cannot pull any task.
> 
> In the wakeup, it will detect the rq->idle_stamp, and updates
> the rq->avg_idle, then ends the CPU idle status by setting rq->idle_stamp
> to zero.
> 
> Besides the wakeup, current code does not end the CPU idle status
> when a task is moved to the idle CPU, such as fork/clone, execve,
> or other cases. In order to get more accurate rq->avg_idle,
> we need to update it at more places(not only the wakeup).
> 
> This patch introduces a helper: update_rq_avg_idle().
> And uses it in enqueue_task(), so it will update the rq->avg_idle
> when a task is moved to an idle CPU at:
>    -- wakeup
>    -- fork/clone
>    -- execve
>    -- idle balance
>    -- other cases

[...]

In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to
enqueue_task() to possibly handle delayed tasks.
In v3 you figured there can't be any delayed task on a CPU when it sets
rq->idle_stamp in sched_balance_newidle()
So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
the sched_change pattern for instance?


I tried to understand whether this patch could help with one issue we
currently have with 'DCPerf Mediawiki' benchmark where on 2 comparable
servers, one has a 10% lower CPU utilization and I traced it down to
significantly different behaviour in sched_balance_newidle() and further
down to:

if (!get_rd_overloaded(this_rq->rd) || this_rq->avg_idle <
sd->max_newidle_lb_cost)

where the one with the lower CPU utilization bails out way less often
(because this_rq->avg_idle is very high, system is overloaded). But I
failed so far. Anyway, we'll use this patch for another test run right now.
Re: [PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Shijie Huang 4 days, 17 hours ago
On 12/12/2025 00:15, Dietmar Eggemann wrote:
> In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to
> enqueue_task() to possibly handle delayed tasks.
Not for the delayed tasks, just for a more common place to handle more 
cases.
> In v3 you figured there can't be any delayed task on a CPU when it sets
> rq->idle_stamp in sched_balance_newidle()
Yes.
> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
> the sched_change pattern for instance?
Could you please tell me what is "avoiding the 'if rq->idle_stamp' for 
the sched_change pattern" ?

Sorry, I do not understand your meaning.

IMHO, there is no need to move (1) back to (2).


Thanks

Huang Shijie
Re: [PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Dietmar Eggemann 4 days, 6 hours ago
On 12.12.25 04:16, Shijie Huang wrote:
> 
> On 12/12/2025 00:15, Dietmar Eggemann wrote:
>> In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to
>> enqueue_task() to possibly handle delayed tasks.
> Not for the delayed tasks, just for a more common place to handle more
> cases.
>> In v3 you figured there can't be any delayed task on a CPU when it sets
>> rq->idle_stamp in sched_balance_newidle()
> Yes.
>> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
>> the sched_change pattern for instance?
> Could you please tell me what is "avoiding the 'if rq->idle_stamp' for
> the sched_change pattern" ?
> 
> Sorry, I do not understand your meaning.

sched_change uses dequeue_task()/enqueue_task() for a queued task to
change prio, policy, sched params, taskgroups, etc.

kernel/sched/sched.h:

DEFINE_CLASS(sched_change, struct sched_change_ctx *,
             sched_change_end(_T),
             sched_change_begin(p, flags),
             struct task_struct *p, unsigned int flags)

kernel/sched/core.c:

struct sched_change_ctx *sched_change_begin(struct task_struct *p,
unsigned int flags)

void sched_change_end(struct sched_change_ctx *ctx)

> IMHO, there is no need to move (1) back to (2).

Btw, I'm using a BUG_ON(this_rq->idle_stamp) in sched_balance_newidle()
with update_rq_avg_idle() in activate_task() for testing.
Re: [PATCH v6 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Shijie Huang 1 day, 11 hours ago
On 12/12/2025 22:22, Dietmar Eggemann wrote:
>>> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
>>> the sched_change pattern for instance?
>> Could you please tell me what is "avoiding the 'if rq->idle_stamp' for
>> the sched_change pattern" ?
>>
>> Sorry, I do not understand your meaning.
> sched_change uses dequeue_task()/enqueue_task() for a queued task to
> change prio, policy, sched params, taskgroups, etc.

For sched_change, the dequeue_task()/enqueue_task() only work when

the queued task has TASK_ON_RQ_QUEUED flags. The TASK_ON_RQ_QUEUED

is set in activate_task().


   1.) For this active task, if the sched_change makes it 
dequeue_task()/enqueue_task() on

       current CPU, it's okay. Since current CPU is not in the newidle, 
the "rq->idle_stamp" is 0 at this case.

       This patch works fine.


  2.) For this active task, if the sched_change makes it 
dequeue_task()/enqueue_task() on an another CPU,

       it's okay too.

          2.1) If the another CPU's idle_stamp is 0, the another CPU is 
busy now.

                  The sched_change works fine with this patch.

          2.2) If the another CPU's idle_stamp is not 0, the 
sched_change also works fine with this patch.

                 Since the sched_change is breaking the idle state of 
the another CPU by moving an active

                 task to an idle CPU. It makes sense.


Thanks

Huang Shijie