In the newidle balance, rq->idle_stamp may be set to a non-zero value
if the balance cannot pull any task.

In the wakeup path, the code checks rq->idle_stamp, updates
rq->avg_idle, and then ends the CPU's idle status by setting
rq->idle_stamp to zero.

Apart from the wakeup path, the current code does not end the CPU's
idle status when a task is moved to an idle CPU, such as in fork/clone,
execve, or other cases. In order to get a more accurate rq->avg_idle,
we need to update it in more places (not only on wakeup).
This patch introduces a helper, update_rq_avg_idle(), and uses it in
enqueue_task(), so that rq->avg_idle is updated whenever a task is
moved to an idle CPU via:
-- wakeup
-- fork/clone
-- execve
-- idle balance
-- other cases
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
kernel/sched/core.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f10cfbdc228..2e3c4043de51 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2078,6 +2078,21 @@ unsigned long get_wchan(struct task_struct *p)
return ip;
}
+static void update_rq_avg_idle(struct rq *rq)
+{
+ if (rq->idle_stamp) {
+ u64 delta = rq_clock(rq) - rq->idle_stamp;
+ u64 max = 2*rq->max_idle_balance_cost;
+
+ update_avg(&rq->avg_idle, delta);
+
+ if (rq->avg_idle > max)
+ rq->avg_idle = max;
+
+ rq->idle_stamp = 0;
+ }
+}
+
void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -2100,6 +2115,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (sched_core_enabled(rq))
sched_core_enqueue(rq, p);
+
+ update_rq_avg_idle(rq);
}
/*
@@ -3645,18 +3662,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
p->sched_class->task_woken(rq, p);
rq_repin_lock(rq, rf);
}
-
- if (rq->idle_stamp) {
- u64 delta = rq_clock(rq) - rq->idle_stamp;
- u64 max = 2*rq->max_idle_balance_cost;
-
- update_avg(&rq->avg_idle, delta);
-
- if (rq->avg_idle > max)
- rq->avg_idle = max;
-
- rq->idle_stamp = 0;
- }
}
/*
--
2.40.1
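For reference, the arithmetic the helper performs can be exercised outside the kernel. The following is a minimal user-space sketch: `struct fake_rq` is a stand-in, not the real `struct rq`, and `update_avg()` is modeled here as the divisor-8 exponential moving average used in kernel/sched/core.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space model of the rq->avg_idle bookkeeping performed by the
 * update_rq_avg_idle() helper in the patch above. The struct is a
 * stand-in for illustration, not the kernel's struct rq.
 */
struct fake_rq {
	uint64_t idle_stamp;            /* 0 means "not idle" */
	uint64_t avg_idle;
	uint64_t max_idle_balance_cost;
};

/* Divisor-8 EWMA, as in kernel/sched/core.c's update_avg(). */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff / 8;
}

/* Mirrors the helper: update avg_idle, clamp it, end the idle period. */
static void update_rq_avg_idle(struct fake_rq *rq, uint64_t now)
{
	if (rq->idle_stamp) {
		uint64_t delta = now - rq->idle_stamp;
		uint64_t max = 2 * rq->max_idle_balance_cost;

		update_avg(&rq->avg_idle, delta);

		if (rq->avg_idle > max)
			rq->avg_idle = max;

		rq->idle_stamp = 0;
	}
}
```

The clamp to `2 * max_idle_balance_cost` keeps one very long idle period from inflating avg_idle so much that newidle balancing never bails out early afterwards.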
On Tue, 9 Dec 2025 at 10:46, Huang Shijie <shijie@os.amperecomputing.com> wrote:
>
> In the newidle balance, the rq->idle_stamp may set to a non-zero value
> if it cannot pull any task.
>
> In the wakeup, it will detect the rq->idle_stamp, and updates
> the rq->avg_idle, then ends the CPU idle status by setting rq->idle_stamp
> to zero.
>
> Besides the wakeup, current code does not end the CPU idle status
> when a task is moved to the idle CPU, such as fork/clone, execve,
> or other cases. In order to get more accurate rq->avg_idle,
> we need to update it at more places(not only the wakeup).
>
> This patch introduces a helper: update_rq_avg_idle().
> And uses it in enqueue_task(), so it will update the rq->avg_idle
> when a task is moved to an idle CPU at:
> -- wakeup
> -- fork/clone
> -- execve
> -- idle balance
> -- other cases
>
> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
> kernel/sched/core.c | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9f10cfbdc228..2e3c4043de51 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2078,6 +2078,21 @@ unsigned long get_wchan(struct task_struct *p)
> return ip;
> }
>
> +static void update_rq_avg_idle(struct rq *rq)
> +{
> + if (rq->idle_stamp) {
> + u64 delta = rq_clock(rq) - rq->idle_stamp;
> + u64 max = 2*rq->max_idle_balance_cost;
> +
> + update_avg(&rq->avg_idle, delta);
> +
> + if (rq->avg_idle > max)
> + rq->avg_idle = max;
> +
> + rq->idle_stamp = 0;
> + }
> +}
> +
> void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> if (!(flags & ENQUEUE_NOCLOCK))
> @@ -2100,6 +2115,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>
> if (sched_core_enabled(rq))
> sched_core_enqueue(rq, p);
> +
> + update_rq_avg_idle(rq);
put_prev_task_idle() would be a better place to call
update_rq_avg_idle() because this is when we leave idle.
> }
>
> /*
> @@ -3645,18 +3662,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> p->sched_class->task_woken(rq, p);
> rq_repin_lock(rq, rf);
> }
> -
> - if (rq->idle_stamp) {
> - u64 delta = rq_clock(rq) - rq->idle_stamp;
> - u64 max = 2*rq->max_idle_balance_cost;
> -
> - update_avg(&rq->avg_idle, delta);
> -
> - if (rq->avg_idle > max)
> - rq->avg_idle = max;
> -
> - rq->idle_stamp = 0;
> - }
> }
>
> /*
> --
> 2.40.1
>
On 13/12/2025 09:36, Vincent Guittot wrote:
> put_prev_task_idle() would be a better place to call
> update_rq_avg_idle() because this is when we leave idle.

update_rq_avg_idle() is not only called by the current CPU, but also
by other CPUs. For example, in try_to_wake_up(), update_rq_avg_idle()
is called by other CPUs. So enqueue_task() is a good place.

Thanks
Huang Shijie
On Tue, 16 Dec 2025 at 07:22, Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:
>
> On 13/12/2025 09:36, Vincent Guittot wrote:
> > put_prev_task_idle() would be a better place to call
> > update_rq_avg_idle() because this is when we leave idle.
>
> The update_rq_avg_idle() is not only called by current CPU, but also
> called by other CPUs. For example, the try_to_wake_up(),
> update_rq_avg_idle() is called by the other CPUs. So enqueue_task()
> is a good place.

But put_prev_task_idle() is called by the local CPU whenever it leaves
idle, so instead of trying to catch all the places that could make the
CPU leave idle, it is better to use this single place. And as you
mentioned, put_prev_task_idle() is only called by the local CPU whereas
enqueue_task() can be called by all CPUs, creating useless pressure on
the variable.

So I disagree when you say enqueue_task() is a "good place".

Thanks

> Thanks
> Huang Shijie
On 16/12/2025 15:17, Vincent Guittot wrote:
> On Tue, 16 Dec 2025 at 07:22, Shijie Huang
> <shijie@amperemail.onmicrosoft.com> wrote:
>>
>> On 13/12/2025 09:36, Vincent Guittot wrote:
>>> put_prev_task_idle() would be a better place to call
>>> update_rq_avg_idle() because this is when we leave idle.
>> The update_rq_avg_idle() is not only called by current CPU, but also
>> called by other CPUs. For example, the try_to_wake_up(),
>> update_rq_avg_idle() is called by the other CPUs. So enqueue_task()
>> is a good place.
> But put_prev_task_idle() is called by local CPU whenever it leaves
> idle so instead of trying to catch all places that could make the CPU
> leave idle it's better to use this single place.
> And as you mentioned, put_prev_task_idle is only called by local CPU
> whereas enqueue_task can be called by all CPUs creating useless
> pressure in the variable.

rq->idle_stamp is set in sched_balance_newidle(), and right now we
would call update_rq_avg_idle() in put_prev_task_idle(). How can we
update rq->avg_idle?

Thanks
Huang Shijie
On Tue, 16 Dec 2025 at 08:39, Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:
>
> On 16/12/2025 15:17, Vincent Guittot wrote:
> > On Tue, 16 Dec 2025 at 07:22, Shijie Huang
> > <shijie@amperemail.onmicrosoft.com> wrote:

[...]

> > But put_prev_task_idle() is called by local CPU whenever it leaves
> > idle so instead of trying to catch all places that could make the CPU
> > leave idle it's better to use this single place.
> > And as you mentioned, put_prev_task_idle is only called by local CPU
> > whereas enqueue_task can be called by all CPUs creating useless
> > pressure in the variable.
>
> The rq->idle_stamp is set at sched_balance_newidle(). then we call
> update_rq_avg_idle() in put_prev_task_idle() right now. How can we
> update the rq->avg_idle?

I'm not sure I understand your point.

rq->avg_idle tracks idle time. The easiest way would be to use
- set_next_task_idle() when we enter idle
- put_prev_task_idle() when we exit idle

Except that sched_balance_newidle() can be long and that time should be
accounted as idle time too. So instead of using set_next_task_idle(),
we use sched_balance_newidle() to set rq->idle_stamp, which is okay
because sched_balance_newidle() is always called before going to idle.
On 16/12/2025 16:47, Vincent Guittot wrote:
> On Tue, 16 Dec 2025 at 08:39, Shijie Huang
> <shijie@amperemail.onmicrosoft.com> wrote:

[...]

> I'm not sure I understand your point.
>
> rq->avg_idle tracks idle time. The easiest way would be to use
> - set_next_task_idle() when we enter idle
> - put_prev_task_idle() when we exit idle
>
> Except that sched_balance_newidle() can be long and the time should be
> accounted as idle time too. So instead of using set_next_task_idle(),
> we use sched_balance_newidle() to set rq->idle_stamp. Which is okay
> because sched_balance_newidle() is always called before going to idle.

Thanks for the explanations. It seems that put_prev_task_idle() is
really a better place to call update_rq_avg_idle().

Let me think about it for a while :)

Thanks
Huang Shijie
On 09.12.25 10:45, Huang Shijie wrote:
> In the newidle balance, the rq->idle_stamp may set to a non-zero value
> if it cannot pull any task.
>
> In the wakeup, it will detect the rq->idle_stamp, and updates
> the rq->avg_idle, then ends the CPU idle status by setting rq->idle_stamp
> to zero.
>
> Besides the wakeup, current code does not end the CPU idle status
> when a task is moved to the idle CPU, such as fork/clone, execve,
> or other cases. In order to get more accurate rq->avg_idle,
> we need to update it at more places(not only the wakeup).
>
> This patch introduces a helper: update_rq_avg_idle().
> And uses it in enqueue_task(), so it will update the rq->avg_idle
> when a task is moved to an idle CPU at:
> -- wakeup
> -- fork/clone
> -- execve
> -- idle balance
> -- other cases

[...]

In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to
enqueue_task() to possibly handle delayed tasks.

In v3 you figured there can't be any delayed task on a CPU when it sets
rq->idle_stamp in sched_balance_newidle().

So you could move (1) back to (2), avoiding the 'if rq->idle_stamp' for
the sched_change pattern for instance?

I tried to understand whether this patch could help with one issue we
currently have with the 'DCPerf Mediawiki' benchmark, where on 2
comparable servers one has a 10% lower CPU utilization. I traced it
down to significantly different behaviour in sched_balance_newidle(),
and further down to:

    if (!get_rd_overloaded(this_rq->rd) ||
        this_rq->avg_idle < sd->max_newidle_lb_cost)

where the one with the lower CPU utilization bails out way less often
(because this_rq->avg_idle is very high while the system is
overloaded). But I have failed so far.

Anyway, we'll use this patch for another test run right now.
On 12/12/2025 00:15, Dietmar Eggemann wrote:
> In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to
> enqueue_task() to possibly handle delayed tasks.

Not for the delayed tasks, just for a more common place to handle more
cases.

> In v3 you figured there can't be any delayed task on a CPU when it sets
> rq->idle_stamp in sched_balance_newidle()

Yes.

> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
> the sched_change pattern for instance?

Could you please tell me what "avoiding the 'if rq->idle_stamp' for
the sched_change pattern" means? Sorry, I do not understand your
meaning.

IMHO, there is no need to move (1) back to (2).

Thanks
Huang Shijie
On 12.12.25 04:16, Shijie Huang wrote:
>
> On 12/12/2025 00:15, Dietmar Eggemann wrote:
>> In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to
>> enqueue_task() to possibly handle delayed tasks.
> Not for the delayed tasks, just for a more common place to handle more
> cases.
>> In v3 you figured there can't be any delayed task on a CPU when it sets
>> rq->idle_stamp in sched_balance_newidle()
> Yes.
>> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
>> the sched_change pattern for instance?
> Could you please tell me what is "avoiding the 'if rq->idle_stamp' for
> the sched_change pattern" ?
>
> Sorry, I do not understand your meaning.
sched_change uses dequeue_task()/enqueue_task() for a queued task to
change prio, policy, sched params, taskgroups, etc.
kernel/sched/sched.h:
DEFINE_CLASS(sched_change, struct sched_change_ctx *,
sched_change_end(_T),
sched_change_begin(p, flags),
struct task_struct *p, unsigned int flags)
kernel/sched/core.c:
struct sched_change_ctx *sched_change_begin(struct task_struct *p,
unsigned int flags)
void sched_change_end(struct sched_change_ctx *ctx)
> IMHO, there is no need to move (1) back to (2).
Btw, I'm using a BUG_ON(this_rq->idle_stamp) in sched_balance_newidle()
with update_rq_avg_idle() in activate_task() for testing.
On 12/12/2025 22:22, Dietmar Eggemann wrote:
>>> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
>>> the sched_change pattern for instance?
>> Could you please tell me what is "avoiding the 'if rq->idle_stamp' for
>> the sched_change pattern" ?
>>
>> Sorry, I do not understand your meaning.
> sched_change uses dequeue_task()/enqueue_task() for a queued task to
> change prio, policy, sched params, taskgroups, etc.

For sched_change, dequeue_task()/enqueue_task() only work when the
queued task has the TASK_ON_RQ_QUEUED flag. TASK_ON_RQ_QUEUED is set
in activate_task().

1.) For this active task, if sched_change makes it go through
dequeue_task()/enqueue_task() on the current CPU, it is okay. Since
the current CPU is not in newidle balance, rq->idle_stamp is 0 in this
case. This patch works fine.

2.) For this active task, if sched_change makes it go through
dequeue_task()/enqueue_task() on another CPU, it is okay too.

    2.1) If the other CPU's idle_stamp is 0, that CPU is busy now.
    sched_change works fine with this patch.

    2.2) If the other CPU's idle_stamp is not 0, sched_change also
    works fine with this patch, since it is breaking the idle state of
    that CPU by moving an active task to an idle CPU. It makes sense.

Thanks
Huang Shijie
On 15.12.25 10:35, Shijie Huang wrote:
>
> On 12/12/2025 22:22, Dietmar Eggemann wrote:
>>>> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
>>>> the sched_change pattern for instance?
>>> Could you please tell me what is "avoiding the 'if rq->idle_stamp' for
>>> the sched_change pattern" ?
>>>
>>> Sorry, I do not understand your meaning.
>> sched_change uses dequeue_task()/enqueue_task() for a queued task to
>> change prio, policy, sched params, taskgroups, etc.
>
> For sched_change, the dequeue_task()/enqueue_task() only work when
> the queued task has TASK_ON_RQ_QUEUED flags. The TASK_ON_RQ_QUEUED
> is set in activate_task().

I guess this was a misunderstanding. It works because of
'if (rq->idle_stamp)' and setting 'rq->idle_stamp = 0' within the
condition, but this condition isn't worth checking in certain places
where we actually call enqueue_task().

> 1.) For this active task, if the sched_change makes it
> dequeue_task()/enqueue_task() on current CPU, it's okay. Since
> current CPU is not in the newidle, the "rq->idle_stamp" is 0 at this
> case. This patch works fine.
>
> 2.) For this active task, if the sched_change makes it
> dequeue_task()/enqueue_task() on an another CPU, it's okay too.
>
> 2.1) If the another CPU's idle_stamp is 0, the another CPU is busy
> now. The sched_change works fine with this patch.
>
> 2.2) If the another CPU's idle_stamp is not 0, the sched_change also
> works fine with this patch. Since the sched_change is breaking the
> idle state of the another CPU by moving an active task to an idle
> CPU. It makes sense.

Not sure about this. I thought so far that the sched_change pattern is
doing a task dequeue + enqueue on the same CPU (this CPU or another)?
So you can't come out of idle here. We lock the rq before we call
scoped_guard (sched_change, ...).

I think Vincent is right in saying that update_rq_avg_idle() should be
put into put_prev_task_idle() instead.

Still waiting for the DCPerf Mediawiki test results to see if this
change fixes my 'rq->avg_idle being too big' issue.
On 17.12.25 17:15, Dietmar Eggemann wrote:
> On 15.12.25 10:35, Shijie Huang wrote:
>>
>> On 12/12/2025 22:22, Dietmar Eggemann wrote:

[...]

> I think Vincent is right in saying that update_rq_avg_idle() should be
> put into put_prev_task_idle() instead.
>
> Still waiting for the DCPerf Mediawiki test results to see if this
> change fixes my 'rq->avg_idle being too big' issue.

Turns out the patch didn't fix this issue. Still seeing a huge number
of sched_balance_newidle() calls in which the system is (1) overloaded
and (2) this_rq->avg_idle >= sd->max_newidle_lb_cost, so that there is
no early bailout and no task is pulled at the end. Must be something
else ...
On Wed, 7 Jan 2026 at 16:48, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 17.12.25 17:15, Dietmar Eggemann wrote:
> > On 15.12.25 10:35, Shijie Huang wrote:
> >>
> >> On 12/12/2025 22:22, Dietmar Eggemann wrote:
>
> [...]
>
> > I think Vincent is right in saying that update_rq_avg_idle() should be
> > put into put_prev_task_idle() instead.
> >
> > Still waiting for the DCPerf Mediawiki test results to see if this
> > change fixes my 'rq->avg_idle being too big' issue.
>
> Turns out the patch didn't fix this issue. Still seeing a huge number of
> sched_balance_newidle() calls in which the system is (1) overloaded and
> (2) this_rq->avg_idle >= sd->max_newidle_lb_cost so that there is no
> early bailout and no task is pulled at the end. Must be something else ...

Do you mean the v6 or v7 version?

sched_balance_newidle() will be called every time there is no CFS or
higher priority task to pick next, but we will not loop over the sched
domains and will bail out early. So having a huge number of calls to
sched_balance_newidle() is normal, but if you don't bail out early and
loop over the sched domains then it's a problem.
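The early-bailout condition Dietmar quotes can be modeled in isolation. The sketch below is a user-space illustration only; `newidle_should_bail()` and its boolean/integer inputs are stand-ins for kernel state (`get_rd_overloaded(this_rq->rd)`, `this_rq->avg_idle`, `sd->max_newidle_lb_cost`), not the real functions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the sched_balance_newidle() early-bailout check quoted in
 * the thread:
 *
 *     if (!get_rd_overloaded(this_rq->rd) ||
 *         this_rq->avg_idle < sd->max_newidle_lb_cost)
 *
 * Returns true when newidle balancing should give up before walking
 * the sched domains: either no runqueue in the root domain is
 * overloaded (nothing to pull), or the CPU is expected to stay idle
 * for less time than a balance attempt would cost.
 */
static bool newidle_should_bail(bool rd_overloaded, uint64_t avg_idle,
				uint64_t max_newidle_lb_cost)
{
	return !rd_overloaded || avg_idle < max_newidle_lb_cost;
}
```

This makes the reported symptom concrete: with the system overloaded and avg_idle very large, neither disjunct holds, so the (often fruitless) domain walk runs on every call.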