In the newidle balance, the rq->idle_stamp may set to a non-zero value
if it cannot pull any task.
In the wakeup, it will detect the rq->idle_stamp, and updates
the rq->avg_idle, then ends the CPU idle status by setting rq->idle_stamp
to zero.
Besides the wakeup, current code does not end the CPU idle status
when a task is moved to the idle CPU, such as fork/clone, execve,
or other cases. In order to get more accurate rq->avg_idle,
we need to update it at more places(not only the wakeup).
This patch introduces a helper: update_rq_avg_idle().
And uses it in enqueue_task(), so it will update the rq->avg_idle
when a task is moved to an idle CPU at:
-- wakeup
-- fork/clone
-- execve
-- idle balance
-- other cases
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
kernel/sched/core.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f10cfbdc228..2e3c4043de51 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2078,6 +2078,21 @@ unsigned long get_wchan(struct task_struct *p)
return ip;
}
+static void update_rq_avg_idle(struct rq *rq)
+{
+ if (rq->idle_stamp) {
+ u64 delta = rq_clock(rq) - rq->idle_stamp;
+ u64 max = 2*rq->max_idle_balance_cost;
+
+ update_avg(&rq->avg_idle, delta);
+
+ if (rq->avg_idle > max)
+ rq->avg_idle = max;
+
+ rq->idle_stamp = 0;
+ }
+}
+
void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -2100,6 +2115,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (sched_core_enabled(rq))
sched_core_enqueue(rq, p);
+
+ update_rq_avg_idle(rq);
}
/*
@@ -3645,18 +3662,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
p->sched_class->task_woken(rq, p);
rq_repin_lock(rq, rf);
}
-
- if (rq->idle_stamp) {
- u64 delta = rq_clock(rq) - rq->idle_stamp;
- u64 max = 2*rq->max_idle_balance_cost;
-
- update_avg(&rq->avg_idle, delta);
-
- if (rq->avg_idle > max)
- rq->avg_idle = max;
-
- rq->idle_stamp = 0;
- }
}
/*
--
2.40.1
On Tue, 9 Dec 2025 at 10:46, Huang Shijie <shijie@os.amperecomputing.com> wrote:
>
> In the newidle balance, the rq->idle_stamp may set to a non-zero value
> if it cannot pull any task.
>
> In the wakeup, it will detect the rq->idle_stamp, and updates
> the rq->avg_idle, then ends the CPU idle status by setting rq->idle_stamp
> to zero.
>
> Besides the wakeup, current code does not end the CPU idle status
> when a task is moved to the idle CPU, such as fork/clone, execve,
> or other cases. In order to get more accurate rq->avg_idle,
> we need to update it at more places(not only the wakeup).
>
> This patch introduces a helper: update_rq_avg_idle().
> And uses it in enqueue_task(), so it will update the rq->avg_idle
> when a task is moved to an idle CPU at:
> -- wakeup
> -- fork/clone
> -- execve
> -- idle balance
> -- other cases
>
> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
> kernel/sched/core.c | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9f10cfbdc228..2e3c4043de51 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2078,6 +2078,21 @@ unsigned long get_wchan(struct task_struct *p)
> return ip;
> }
>
> +static void update_rq_avg_idle(struct rq *rq)
> +{
> + if (rq->idle_stamp) {
> + u64 delta = rq_clock(rq) - rq->idle_stamp;
> + u64 max = 2*rq->max_idle_balance_cost;
> +
> + update_avg(&rq->avg_idle, delta);
> +
> + if (rq->avg_idle > max)
> + rq->avg_idle = max;
> +
> + rq->idle_stamp = 0;
> + }
> +}
> +
> void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> if (!(flags & ENQUEUE_NOCLOCK))
> @@ -2100,6 +2115,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>
> if (sched_core_enabled(rq))
> sched_core_enqueue(rq, p);
> +
> + update_rq_avg_idle(rq);
put_prev_task_idle() would be a better place to call
update_rq_avg_idle() because this is when we leave idle.
> }
>
> /*
> @@ -3645,18 +3662,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> p->sched_class->task_woken(rq, p);
> rq_repin_lock(rq, rf);
> }
> -
> - if (rq->idle_stamp) {
> - u64 delta = rq_clock(rq) - rq->idle_stamp;
> - u64 max = 2*rq->max_idle_balance_cost;
> -
> - update_avg(&rq->avg_idle, delta);
> -
> - if (rq->avg_idle > max)
> - rq->avg_idle = max;
> -
> - rq->idle_stamp = 0;
> - }
> }
>
> /*
> --
> 2.40.1
>
On 13/12/2025 09:36, Vincent Guittot wrote: > put_prev_task_idle() would be a better place to call > update_rq_avg_idle() because this is when we leave idle. The update_rq_avg_idle() is not only called by current CPU, but also called by other CPUs. For example, the try_to_wake_up(), update_rq_avg_idle() is called by the other CPUs. So enqueue_task() is a good place. Thanks Huang Shijie
On Tue, 16 Dec 2025 at 07:22, Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote: > > > On 13/12/2025 09:36, Vincent Guittot wrote: > > put_prev_task_idle() would be a better place to call > > update_rq_avg_idle() because this is when we leave idle. > > The update_rq_avg_idle() is not only called by current CPU, but also > called by > > other CPUs. For example, the try_to_wake_up(), update_rq_avg_idle() is > called by > > the other CPUs. So enqueue_task() is a good place. But put_prev_task_idle() is called by local CPU whenever it leaves idle so instead of trying to catch all places that could make the CPU leave idle it's better to use this single place. And as you mentioned, put_prev_task_idle is only called by local CPU whereas enqueue_task can be called by all CPUs creating useless pressure in the variable. So I disagree when you say enqueue_task() is a "good place" Thanks > > > Thanks > > Huang Shijie >
On 16/12/2025 15:17, Vincent Guittot wrote: > On Tue, 16 Dec 2025 at 07:22, Shijie Huang > <shijie@amperemail.onmicrosoft.com> wrote: >> >> On 13/12/2025 09:36, Vincent Guittot wrote: >>> put_prev_task_idle() would be a better place to call >>> update_rq_avg_idle() because this is when we leave idle. >> The update_rq_avg_idle() is not only called by current CPU, but also >> called by >> >> other CPUs. For example, the try_to_wake_up(), update_rq_avg_idle() is >> called by >> >> the other CPUs. So enqueue_task() is a good place. > But put_prev_task_idle() is called by local CPU whenever it leaves > idle so instead of trying to catch all places that could make the CPU > leave idle it's better to use this single place. > And as you mentioned, put_prev_task_idle is only called by local CPU > whereas enqueue_task can be called by all CPUs creating useless > pressure in the variable. The rq->idle_stamp is set at sched_balance_newidle(). then we call update_rq_avg_idle() in put_prev_task_idle() right now. How can we update the rq->avg_idle? Thanks Huang Shijie >>
On Tue, 16 Dec 2025 at 08:39, Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote: > > > On 16/12/2025 15:17, Vincent Guittot wrote: > > On Tue, 16 Dec 2025 at 07:22, Shijie Huang > > <shijie@amperemail.onmicrosoft.com> wrote: > >> > >> On 13/12/2025 09:36, Vincent Guittot wrote: > >>> put_prev_task_idle() would be a better place to call > >>> update_rq_avg_idle() because this is when we leave idle. > >> The update_rq_avg_idle() is not only called by current CPU, but also > >> called by > >> > >> other CPUs. For example, the try_to_wake_up(), update_rq_avg_idle() is > >> called by > >> > >> the other CPUs. So enqueue_task() is a good place. > > But put_prev_task_idle() is called by local CPU whenever it leaves > > idle so instead of trying to catch all places that could make the CPU > > leave idle it's better to use this single place. > > And as you mentioned, put_prev_task_idle is only called by local CPU > > whereas enqueue_task can be called by all CPUs creating useless > > pressure in the variable. > > The rq->idle_stamp is set at sched_balance_newidle(). then we call > update_rq_avg_idle() > > in put_prev_task_idle() right now. How can we update the rq->avg_idle? I'm not sure I understand your point. rq->avg_idle tracks idle time. The easiest way would be to use - set_next_task_idle() when we enter idle - put_prev_task_idle() when we exit idle Except that sched_balance_newidle() can be long and the time should be accounted as idle time too. So instead of using set_next_task_idle(), we use sched_balance_newidle() to set . Which is okay because sched_balance_newidle() is always called before going to idle. > > Thanks > > Huang Shijie > > >>
On 16/12/2025 16:47, Vincent Guittot wrote: > On Tue, 16 Dec 2025 at 08:39, Shijie Huang > <shijie@amperemail.onmicrosoft.com> wrote: >> >> On 16/12/2025 15:17, Vincent Guittot wrote: >>> On Tue, 16 Dec 2025 at 07:22, Shijie Huang >>> <shijie@amperemail.onmicrosoft.com> wrote: >>>> On 13/12/2025 09:36, Vincent Guittot wrote: >>>>> put_prev_task_idle() would be a better place to call >>>>> update_rq_avg_idle() because this is when we leave idle. >>>> The update_rq_avg_idle() is not only called by current CPU, but also >>>> called by >>>> >>>> other CPUs. For example, the try_to_wake_up(), update_rq_avg_idle() is >>>> called by >>>> >>>> the other CPUs. So enqueue_task() is a good place. >>> But put_prev_task_idle() is called by local CPU whenever it leaves >>> idle so instead of trying to catch all places that could make the CPU >>> leave idle it's better to use this single place. >>> And as you mentioned, put_prev_task_idle is only called by local CPU >>> whereas enqueue_task can be called by all CPUs creating useless >>> pressure in the variable. >> The rq->idle_stamp is set at sched_balance_newidle(). then we call >> update_rq_avg_idle() >> >> in put_prev_task_idle() right now. How can we update the rq->avg_idle? > I'm not sure I understand your point. > > rq->avg_idle tracks idle time. The easiest way would be to use > - set_next_task_idle() when we enter idle > - put_prev_task_idle() when we exit idle > > Except that sched_balance_newidle() can be long and the time should be > accounted as idle time too. So instead of using set_next_task_idle(), > we use sched_balance_newidle() to set . Which is okay because > sched_balance_newidle() is always called before going to idle. Thanks for the explanations. It seems that put_prev_task_idle() is really a better place to call update_rq_avg_idle(). Let me think it for a while :) Thanks Huang Shijie >
On 09.12.25 10:45, Huang Shijie wrote: > In the newidle balance, the rq->idle_stamp may set to a non-zero value > if it cannot pull any task. > > In the wakeup, it will detect the rq->idle_stamp, and updates > the rq->avg_idle, then ends the CPU idle status by setting rq->idle_stamp > to zero. > > Besides the wakeup, current code does not end the CPU idle status > when a task is moved to the idle CPU, such as fork/clone, execve, > or other cases. In order to get more accurate rq->avg_idle, > we need to update it at more places(not only the wakeup). > > This patch introduces a helper: update_rq_avg_idle(). > And uses it in enqueue_task(), so it will update the rq->avg_idle > when a task is moved to an idle CPU at: > -- wakeup > -- fork/clone > -- execve > -- idle balance > -- other cases [...] In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to enqueue_task() to possibly handle delayed tasks. In v3 you figured there can't be any delayed task on a CPU when it sets rq->idle_stamp in sched_balance_newidle() So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for the sched_change pattern for instance? I tried to understand whether this patch could help with one issue we currently have with 'DCPerf Mediawiki' benchmark where on 2 comparable servers, one has a 10% lower CPU utilization and I traced it down to significantly different behaviour in sched_balance_newidle() and further down to: if (!get_rd_overloaded(this_rq->rd) || this_rq->avg_idle < sd->max_newidle_lb_cost) where the one with the lower CPU utilization bails out way less often (because this_rq->avg_idle is very high, system is overloaded). But I failed so far. Anyway, we'll use this patch for another test run right now.
On 12/12/2025 00:15, Dietmar Eggemann wrote: > In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to > enqueue_task() to possibly handle delayed tasks. Not for the delayed tasks, just for a more common place to handle more cases. > In v3 you figured there can't be any delayed task on a CPU when it sets > rq->idle_stamp in sched_balance_newidle() Yes. > So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for > the sched_change pattern for instance? Could you please tell me what is "avoiding the 'if rq->idle_stamp' for the sched_change pattern" ? Sorry, I do not understand your meaning. IMHO, there is no need to move (1) back to (2). Thanks Huang Shijie
On 12.12.25 04:16, Shijie Huang wrote:
>
> On 12/12/2025 00:15, Dietmar Eggemann wrote:
>> In v2 you moved update_rq_avg_idle() (1) from activate_task() (2) to
>> enqueue_task() to possibly handle delayed tasks.
> Not for the delayed tasks, just for a more common place to handle more
> cases.
>> In v3 you figured there can't be any delayed task on a CPU when it sets
>> rq->idle_stamp in sched_balance_newidle()
> Yes.
>> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for
>> the sched_change pattern for instance?
> Could you please tell me what is "avoiding the 'if rq->idle_stamp' for
> the sched_change pattern" ?
>
> Sorry, I do not understand your meaning.
sched_change uses dequeue_task()/enqueue_task() for a queued task to
change prio, policy, sched params, taskgroups, etc.
kernel/sched/sched.h:
DEFINE_CLASS(sched_change, struct sched_change_ctx *,
sched_change_end(_T),
sched_change_begin(p, flags),
struct task_struct *p, unsigned int flags)
kernel/sched/core.c:
struct sched_change_ctx *sched_change_begin(struct task_struct *p,
unsigned int flags)
void sched_change_end(struct sched_change_ctx *ctx)
> IMHO, there is no need to move (1) back to (2).
Btw, I'm using a BUG_ON(this_rq->idle_stamp) in sched_balance_newidle()
with update_rq_avg_idle() in activate_task() for testing.
On 12/12/2025 22:22, Dietmar Eggemann wrote: >>> So you could move (1) back to (2) avoiding the 'if rq->idle_stamp' for >>> the sched_change pattern for instance? >> Could you please tell me what is "avoiding the 'if rq->idle_stamp' for >> the sched_change pattern" ? >> >> Sorry, I do not understand your meaning. > sched_change uses dequeue_task()/enqueue_task() for a queued task to > change prio, policy, sched params, taskgroups, etc. For sched_change, the dequeue_task()/enqueue_task() only work when the queued task has TASK_ON_RQ_QUEUED flags. The TASK_ON_RQ_QUEUED is set in activate_task(). 1.) For this active task, if the sched_change makes it dequeue_task()/enqueue_task() on current CPU, it's okay. Since current CPU is not in the newidle, the "rq->idle_stamp" is 0 at this case. This patch works fine. 2.) For this active task, if the sched_change makes it dequeue_task()/enqueue_task() on an another CPU, it's okay too. 2.1) If the another CPU's idle_stamp is 0, the another CPU is busy now. The sched_change works fine with this patch. 2.2) If the another CPU's idle_stamp is not 0, the sched_change also works fine with this patch. Since the sched_change is breaking the idle state of the another CPU by moving an active task to an idle CPU. It makes sense. Thanks Huang Shijie
© 2016 - 2025 Red Hat, Inc.