[PATCH v3 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU

Huang Shijie posted 2 patches 4 days, 12 hours ago
There is a newer version of this series
Posted by Huang Shijie 4 days, 12 hours ago
During newidle balance, rq->idle_stamp may be set to a non-zero value
if no task can be pulled.

On wakeup, the scheduler checks rq->idle_stamp, updates rq->avg_idle,
and then ends the CPU's idle status by setting rq->idle_stamp to zero.

Apart from wakeup, the current code does not end the CPU's idle status
when a task is moved to the idle CPU, for example by fork/clone, execve,
or other paths.

This patch introduces a helper, update_rq_avg_idle(), and calls it from
enqueue_task(), so that rq->avg_idle is updated whenever a task is moved
to an idle CPU by:
   -- wakeup
   -- fork/clone
   -- execve
   -- idle balance
   -- delayed dequeue task
   -- other cases

Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
 kernel/sched/core.c | 36 ++++++++++++++++++++++++------------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c4ff93eeb78..8531ef68ce76 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2078,8 +2078,25 @@ unsigned long get_wchan(struct task_struct *p)
 	return ip;
 }
 
+static void update_rq_avg_idle(struct rq *rq)
+{
+	if (rq->idle_stamp) {
+		u64 delta = rq_clock(rq) - rq->idle_stamp;
+		u64 max = 2*rq->max_idle_balance_cost;
+
+		update_avg(&rq->avg_idle, delta);
+
+		if (rq->avg_idle > max)
+			rq->avg_idle = max;
+
+		rq->idle_stamp = 0;
+	}
+}
+
 void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	int delayed = p->se.sched_delayed;
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -2100,6 +2117,13 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 
 	if (sched_core_enabled(rq))
 		sched_core_enqueue(rq, p);
+
+	if (delayed) {
+		if (entity_eligible(cfs_rq_of(&p->se), &p->se))
+			update_rq_avg_idle(rq);
+	} else {
+		update_rq_avg_idle(rq);
+	}
 }
 
 /*
@@ -3645,18 +3669,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		p->sched_class->task_woken(rq, p);
 		rq_repin_lock(rq, rf);
 	}
-
-	if (rq->idle_stamp) {
-		u64 delta = rq_clock(rq) - rq->idle_stamp;
-		u64 max = 2*rq->max_idle_balance_cost;
-
-		update_avg(&rq->avg_idle, delta);
-
-		if (rq->avg_idle > max)
-			rq->avg_idle = max;
-
-		rq->idle_stamp = 0;
-	}
 }
 
 /*
-- 
2.40.1
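
For readers following the arithmetic in update_rq_avg_idle(), here is a minimal user-space sketch of the same bookkeeping. The names struct rq_model and model_update_rq_avg_idle() are hypothetical stand-ins, not kernel symbols. The clock is an ordinary field rather than rq_clock(). The update_avg() body assumes the kernel's usual "move one eighth of the way toward the sample" rule.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct rq fields touched by the patch. */
struct rq_model {
	uint64_t clock;                 /* stand-in for rq_clock(rq) */
	uint64_t idle_stamp;            /* non-zero while the CPU counts as idle */
	uint64_t avg_idle;              /* running average of idle durations */
	uint64_t max_idle_balance_cost; /* avg_idle is capped at twice this */
};

/* Assumed to mirror the kernel's update_avg(): step 1/8 toward the sample. */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += (uint64_t)(diff / 8);
}

/* Same steps as the helper in the patch, applied to the model struct. */
static void model_update_rq_avg_idle(struct rq_model *rq)
{
	if (rq->idle_stamp) {
		uint64_t delta, max;

		/* The model assumes the clock never runs behind the stamp. */
		assert(rq->clock >= rq->idle_stamp);

		delta = rq->clock - rq->idle_stamp;
		max = 2 * rq->max_idle_balance_cost;

		update_avg(&rq->avg_idle, delta);

		if (rq->avg_idle > max)
			rq->avg_idle = max;

		/* Clearing the stamp ends the idle period. */
		rq->idle_stamp = 0;
	}
}
```

Because the stamp is cleared on the first call, a second call for the same idle period leaves avg_idle unchanged. That is what makes it safe to call the helper from every enqueue path.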
Re: [PATCH v3 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by K Prateek Nayak 4 days, 11 hours ago
Hello Huang Shijie,

On 11/27/2025 2:44 PM, Huang Shijie wrote:
>  void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	int delayed = p->se.sched_delayed;
> +
>  	if (!(flags & ENQUEUE_NOCLOCK))
>  		update_rq_clock(rq);
>  
> @@ -2100,6 +2117,13 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>  
>  	if (sched_core_enabled(rq))
>  		sched_core_enqueue(rq, p);
> +
> +	if (delayed) {
> +		if (entity_eligible(cfs_rq_of(&p->se), &p->se))
> +			update_rq_avg_idle(rq);

Question: Why do we want to treat the delayed case like this?

If the entity is not eligible, do we want to treat it as if it hasn't
even gone through a wakeup? Wouldn't that lead to the next wakeup
seeing a non-zero rq->idle_stamp and inaccurately accounting more
idle time?

Also if we've done newidle balance and the rq->idle_stamp is
set, we cannot have delayed tasks since pick_next_task() would
have dequeued all delayed tasks before reaching newidle
balance.

Just doing an update_rq_avg_idle() unconditionally should be
fine.

> +	} else {
> +		update_rq_avg_idle(rq);
> +	}
>  }
>  
>  /*
-- 
Thanks and Regards,
Prateek
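
The over-accounting concern raised in this review can be illustrated with a small user-space model. This is only a sketch. struct idle_model, enqueue_v3() and enqueue_unconditional() are hypothetical names, the averaging is left out, and the scenario is artificial: as the review itself notes, a delayed task should not coexist with a set rq->idle_stamp after newidle balance.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: timestamp bookkeeping only, no averaging. */
struct idle_model {
	uint64_t idle_stamp; /* non-zero while the CPU counts as idle */
	uint64_t last_delta; /* idle time accounted by the latest update */
};

static void account_idle(struct idle_model *m, uint64_t now)
{
	if (m->idle_stamp) {
		/* The model assumes time never runs backwards. */
		assert(now >= m->idle_stamp);
		m->last_delta = now - m->idle_stamp;
		m->idle_stamp = 0;
	}
}

/* v3 as posted: a delayed entity is accounted only if it is eligible. */
static void enqueue_v3(struct idle_model *m, uint64_t now,
		       int delayed, int eligible)
{
	if (!delayed || eligible)
		account_idle(m, now);
}

/* Suggested simplification: account on every enqueue. */
static void enqueue_unconditional(struct idle_model *m, uint64_t now)
{
	account_idle(m, now);
}
```

With idle_stamp = 100, a delayed non-eligible enqueue at t = 150 and an ordinary enqueue at t = 400, the v3 variant accounts 300 units of idle time. The unconditional variant accounts 50 and then ignores the second enqueue because the stamp has already been cleared.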
Re: [PATCH v3 2/2] sched: update the rq->avg_idle when a task is moved to an idle CPU
Posted by Shijie Huang 3 days, 15 hours ago
On 27/11/2025 18:12, K Prateek Nayak wrote:
> Also if we've done newidle balance and the rq->idle_stamp is
> set, we cannot have delayed tasks since pick_next_task() would
> have dequeued all delayed tasks before reaching newidle
> balance.
Yes, you are right.
> Just doing an update_rq_avg_idle() unconditionally should be
> fine.

okay.


Thanks

Huang Shijie