For an RT thread, only 'set_next_task_rt' calls
'update_stats_wait_end_rt' to update schedstats information.
However, during RT task migration,
'update_stats_wait_start_rt' is called twice, which
causes the values of wait_max and wait_sum to be incorrect.
The specific output is as follows:
$ cat /proc/6046/task/6046/sched | grep wait
wait_start : 0.000000
wait_max : 496717.080029
wait_sum : 7921540.776553
Add a call to 'update_stats_wait_end_rt' in 'update_stats_dequeue_rt'
so that schedstats information is updated at dequeue_task time.
Signed-off-by: Dengjun Su <dengjun.su@mediatek.com>
---
kernel/sched/rt.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f1867fe8e5c5..12f2efddca9f 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1302,13 +1302,18 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
 			int flags)
 {
 	struct task_struct *p = NULL;
+	struct rq *rq = rq_of_rt_rq(rt_rq);
 
 	if (!schedstat_enabled())
 		return;
 
-	if (rt_entity_is_task(rt_se))
+	if (rt_entity_is_task(rt_se)) {
 		p = rt_task_of(rt_se);
+		if (p != rq->curr)
+			update_stats_wait_end_rt(rt_rq, rt_se);
+	}
+
 	if ((flags & DEQUEUE_SLEEP) && p) {
 		unsigned int state;
--
2.43.0
On Thu, Jan 08, 2026 at 11:13:07AM +0800, Dengjun Su wrote:
> For RT thread, only 'set_next_task_rt' will call
> 'update_stats_wait_end_rt' to update schedstats information.
> However, during the RT migration process,
> 'update_stats_wait_start_rt' will be called twice, which
> will cause the values of wait_max and wait_sum to be incorrect.

Right, that loses time. Also note that I think dl has the same issue.

> The specific output as follows:
>
> $ cat /proc/6046/task/6046/sched | grep wait
> wait_start : 0.000000
> wait_max : 496717.080029
> wait_sum : 7921540.776553
>
> Add 'update_stats_wait_end_rt' in 'update_stats_dequeue_rt' to
> update schedstats information when dequeue_task.

This needs a few more words on why this is correct -- notably it took me
a little time to find the 'task_on_rq_migrating()' case in
__update_stats_wait_end() which makes this not actually 'end'.

But then the corresponding clause in __update_stats_wait_start() gives
me a headache:

	'wait_start > prev_wait_start'

I mean, wtf. Should that not equally be using task_on_rq_migrating() ?

Can you please take a hard look at all that and fix up things all-round?
On Thu, 2026-01-08 at 12:16 +0100, Peter Zijlstra wrote:
> On Thu, Jan 08, 2026 at 11:13:07AM +0800, Dengjun Su wrote:
> > For RT thread, only 'set_next_task_rt' will call
> > 'update_stats_wait_end_rt' to update schedstats information.
> > However, during the RT migration process,
> > 'update_stats_wait_start_rt' will be called twice, which
> > will cause the values of wait_max and wait_sum to be incorrect.
>
> Right, that loses time. Also note that I think dl has the same issue.
Hi Peter,
Thanks for the feedback. Yes, sorry for missing the dl class;
I will update it in V2.
>
> > The specific output as follows:
> > $ cat /proc/6046/task/6046/sched | grep wait
> > wait_start : 0.000000
> > wait_max : 496717.080029
> > wait_sum : 7921540.776553
> >
> > Add 'update_stats_wait_end_rt' in 'update_stats_dequeue_rt' to
> > update schedstats information when dequeue_task.
>
> This needs a few more words on why this is correct -- notably it took
> me a little time to find the 'task_on_rq_migrating()' case in
> __update_stats_wait_end() which makes this not actually 'end'.
>
> But then the corresponding clause in __update_stats_wait_start() gives
> me a headache:
>
> 	'wait_start > prev_wait_start'
>
> I mean, wtf. Should that not equally be using task_on_rq_migrating() ?
>
> Can you please take a hard look at all that and fix up things
> all-round?
>
A complete schedstats update flow for a migration should be:
__update_stats_wait_start() [enter queue A, stage 1] ->
__update_stats_wait_end() [leave queue A, stage 2] ->
__update_stats_wait_start() [enter queue B, stage 3] ->
__update_stats_wait_end() [start running on queue B, stage 4]
Stage 1: prev_wait_start is 0, and in the end, wait_start records the
time of entering the queue.
Stage 2: task_on_rq_migrating(p) is true, and wait_start is updated to
the waiting time on queue A.
Stage 3: prev_wait_start is the waiting time on queue A, wait_start is
the time of entering queue B, and wait_start is expected to be greater
than prev_wait_start. Under this condition, wait_start is updated to
(the moment of entering queue B) - (the waiting time on queue A).
Stage 4: the final wait time = (time when starting to run on queue B)
- (time of entering queue B) + (waiting time on queue A) = waiting
time on queue B + waiting time on queue A.
The current problem is that stage 2 does not call __update_stats_wait_end()
to update wait_start, so the final computed wait time = (waiting time on
queue B) + (the absolute time of entering queue A), leading to incorrect
wait_max and wait_sum.
For __update_stats_wait_end(), task_on_rq_migrating(p) is needed to
distinguish between stage 2 and stage 4 because they involve different
processing flows, but for __update_stats_wait_start(), it is not necessary
to distinguish between stage 1 and stage 3.
As for adding the condition wait_start > prev_wait_start, I think it is
more like a mechanism to prevent statistical deviations caused by time
inconsistencies.
Thanks
On Fri, Jan 09, 2026 at 03:24:47PM +0800, Dengjun Su wrote:
> For __update_stats_wait_end(), task_on_rq_migrating(p) is needed to
> distinguish between stage 2 and stage 4 because they involve different
> processing flows, but for __update_stats_wait_start(), it is not necessary
> to distinguish between stage 1 and stage 3.
>
> As for adding the condition wait_start > prev_wait_start, I think it is
> more like a mechanism to prevent statistical deviations caused by time
> inconsistencies.
It looks like nonsense to me.. since you have a test-case, could you see
what this does for you?
Specifically:
- it ensures that when not in a migration, prev_wait_start must be 0
- it unconditionally subtracts; unsigned types are defined to wrap
nicely (2s complement) and it all should work just fine.
---
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index d1c9429a4ac5..144b23029327 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -12,8 +12,10 @@ void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
 	wait_start = rq_clock(rq);
 	prev_wait_start = schedstat_val(stats->wait_start);
 
-	if (p && likely(wait_start > prev_wait_start))
+	if (p) {
+		WARN_ON_ONCE(!task_on_rq_migrating(p) && prev_wait_start);
 		wait_start -= prev_wait_start;
+	}
 
 	__schedstat_set(stats->wait_start, wait_start);
 }
On Mon, 2026-01-12 at 17:38 +0100, Peter Zijlstra wrote:
> On Fri, Jan 09, 2026 at 03:24:47PM +0800, Dengjun Su wrote:
>
> > For __update_stats_wait_end(), task_on_rq_migrating(p) is needed to
> > distinguish between stage 2 and stage 4 because they involve different
> > processing flows, but for __update_stats_wait_start(), it is not
> > necessary to distinguish between stage 1 and stage 3.
> >
> > As for adding the condition wait_start > prev_wait_start, I think it
> > is more like a mechanism to prevent statistical deviations caused by
> > time inconsistencies.
>
> It looks like nonsense to me.. since you have a test-case, could you
> see what this does for you?
>
> Specifically:
>
> - it ensures that when not in a migration, prev_wait_start must be 0
>
> - it unconditionally subtracts; unsigned types are defined to wrap
>   nicely (2s complement) and it all should work just fine.
>
> ---
>
> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
> index d1c9429a4ac5..144b23029327 100644
> --- a/kernel/sched/stats.c
> +++ b/kernel/sched/stats.c
> @@ -12,8 +12,10 @@ void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
>  	wait_start = rq_clock(rq);
>  	prev_wait_start = schedstat_val(stats->wait_start);
>  
> -	if (p && likely(wait_start > prev_wait_start))
> +	if (p) {
> +		WARN_ON_ONCE(!task_on_rq_migrating(p) && prev_wait_start);
>  		wait_start -= prev_wait_start;
> +	}
>  
>  	__schedstat_set(stats->wait_start, wait_start);
>  }
Hi Peter,
I have confirmed that in my current test case, adding this change or
not has no impact. I think this is reasonable; task_on_rq_migrating(p)
should be true in my test case.

Based on the original patch that added this, I think the logic of
__update_stats_wait_start() is:

1. If migrating and the time advances normally, it records the
   waiting time on the previous queue.
2. If the time is abnormal, it updates wait_start to the current time.
   For a migration, this discards the previously accumulated waiting
   time on the previous queue (perhaps because that value itself
   is suspect).

With the current modification, if the time is abnormal, wait_start
will no longer be reset to the current time.
A partial fragment of the original patch follows.
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..53ec4d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -738,27 +738,41 @@ static void update_curr_fair(struct rq *rq)
 }
 
 static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			bool migrating)
 {
-	schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
+	schedstat_set(se->statistics.wait_start,
+		      migrating &&
+		      likely(rq_clock(rq_of(cfs_rq)) > se->statistics.wait_start) ?
+		      rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start :
+		      rq_clock(rq_of(cfs_rq)));
 }