When rt_mutex_setprio changes a task's scheduling class to RT,
sometimes the task's vruntime is not updated correctly upon
return to the fair class.
Specifically, the following is being observed:
- task has just been created and running for a short time
- task sleeps while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
the vruntime_normalized check sees that the task's sum_exec_runtime
is zero, which results in skipping the subtraction of the
rq's min_vruntime from the task's vruntime
- later, when the prio is deboosted and the task is moved back
to the fair class, the fair rq's min_vruntime is added to
the task's vruntime, even though it wasn't subtracted earlier.
Since the task's vruntime is about double that of other tasks in the cfs_rq,
the task is unable to run for a long time when there are continuously
runnable tasks in the cfs_rq.
The immediate result is inflation of the task's vruntime, giving
it lower priority (starving it if there's enough available work).
The longer-term effect is inflation of all vruntimes because the
task's vruntime becomes the rq's min_vruntime when the higher
priority tasks go idle. That leads to a vicious cycle, where
the vruntime inflation is repeatedly doubled.
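The asymmetry described above can be modeled in a few lines of plain C (a
hypothetical userspace sketch with invented names, not kernel code): the
detach path skips the min_vruntime subtraction when sum_exec_runtime is
zero, while the attach path always adds min_vruntime back, so one
boost/deboost cycle leaves the task's vruntime inflated by min_vruntime.

```c
#include <assert.h>

/* Hypothetical userspace model of the buggy detach/attach pair. */
struct fake_se {
	unsigned long long sum_exec_runtime;
	unsigned long long vruntime;
};

/* Mirrors the flawed vruntime_normalized() logic: zero sum_exec_runtime
 * is taken to mean "freshly forked", so the subtraction is skipped. */
static void detach_buggy(struct fake_se *se, unsigned long long min_vruntime)
{
	if (se->sum_exec_runtime)	/* false for short-lived tasks */
		se->vruntime -= min_vruntime;
}

/* The attach side re-adds min_vruntime unconditionally. */
static void attach(struct fake_se *se, unsigned long long min_vruntime)
{
	se->vruntime += min_vruntime;
}

/* One boost-then-deboost cycle for a task that has not yet accumulated
 * any runtime: returns start + min_vruntime, i.e. an inflated vruntime. */
unsigned long long boost_deboost_cycle(unsigned long long start,
				       unsigned long long min_vruntime)
{
	struct fake_se se = { .sum_exec_runtime = 0, .vruntime = start };

	detach_buggy(&se, min_vruntime);	/* subtraction skipped */
	attach(&se, min_vruntime);		/* addition still happens */
	return se.vruntime;
}
```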
The root cause of the problem is that vruntime_normalized makes a
misjudgment. Since the sum_exec_runtime of a task that was just
created and has run only briefly is still zero, vruntime_normalized
mistakenly treats it as a task that has just been forked. Therefore,
the rq's min_vruntime is not subtracted from the vruntime of the
task.
So, fix this bug by adding a check condition for newly forked tasks.
Signed-off-by: mingyang.cui <mingyang.cui@horizon.ai>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 73a89fbd81be..3d0c14f3731f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11112,7 +11112,7 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime ||
+	if (!se->sum_exec_runtime && p->state == TASK_NEW ||
 	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
 		return true;
--
2.34.1
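With the patch applied, a zero sum_exec_runtime alone no longer implies
"just forked"; the task must also still be in TASK_NEW. The patched
condition can be sketched as a standalone predicate (a hypothetical
userspace model with invented state names, not the kernel's enum):

```c
#include <stdbool.h>

/* Invented stand-ins for the kernel's task states. */
enum fake_state { T_NEW, T_WAKING, T_SLEEPING };

/* Model of the patched vruntime_normalized() condition: zero runtime
 * counts as "normalized" only for a task still in the NEW state, plus
 * the unchanged remote-wakeup case. */
bool normalized_patched(unsigned long long sum_exec_runtime,
			enum fake_state state, bool remote_wakeup)
{
	if ((!sum_exec_runtime && state == T_NEW) ||
	    (state == T_WAKING && remote_wakeup))
		return true;
	return false;
}
```

A briefly-run task that went to sleep (zero runtime, not T_NEW) is now
correctly reported as not normalized, so the detach path performs the
min_vruntime subtraction.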
On 2024/3/28 14:27, mingyang.cui wrote:
> When rt_mutex_setprio changes a task's scheduling class to RT,
> sometimes the task's vruntime is not updated correctly upon
> return to the fair class.
> Specifically, the following is being observed:
> - task has just been created and running for a short time
> - task sleep while still in the fair class
> - task is boosted to RT via rt_mutex_setprio, which changes
>   the task to RT and calls check_class_changed.
> - check_class_changed leads to detach_task_cfs_rq, at which point
>   the vruntime_normalized check sees that the task's sum_exec_runtime
>   is zero, which results in skipping the subtraction of the
>   rq's min_vruntime from the task's vruntime

Hi Mingyang,

Did you do the test on the latest tree? vruntime_normalized was removed
by e8f331bcc2 (sched/smp: Use lag to simplify cross-runqueue placement).

Thanks,
Honglei

> - later, when the prio is deboosted and the task is moved back
>   to the fair class, the fair rq's min_vruntime is added to
>   the task's vruntime, even though it wasn't subtracted earlier.
>
> Since the task's vruntime is about double that of other tasks in cfs_rq,
> the task to be unable to run for a long time when there are continuous
> runnable tasks in cfs_rq.
>
> The immediate result is inflation of the task's vruntime, giving
> it lower priority (starving it if there's enough available work).
> The longer-term effect is inflation of all vruntimes because the
> task's vruntime becomes the rq's min_vruntime when the higher
> priority tasks go idle. That leads to a vicious cycle, where
> the vruntime inflation repeatedly doubled.
>
> The root cause of the problem is that the vruntime_normalized made a
> misjudgment. Since the sum_exec_runtime of some tasks that were just
> created and run for a short time is zero, the vruntime_normalized
> mistakenly thinks that they are tasks that have just been forked.
> Therefore, sum_exec_runtime is not subtracted from the vruntime of the
> task.
>
> So, we fix this bug by adding a check condition for newly forked task.
>
> Signed-off-by: mingyang.cui <mingyang.cui@horizon.ai>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 73a89fbd81be..3d0c14f3731f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11112,7 +11112,7 @@ static inline bool vruntime_normalized(struct task_struct *p)
>   * - A task which has been woken up by try_to_wake_up() and
>   *   waiting for actually being woken up by sched_ttwu_pending().
>   */
> -	if (!se->sum_exec_runtime ||
> +	if (!se->sum_exec_runtime && p->state == TASK_NEW ||
>  	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
>  		return true;
On Mon, Apr 1, 2024 at 8:20 AM Honglei Wang <jameshongleiwang@126.com> wrote:
> On 2024/3/28 14:27, mingyang.cui wrote:
> > When rt_mutex_setprio changes a task's scheduling class to RT,
> > sometimes the task's vruntime is not updated correctly upon
> > return to the fair class.
> > Specifically, the following is being observed:
> > - task has just been created and running for a short time
> > - task sleep while still in the fair class
> > - task is boosted to RT via rt_mutex_setprio, which changes
> >   the task to RT and calls check_class_changed.
> > - check_class_changed leads to detach_task_cfs_rq, at which point
> >   the vruntime_normalized check sees that the task's sum_exec_runtime
> >   is zero, which results in skipping the subtraction of the
> >   rq's min_vruntime from the task's vruntime
>
> Did you do the test on the latest tree? vruntime_normalized was removed
> by e8f331bcc2 (sched/smp: Use lag to simplify cross-runqueue placement).

Indeed (I was looking at an older tree last week and missed it was
removed as well). Though something like this change probably will be
needed for the -stable trees?

thanks
-john
On Thu, Mar 28, 2024 at 8:21 AM mingyang.cui <mingyang.cui@horizon.ai> wrote:
>
> When rt_mutex_setprio changes a task's scheduling class to RT,
> sometimes the task's vruntime is not updated correctly upon
> return to the fair class.
> Specifically, the following is being observed:
> - task has just been created and running for a short time
> - task sleep while still in the fair class
> - task is boosted to RT via rt_mutex_setprio, which changes
>   the task to RT and calls check_class_changed.
> - check_class_changed leads to detach_task_cfs_rq, at which point
>   the vruntime_normalized check sees that the task's sum_exec_runtime
>   is zero, which results in skipping the subtraction of the
>   rq's min_vruntime from the task's vruntime
> - later, when the prio is deboosted and the task is moved back
>   to the fair class, the fair rq's min_vruntime is added to
>   the task's vruntime, even though it wasn't subtracted earlier.

Just to make sure I am following: since sum_exec_runtime is updated in
update_curr(), if the task goes to sleep, shouldn't it be dequeued and
thus update_curr() would have been run (and thus sum_exec_runtime would
be non-zero)?

Maybe in this analysis is the new task being boosted while it is still
newly running (instead of sleeping)?

> Since the task's vruntime is about double that of other tasks in cfs_rq,
> the task to be unable to run for a long time when there are continuous
> runnable tasks in cfs_rq.
>
> The immediate result is inflation of the task's vruntime, giving
> it lower priority (starving it if there's enough available work).
> The longer-term effect is inflation of all vruntimes because the
> task's vruntime becomes the rq's min_vruntime when the higher
> priority tasks go idle. That leads to a vicious cycle, where
> the vruntime inflation repeatedly doubled.

This is an interesting find! I'm curious how you detected the problem,
as it might make a good correctness test (which I'm selfishly looking
for more of myself :)

> The root cause of the problem is that the vruntime_normalized made a
> misjudgment. Since the sum_exec_runtime of some tasks that were just
> created and run for a short time is zero, the vruntime_normalized
> mistakenly thinks that they are tasks that have just been forked.
> Therefore, sum_exec_runtime is not subtracted from the vruntime of the
> task.
>
> So, we fix this bug by adding a check condition for newly forked task.
>
> Signed-off-by: mingyang.cui <mingyang.cui@horizon.ai>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 73a89fbd81be..3d0c14f3731f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11112,7 +11112,7 @@ static inline bool vruntime_normalized(struct task_struct *p)
>   * - A task which has been woken up by try_to_wake_up() and
>   *   waiting for actually being woken up by sched_ttwu_pending().
>   */
> -	if (!se->sum_exec_runtime ||
> +	if (!se->sum_exec_runtime && p->state == TASK_NEW ||
>  	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
>  		return true;

This looks like it was applied against an older tree? The p->state
accesses should be under a READ_ONCE (and likely consolidated before
the conditional?)

Also, I wonder if alternatively a call to update_curr() if
(cfs_rq->curr == se) in switched_from_fair() would be good (ie:
normalize it on the boost to close the edge case rather than handle it
as an expected non-normalized condition)?

thanks
-john
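John's alternative, normalizing at boost time rather than special-casing
the state, can be sketched in the same hypothetical userspace model used
above (invented names, not the kernel's switched_from_fair()): accounting
the task's runtime before detaching makes sum_exec_runtime nonzero, so
even the old check performs the subtraction and no inflation occurs.

```c
#include <assert.h>

struct model_se {
	unsigned long long sum_exec_runtime;
	unsigned long long vruntime;
};

/* Model of update_curr(): account whatever the task has run so far. */
static void model_update_curr(struct model_se *se, unsigned long long delta)
{
	se->sum_exec_runtime += delta;
	se->vruntime += delta;
}

/* Old, unpatched check: nonzero runtime means "not normalized",
 * so the subtraction happens. */
static void detach_old(struct model_se *se, unsigned long long min_vruntime)
{
	if (se->sum_exec_runtime)
		se->vruntime -= min_vruntime;	/* u64 wraps, as in the kernel */
}

/* Calling update_curr() before the detach closes the window where a
 * briefly-running task still has sum_exec_runtime == 0: the subtraction
 * and the later re-addition now balance out. */
unsigned long long boost_with_update(unsigned long long start,
				     unsigned long long ran,
				     unsigned long long min_vruntime)
{
	struct model_se se = { .sum_exec_runtime = 0, .vruntime = start };

	model_update_curr(&se, ran);		/* ran > 0: runtime nonzero */
	detach_old(&se, min_vruntime);		/* subtraction happens */
	se.vruntime += min_vruntime;		/* later re-attach */
	return se.vruntime;			/* start + ran: no inflation */
}
```

The intermediate subtraction may wrap below zero; unsigned arithmetic
makes the later addition restore the exact value, which mirrors how the
kernel relies on u64 wraparound for vruntime deltas.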