[PATCH 2/2] sched/debug: Fix avg_vruntime() usage

Peter Zijlstra posted 2 patches 7 hours ago
Posted by Peter Zijlstra 7 hours ago
John reported that stress-ng-yield could make his machine unhappy and
managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix
zero_vruntime tracking").

The commit in question changes avg_vruntime() from a function that is
a pure reader, to a function that updates variables. This turns an
unlocked sched/debug usage of this function from a minor mistake into
a data corruptor.

Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime")
Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: John Stultz <jstultz@google.com>
---
 kernel/sched/debug.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -902,6 +902,7 @@ static void print_rq(struct seq_file *m,
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
 	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+	u64 avruntime;
 	struct sched_entity *last, *first, *root;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
@@ -925,6 +926,7 @@ void print_cfs_rq(struct seq_file *m, in
 	if (last)
 		right_vruntime = last->vruntime;
 	zero_vruntime = cfs_rq->zero_vruntime;
+	avruntime = avg_vruntime(cfs_rq);
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
@@ -934,7 +936,7 @@ void print_cfs_rq(struct seq_file *m, in
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
 			SPLIT_NS(zero_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
-			SPLIT_NS(avg_vruntime(cfs_rq)));
+			SPLIT_NS(avruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
 			SPLIT_NS(right_vruntime));
 	spread = right_vruntime - left_vruntime;
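The pattern the patch applies (call the helper while the lock is held, print the saved copy afterwards) can be sketched in userspace C. This is only an illustrative toy, not kernel code: struct toy_rq, toy_avg(), toy_take_snapshot() and toy_print() are hypothetical names, a pthread mutex stands in for the rq lock, and the unweighted average is an assumption made to keep the arithmetic simple.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a cfs_rq; not the kernel structure. */
struct toy_rq {
	pthread_mutex_t lock;
	int64_t zero_vruntime;	/* reference point, rewritten by toy_avg() */
	int64_t sum_offsets;	/* sum of (vruntime - zero_vruntime) */
	int64_t nr;		/* number of queued entities */
};

struct toy_snapshot {
	int64_t zero;
	int64_t avg;
};

/*
 * Looks like a getter, but it re-centres zero_vruntime on the average
 * and adjusts sum_offsets to match, so the caller must hold rq->lock.
 */
static int64_t toy_avg(struct toy_rq *rq)
{
	int64_t delta;

	assert(rq->nr >= 0);
	delta = rq->nr ? rq->sum_offsets / rq->nr : 0;
	rq->zero_vruntime += delta;
	rq->sum_offsets -= delta * rq->nr;
	return rq->zero_vruntime;
}

/* Take every value, including the "computed" one, under the lock. */
static struct toy_snapshot toy_take_snapshot(struct toy_rq *rq)
{
	struct toy_snapshot s;

	pthread_mutex_lock(&rq->lock);
	s.zero = rq->zero_vruntime;
	s.avg = toy_avg(rq);
	pthread_mutex_unlock(&rq->lock);
	return s;
}

/* Formatting only touches the local copies, so no lock is needed here. */
static void toy_print(struct toy_rq *rq)
{
	struct toy_snapshot s = toy_take_snapshot(rq);

	printf("  .%-30s: %lld\n", "zero_vruntime", (long long)s.zero);
	printf("  .%-30s: %lld\n", "avg_vruntime", (long long)s.avg);
}
```

Calling toy_avg() from toy_print() after the unlock would reproduce the shape of the bug: two read-modify-write updates racing with whoever else holds the lock.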
Re: [PATCH 2/2] sched/debug: Fix avg_vruntime() usage
Posted by Vincent Guittot 6 hours ago
On Wed, 1 Apr 2026 at 15:24, Peter Zijlstra <peterz@infradead.org> wrote:
>
> John reported that stress-ng-yield could make his machine unhappy and
> managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix
> zero_vruntime tracking").
>
> The commit in question changes avg_vruntime() from a function that is
> a pure reader, to a function that updates variables. This turns an
> unlocked sched/debug usage of this function from a minor mistake into
> a data corruptor.
>
> Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime")
> Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking")
> Reported-by: John Stultz <jstultz@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: John Stultz <jstultz@google.com>
> ---
>  kernel/sched/debug.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -902,6 +902,7 @@ static void print_rq(struct seq_file *m,
>  void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>  {
>         s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
> +       u64 avruntime;
>         struct sched_entity *last, *first, *root;
>         struct rq *rq = cpu_rq(cpu);
>         unsigned long flags;
> @@ -925,6 +926,7 @@ void print_cfs_rq(struct seq_file *m, in
>         if (last)
>                 right_vruntime = last->vruntime;
>         zero_vruntime = cfs_rq->zero_vruntime;
> +       avruntime = avg_vruntime(cfs_rq);

Minor comment:
Do you intentionally save zero_vruntime before calling avg_vruntime(),
which will update zero_vruntime?
It could make sense to take a snapshot before it is modified by
print_cfs_rq(), but I'm afraid the call to debugfs will trigger an
update anyway before we save and display the value.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

>         raw_spin_rq_unlock_irqrestore(rq, flags);
>
>         SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
> @@ -934,7 +936,7 @@ void print_cfs_rq(struct seq_file *m, in
>         SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
>                         SPLIT_NS(zero_vruntime));
>         SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
> -                       SPLIT_NS(avg_vruntime(cfs_rq)));
> +                       SPLIT_NS(avruntime));
>         SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
>                         SPLIT_NS(right_vruntime));
>         spread = right_vruntime - left_vruntime;
>
>
Re: [PATCH 2/2] sched/debug: Fix avg_vruntime() usage
Posted by Peter Zijlstra 4 hours ago
On Wed, Apr 01, 2026 at 04:13:06PM +0200, Vincent Guittot wrote:
> On Wed, 1 Apr 2026 at 15:24, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > John reported that stress-ng-yield could make his machine unhappy and
> > managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix
> > zero_vruntime tracking").
> >
> > The commit in question changes avg_vruntime() from a function that is
> > a pure reader, to a function that updates variables. This turns an
> > unlocked sched/debug usage of this function from a minor mistake into
> > a data corruptor.
> >
> > Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime")
> > Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking")
> > Reported-by: John Stultz <jstultz@google.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Tested-by: John Stultz <jstultz@google.com>
> > ---
> >  kernel/sched/debug.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -902,6 +902,7 @@ static void print_rq(struct seq_file *m,
> >  void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> >  {
> >         s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
> > +       u64 avruntime;
> >         struct sched_entity *last, *first, *root;
> >         struct rq *rq = cpu_rq(cpu);
> >         unsigned long flags;
> > @@ -925,6 +926,7 @@ void print_cfs_rq(struct seq_file *m, in
> >         if (last)
> >                 right_vruntime = last->vruntime;
> >         zero_vruntime = cfs_rq->zero_vruntime;
> > +       avruntime = avg_vruntime(cfs_rq);
> 
> Minor comment:
> Do you intentionally save zero_vruntime before calling avg_vruntime(),
> which will update zero_vruntime?
> It could make sense to take a snapshot before it is modified by
> print_cfs_rq(), but I'm afraid the call to debugfs will trigger an
> update anyway before we save and display the value.

Intentional might be a big word, but yeah, printing the same value twice
seemed pointless. This way you can at least see where it came from or
something.

> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

Thanks!
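
The ordering point discussed above can be shown with a small userspace model. It assumes, as the reply suggests, that the averaging helper leaves zero_vruntime equal to the value it returns; struct demo_rq, demo_avg(), read_zero_first() and read_avg_first() are hypothetical names and the unweighted average is a simplification, so treat this as a sketch rather than the kernel's arithmetic.

```c
#include <stdint.h>

/* Hypothetical model of the two fields involved; not the kernel structure. */
struct demo_rq {
	int64_t zero_vruntime;
	int64_t sum_offsets;	/* sum of (vruntime - zero_vruntime) */
	int64_t nr;
};

struct demo_pair {
	int64_t zero;
	int64_t avg;
};

/* Re-centres zero_vruntime on the average and returns the new value. */
static int64_t demo_avg(struct demo_rq *rq)
{
	int64_t delta = rq->nr ? rq->sum_offsets / rq->nr : 0;

	rq->zero_vruntime += delta;
	rq->sum_offsets -= delta * rq->nr;
	return rq->zero_vruntime;
}

/* Order used in the patch: read zero_vruntime, then call the helper. */
static struct demo_pair read_zero_first(struct demo_rq rq)
{
	struct demo_pair p;

	p.zero = rq.zero_vruntime;
	p.avg = demo_avg(&rq);
	return p;
}

/* Reversed order: the helper has already moved zero_vruntime. */
static struct demo_pair read_avg_first(struct demo_rq rq)
{
	struct demo_pair p;

	p.avg = demo_avg(&rq);
	p.zero = rq.zero_vruntime;
	return p;
}
```

Both functions take the structure by value, so each works on its own copy. With the first order the two printed fields can differ and show how far the reference point moved; with the second they are always identical in this model.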