John reported that stress-ng-yield could make his machine unhappy and
managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix
zero_vruntime tracking").

The commit in question changes avg_vruntime() from a function that is
a pure reader, to a function that updates variables. This turns an
unlocked sched/debug usage of this function from a minor mistake into
a data corruptor.

Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime")
Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: John Stultz <jstultz@google.com>
---
kernel/sched/debug.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -902,6 +902,7 @@ static void print_rq(struct seq_file *m,
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+ u64 avruntime;
struct sched_entity *last, *first, *root;
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
@@ -925,6 +926,7 @@ void print_cfs_rq(struct seq_file *m, in
if (last)
right_vruntime = last->vruntime;
zero_vruntime = cfs_rq->zero_vruntime;
+ avruntime = avg_vruntime(cfs_rq);
raw_spin_rq_unlock_irqrestore(rq, flags);
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
@@ -934,7 +936,7 @@ void print_cfs_rq(struct seq_file *m, in
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "zero_vruntime",
SPLIT_NS(zero_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "avg_vruntime",
- SPLIT_NS(avg_vruntime(cfs_rq)));
+ SPLIT_NS(avruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "right_vruntime",
SPLIT_NS(right_vruntime));
spread = right_vruntime - left_vruntime;
On Wed, 1 Apr 2026 at 15:24, Peter Zijlstra <peterz@infradead.org> wrote:
>
> John reported that stress-ng-yield could make his machine unhappy and
> managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix
> zero_vruntime tracking").
>
> The commit in question changes avg_vruntime() from a function that is
> a pure reader, to a function that updates variables. This turns an
> unlocked sched/debug usage of this function from a minor mistake into
> a data corruptor.
>
> Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime")
> Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking")
> Reported-by: John Stultz <jstultz@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: John Stultz <jstultz@google.com>
> ---
> kernel/sched/debug.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -902,6 +902,7 @@ static void print_rq(struct seq_file *m,
> void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> {
> s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
> + u64 avruntime;
> struct sched_entity *last, *first, *root;
> struct rq *rq = cpu_rq(cpu);
> unsigned long flags;
> @@ -925,6 +926,7 @@ void print_cfs_rq(struct seq_file *m, in
> if (last)
> right_vruntime = last->vruntime;
> zero_vruntime = cfs_rq->zero_vruntime;
> + avruntime = avg_vruntime(cfs_rq);

Minor comment:
Do you intentionally save zero_vruntime before calling avg_vruntime(),
which will update zero_vruntime?
It could make sense to take a snapshot before it is modified by
print_cfs_rq(), but I'm afraid the call to debugfs will trigger an
update anyway before we save and display the value.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> raw_spin_rq_unlock_irqrestore(rq, flags);
>
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
> @@ -934,7 +936,7 @@ void print_cfs_rq(struct seq_file *m, in
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "zero_vruntime",
> SPLIT_NS(zero_vruntime));
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "avg_vruntime",
> - SPLIT_NS(avg_vruntime(cfs_rq)));
> + SPLIT_NS(avruntime));
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "right_vruntime",
> SPLIT_NS(right_vruntime));
> spread = right_vruntime - left_vruntime;
>
>
On Wed, Apr 01, 2026 at 04:13:06PM +0200, Vincent Guittot wrote:
> On Wed, 1 Apr 2026 at 15:24, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > John reported that stress-ng-yield could make his machine unhappy and
> > managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix
> > zero_vruntime tracking").
> >
> > The commit in question changes avg_vruntime() from a function that is
> > a pure reader, to a function that updates variables. This turns an
> > unlocked sched/debug usage of this function from a minor mistake into
> > a data corruptor.
> >
> > Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime")
> > Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking")
> > Reported-by: John Stultz <jstultz@google.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Tested-by: John Stultz <jstultz@google.com>
> > ---
> > kernel/sched/debug.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -902,6 +902,7 @@ static void print_rq(struct seq_file *m,
> > void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> > {
> > s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
> > + u64 avruntime;
> > struct sched_entity *last, *first, *root;
> > struct rq *rq = cpu_rq(cpu);
> > unsigned long flags;
> > @@ -925,6 +926,7 @@ void print_cfs_rq(struct seq_file *m, in
> > if (last)
> > right_vruntime = last->vruntime;
> > zero_vruntime = cfs_rq->zero_vruntime;
> > + avruntime = avg_vruntime(cfs_rq);
>
> Minor comment:
> Do you intentionally save zero_vruntime before calling avg_vruntime()
> which will update zero_vruntime ?
> That could make sense to take a snapshot before being modified by
> print_cfs_rq() but I'm afraid the call to debugfs will anyway trigger
> an update before we save and display the value
Intentional might be a big word, but yeah, printing the same value twice
seemed pointless. This way you can at least see where it came from or
something.
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Thanks!
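
For readers following the thread, the pattern the patch applies (take every value under the lock, print from local copies afterwards) can be sketched in plain userspace C. This is only a toy model under stated assumptions: struct toy_rq, toy_lock(), toy_avg_vruntime() and toy_print_snapshot() are hypothetical names, the atomic_flag spinlock stands in for the rq lock, and the unweighted average is a simplification, not the kernel's arithmetic. It also shows the ordering discussed above: zero_vruntime is copied before the call that may rewrite it, so the two printed values can differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cfs_rq; not the kernel's layout. */
struct toy_rq {
	atomic_flag lock;      /* toy spinlock standing in for the rq lock */
	int64_t zero_vruntime; /* reference point, rewritten by the "getter" */
	int64_t sum_offsets;   /* sum of (vruntime - zero_vruntime) */
	int64_t nr;            /* number of queued entities */
};

/* Local copies that are safe to format after the lock is dropped. */
struct toy_snapshot {
	int64_t zero_vruntime;
	int64_t avg_vruntime;
};

static void toy_lock(struct toy_rq *rq)
{
	while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_rq *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/*
 * Looks like a getter but writes shared state: it re-centres
 * zero_vruntime on the (unweighted) average. Caller must hold rq->lock.
 */
static int64_t toy_avg_vruntime(struct toy_rq *rq)
{
	int64_t delta;

	assert(rq->nr >= 0);
	delta = rq->nr ? rq->sum_offsets / rq->nr : 0;

	rq->zero_vruntime += delta;
	rq->sum_offsets -= delta * rq->nr;
	return rq->zero_vruntime;
}

/* Collect everything under the lock; the caller prints from the copy. */
static struct toy_snapshot toy_print_snapshot(struct toy_rq *rq)
{
	struct toy_snapshot s;

	toy_lock(rq);
	s.zero_vruntime = rq->zero_vruntime;   /* value before the update */
	s.avg_vruntime = toy_avg_vruntime(rq); /* may rewrite zero_vruntime */
	toy_unlock(rq);

	return s;
}
```

Calling toy_avg_vruntime() from the print path after toy_unlock(), as the pre-patch debug code effectively did with the real function, would let its writes race with any other CPU that holds the lock and updates the same fields.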