sched/eevdf: Fix NULL deref when avg_vruntime nears overflow

[PATCH EEVDF NULL deref] sched/eevdf: Fix NULL deref when avg_vruntime nears overflow

Posted by Zicheng Qu 3 months ago

Fix a kernel NULL pointer dereference in pick_next_task_fair() caused by
EEVDF scheduler arithmetic overflows when cfs_rq->avg_vruntime
approaches the s64 low.

The issue occurs when:
1. cfs_rq->avg_vruntime is driven downward by dynamic reweight
   operations on se->vruntime combined with frequent enqueue/dequeue of
another sched_entity with large se->vlag values. Note that the presence
of only one other sched_entity (besides the current one) is critical
because having more would average out the effect and prevent the
continuous and rapid decrease of cfs_rq->avg_vruntime.
2. These factors `reweight` and `frequent enqueue/dequeue` persistently
   suppress cfs_rq->min_vruntime, causing cfs_rq->avg_vruntime to
decrease rapidly toward S64_MIN.
3. In vruntime_eligible(), the calculation (int64_t)(vruntime -
   cfs_rq->min_vruntime) * load may overflow downward, becoming a large
positive value.
4. This causes vruntime_eligible() to incorrectly judge all tasks as
   ineligible, leading to NULL pointer dereference in
pick_next_task_fair().

The fix addresses this by adjusting the current sched_entity's vruntime
during reweight operations when:
- The entity is cfs_rq->curr and the only running task
- The entity is on the runqueue
- Its vruntime is below min_vruntime

The most straightforward fix would be to adjust the vruntime during
dequeue, but that would require checking and possibly modifying the
curr's vruntime on every dequeue, which has a broader impact and
concurrency concerns. Therefore, we choose to apply the fix in the
reweight path, which is one of the necessary conditions for the problem
to occur.

BUG: kernel NULL pointer dereference, address: 00000000000000a0
RIP: 0010:pick_next_task_fair+0x39b/0xab03

KERNEL: vmlinux  [TAINTED]
DUMPFILE: 127.0.0.1-2025-10-30-13:52:24/vmcore  [PARTIAL DUMP]
CPUS: 4
DATE: Thu Oct 30 05:52:18 UTC 2025
UPTIME: 02:02:50
LOAD AVERAGE: 15.00, 15.00, 15.00
TASKS: 151
NODENAME: SangforOS.localdomain
RELEASE: 6.6.0+
VERSION: #4 SMP Thu Oct 30 11:25:11 CST 2025
MACHINE: x86_64  (2194 Mhz)
MEMORY: 4 GB
PANIC: "Oops: 0000 [#1] SMP PTI" (check log for details)
 PID: 4702
COMMAND: "test_sched_2/-1"
TASK: ffff8881362dcf80  [THREAD_INFO: ffff8881362dcf80]
 CPU: 1
STATE: TASK_UNINTERRUPTIBLE (PANIC)

crash> bt
PID: 4702   TASK: ffff8881362dcf80  CPU: 1   COMMAND: "test_sched_2/-1"
 #0 [ffffc90000fffab0] machine_kexec at ffffffffb567e767
 #1 [ffffc90000fffb10] __crash_kexec at ffffffffb580474a
 #2 [ffffc90000fffbd0] crash_kexec at ffffffffb5805768
 #3 [ffffc90000fffbd8] oops_end at ffffffffb5639599
 #4 [ffffc90000fffbf8] page_fault_oops at ffffffffb56954a8
 #5 [ffffc90000fffc50] exc_page_fault at ffffffffb63424a9
 #6 [ffffc90000fffcb0] asm_exc_page_fault at ffffffffb6400c12
    [exception RIP: pick_next_task_fair+923]
    RIP: ffffffffb576f22b  RSP: ffffc90000fffd60  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ffff8881340b4d80  RCX: 82a3cdbe7f1c7aed
    RDX: 01721730951583fc  RSI: 0000000000015f5f  RDI: 00105468401dc9e3
    RBP: ffffc90000fffe18   R8: 00000000000003fa   R9: 0000000000000002
    R10: 0000000000000002  R11: 0000000000000064  R12: ffff8881362dcf80
    R13: ffffc90000fffdc0  R14: ffff8881340b4e00  R15: ffff8881340b4e00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #7 [ffffc90000fffdb0] __schedule at ffffffffb6348cc8
 #8 [ffffc90000fffe20] schedule at ffffffffb63493ab
 #9 [ffffc90000fffe38] schedule_timeout at ffffffffb634eeaf
crash>
crash>
crash> p runqueues
PER-CPU DATA TYPE:
  struct rq runqueues;
PER-CPU ADDRESSES:
  [0]: ffff888134034d80
  [1]: ffff8881340b4d80
  [2]: ffff888134134d80
  [3]: ffff8881341b4d80
crash>
crash> struct -o rq.cfs ffff8881340b4d80
struct rq {
  [ffff8881340b4e00] struct cfs_rq cfs;
}
crash> struct cfs_rq.nr_running,curr,next,tasks_timeline,min_vruntime,avg_vruntime,avg_load,load,exec_clock ffff8881340b4e00
  nr_running = 3,
  curr = 0xffff888139b57c00,
  next = 0xffff888139b57c00,
  tasks_timeline = {
    rb_root = {
      rb_node = 0xffff8881362d80d0
    },
    rb_leftmost = 0xffff8881362d9b50
  },
  min_vruntime = 4596406356396515,
  avg_vruntime = -9137321448325056783,
  avg_load = 88933,
  load = {
    weight = 92109859,
    inv_weight = 0
  },
  exec_clock = 0,
crash> struct sched_entity.on_rq,deadline,min_vruntime,vruntime,load,vlag,slice,exec_start,sum_exec_runtime,prev_sum_exec_runtime,my_q,run_node 0xffff888139b57c00
  on_rq = 1,
  deadline = 4705706610399852,
  min_vruntime = 4493662477571149,
  vruntime = 4698735667604793,
  load = {
    weight = 1042467,
    inv_weight = 0
  },
  vlag = 4493662483537817,
  slice = 2250000,
  exec_start = 7308537586004,
  sum_exec_runtime = 7196457582967,
  prev_sum_exec_runtime = 7196456203065,
  my_q = 0xffff888139b55000,
  run_node = {
    __rb_parent_color = 1,
    rb_right = 0xffff8881362d80d0,
    rb_left = 0x0
  },
crash> struct sched_entity.deadline,min_vruntime,vruntime,load,vlag,slice,exec_start,sum_exec_runtime,prev_sum_exec_runtime,my_q,run_node -l sched_entity.run_node 0xffff8881362d80d0
  deadline = 4493662533339551,
  min_vruntime = 4493662476669436,
  vruntime = 4493662519944203,
  load = {
    weight = 176128,
    inv_weight = 24970740
  },
  vlag = 4493662519002535,
  slice = 2250000,
  exec_start = 7308527703195,
  sum_exec_runtime = 4759831,
  prev_sum_exec_runtime = 2351660,
  my_q = 0x0,
  run_node = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0xffff8881362d9b50
  },
crash> struct sched_entity.deadline,min_vruntime,vruntime,load,vlag,slice,exec_start,sum_exec_runtime,prev_sum_exec_runtime,my_q,run_node -l sched_entity.run_node 0xffff8881362d9b50
  deadline = 4493662476695393,
  min_vruntime = 4493662476669436,
  vruntime = 4493662476669436,
  load = {
    weight = 90891264,
    inv_weight = 48388
  },
  vlag = 51914,
  slice = 2250000,
  exec_start = 7308536206102,
  sum_exec_runtime = 2102797408,
  prev_sum_exec_runtime = 2102198648,
  my_q = 0x0,
  run_node = {
    __rb_parent_color = 18446612687273951440,
    rb_right = 0x0,
    rb_left = 0x0
  },
crash>

In vruntime_eligible():
	for sched_entity curr [0xffff888139b57c00]: 	avg [-9033150209515029779], (int64_t)(vruntime - cfs_rq->min_vruntime) * load [9204623872495814378], so return false
	for sched_entity root [0xffff8881362d80d0]: 	avg [-9033150209515029779], (int64_t)(vruntime - cfs_rq->min_vruntime) * load [9204833240987634904], so return false
	for sched_entity leftmost [0xffff8881362d9b50]: avg [-9033150209515029779], (int64_t)(vruntime - cfs_rq->min_vruntime) * load [9204829348379068487], so return false
Therefore, all sched_entities on this cfs_rq have no eligibility to run
to cause the NULL pointer dereference in pick_next_task_fair().

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: wulibin163 <wulibin163@126.com>
---
 kernel/sched/fair.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb279..963802d43f4e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3772,6 +3772,14 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 {
 	bool curr = cfs_rq->curr == se;
 
+	if (curr && se->on_rq && cfs_rq->nr_queued == 1 &&
+		se->vruntime < cfs_rq->min_vruntime) {
+		s64 rel_deadline = se->deadline - se->vruntime;
+
+		se->vruntime = cfs_rq->min_vruntime;
+		se->deadline = se->vruntime + rel_deadline;
+	}
+
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		update_curr(cfs_rq);
-- 
2.34.1

Re: [PATCH EEVDF NULL deref] sched/eevdf: Fix NULL deref when avg_vruntime nears overflow

Posted by Peter Zijlstra 3 months ago

On Mon, Nov 03, 2025 at 01:04:41PM +0000, Zicheng Qu wrote:
> Fix a kernel NULL pointer dereference in pick_next_task_fair() caused by
> EEVDF scheduler arithmetic overflows when cfs_rq->avg_vruntime
> approaches the s64 low.
> 
> The issue occurs when:
> 1. cfs_rq->avg_vruntime is driven downward by dynamic reweight
>    operations on se->vruntime combined with frequent enqueue/dequeue of
> another sched_entity with large se->vlag values. Note that the presence
> of only one other sched_entity (besides the current one) is critical
> because having more would average out the effect and prevent the
> continuous and rapid decrease of cfs_rq->avg_vruntime.
> 2. These factors `reweight` and `frequent enqueue/dequeue` persistently
>    suppress cfs_rq->min_vruntime, causing cfs_rq->avg_vruntime to
> decrease rapidly toward S64_MIN.
> 3. In vruntime_eligible(), the calculation (int64_t)(vruntime -
>    cfs_rq->min_vruntime) * load may overflow downward, becoming a large
> positive value.

Right, so the problem is that min_vruntime doesn't follow (it only wants
to go up). I should have a patch for that, let me go find.

/me rummages in the patch drawer for a bit... 

Also, do you have a reproducer for this? Your crash output says it is
running something called 'sched_test_2'.

The below should work (unless you have CONFIG_SCHED_CORE=y), could you
please try? If it does work, I suppose I'll finally have to sit down
and work out the SCHED_CORE side of this.

(Note the changelog doesn't talk about your problem, it was from a
previous issue and I just haven't bothered fixing it up)

---
Subject: sched/fair: Replace cfs_rq::min_vruntime

Basically, from the constraint that the sum of lag is zero, you can
infer that the 0-lag point is the weighted average of the individual
vruntime, which is what we're trying to compute:

        \Sum w_i * v_i
  avg = --------------
           \Sum w_i

Now, since vruntime takes the whole u64 (worse, it wraps), this
multiplication term in the numerator is not something we can compute;
instead we do the min_vruntime (v0 henceforth) thing like:

  v_i = (v_i - v0) + v0

(edit -- this does two things:
 - it keeps the key 'small';
 - it creates a relative 0-point in the modular space)

If you do that subtitution and work it all out, you end up with:

        \Sum w_i * (v_i - v0)
  avg = --------------------- + v0
              \Sum w_i

Since you cannot very well track a ratio like that (and not suffer
terrible numerical problems) we simpy track the numerator and
denominator individually and only perform the division when strictly
needed.

Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator
lives in cfs_rq->avg_load.

The one extra 'funny' is that these numbers track the entities in the
tree, and current is typically outside of the tree, so avg_vruntime()
adds current when needed before doing the division.

(vruntime_eligible() elides the division by cross-wise multiplication)

Anyway... we got s64 to track avg_vruntime, that sum over weighted keys,
if you have a ton of entities and large keys and large weights (your
case) you can overflow and things go to shit.

Anyway, seeing how your min_vruntime is weird, let me ask you to try the
below; it removes the old min_vruntime and instead tracks zero vruntime
as the 'current' avg_vruntime. We don't need the monotinicity filter,
all we really need is something 'near' all the other vruntimes in order
to compute this relative key so we can preserve order across the wrap.

This *should* get us near minimal sized keys. If you can still
reproduce, you should probably add something like that patch I send you
privately earlier, that checks the overflows.

The below builds and boots (provided SCHED_CORE=n).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |  8 +++---
 kernel/sched/fair.c  | 75 +++++++++++-----------------------------------------
 kernel/sched/sched.h |  2 +-
 3 files changed, 20 insertions(+), 65 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 02e16b70a790..41caa22e0680 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -796,7 +796,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
 	struct sched_entity *last, *first, *root;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
@@ -819,15 +819,15 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	last = __pick_last_entity(cfs_rq);
 	if (last)
 		right_vruntime = last->vruntime;
-	min_vruntime = cfs_rq->min_vruntime;
+	zero_vruntime = cfs_rq->zero_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
 			SPLIT_NS(left_deadline));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
 			SPLIT_NS(left_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
-			SPLIT_NS(min_vruntime));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
+			SPLIT_NS(zero_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
 			SPLIT_NS(avg_vruntime(cfs_rq)));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb279..1e5552469146 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -554,7 +554,7 @@ static inline bool entity_before(const struct sched_entity *a,
 
 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+	return (s64)(se->vruntime - cfs_rq->zero_vruntime);
 }
 
 #define __node_2_se(node) \
@@ -606,13 +606,13 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  * Which we track using:
  *
- *                    v0 := cfs_rq->min_vruntime
+ *                    v0 := cfs_rq->zero_vruntime
  * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
  *              \Sum w_i := cfs_rq->avg_load
  *
- * Since min_vruntime is a monotonic increasing variable that closely tracks
- * the per-task service, these deltas: (v_i - v), will be in the order of the
- * maximal (virtual) lag induced in the system due to quantisation.
+ * Since zero_vruntime closely tracks the per-task service, these
+ * deltas: (v_i - v), will be in the order of the maximal (virtual) lag
+ * induced in the system due to quantisation.
  *
  * Also, we use scale_load_down() to reduce the size.
  *
@@ -671,7 +671,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		avg = div_s64(avg, load);
 	}
 
-	return cfs_rq->min_vruntime + avg;
+	return cfs_rq->zero_vruntime + avg;
 }
 
 /*
@@ -732,7 +732,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
+	return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -740,42 +740,14 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }
 
-static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+static void update_zero_vruntime(struct cfs_rq *cfs_rq)
 {
-	u64 min_vruntime = cfs_rq->min_vruntime;
-	/*
-	 * open coded max_vruntime() to allow updating avg_vruntime
-	 */
-	s64 delta = (s64)(vruntime - min_vruntime);
-	if (delta > 0) {
-		avg_vruntime_update(cfs_rq, delta);
-		min_vruntime = vruntime;
-	}
-	return min_vruntime;
-}
+	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime);
 
-static void update_min_vruntime(struct cfs_rq *cfs_rq)
-{
-	struct sched_entity *se = __pick_root_entity(cfs_rq);
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 vruntime = cfs_rq->min_vruntime;
+	avg_vruntime_update(cfs_rq, delta);
 
-	if (curr) {
-		if (curr->on_rq)
-			vruntime = curr->vruntime;
-		else
-			curr = NULL;
-	}
-
-	if (se) {
-		if (!curr)
-			vruntime = se->min_vruntime;
-		else
-			vruntime = min_vruntime(vruntime, se->min_vruntime);
-	}
-
-	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
+	cfs_rq->zero_vruntime = vruntime;
 }
 
 static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
@@ -848,6 +820,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 	se->min_vruntime = se->vruntime;
 	se->min_slice = se->slice;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
@@ -859,6 +832,7 @@ static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				  &min_vruntime_cb);
 	avg_vruntime_sub(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 }
 
 struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
@@ -1226,7 +1200,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	resched = update_deadline(cfs_rq, curr);
-	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
 		/*
@@ -3808,15 +3781,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
 		cfs_rq->nr_queued++;
-
-		/*
-		 * The entity's vruntime has been adjusted, so let's check
-		 * whether the rq-wide min_vruntime needs updated too. Since
-		 * the calculations above require stable min_vruntime rather
-		 * than up-to-date one, we do the update at the end of the
-		 * reweight process.
-		 */
-		update_min_vruntime(cfs_rq);
 	}
 }
 
@@ -5429,15 +5393,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_cfs_group(se);
 
-	/*
-	 * Now advance min_vruntime if @se was the entity holding it back,
-	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
-	 * put back on, and if we advance min_vruntime, we'll be placed back
-	 * further than we started -- i.e. we'll be penalized.
-	 */
-	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
-		update_min_vruntime(cfs_rq);
-
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);
 
@@ -13322,7 +13277,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
-	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+	cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4838dda75b10..5e657f89d4ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -682,7 +682,7 @@ struct cfs_rq {
 	s64			avg_vruntime;
 	u64			avg_load;
 
-	u64			min_vruntime;
+	u64			zero_vruntime;
 #ifdef CONFIG_SCHED_CORE
 	unsigned int		forceidle_seq;
 	u64			min_vruntime_fi;

Re: [PATCH EEVDF NULL deref] sched/eevdf: Fix NULL deref when avg_vruntime nears overflow

Posted by Zicheng Qu 3 months ago

 From 70bd6ca00976ab8ff42133f8189bc0a9ca77f62e Mon Sep 17 00:00:00 2001
From: Zicheng Qu <quzicheng@huawei.com>
Date: Tue, 4 Nov 2025 02:05:33 +0000
Subject: [PATCH for EEVDF NULL POINTER reproduction] EEVDF null pointer
  reproduction
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

 > Right, so the problem is that min_vruntime doesn't follow (it only wants
 > to go up). I should have a patch for that, let me go find.
 >
 > /me rummages in the patch drawer for a bit...
 >
 > Also, do you have a reproducer for this? Your crash output says it is
 > running something called 'sched_test_2'.

This is the reproduction patch. We encountered a similar issue on a
normally running kernel, and after analysis, we deliberately constructed
a comparable scenario. Since it is difficult to control the conditions
from user space, we proactively modified certain parts of the kernel
code to simulate the issue (the code includes detailed explanations).

For the detailed reproduction case, please refer to the path
null_reproduction_test. First, run ./make.sh, and then execute ./test.sh
to start the test. We have tried different machines but have not yet
identified the root cause — on some machines, the issue cannot be
reproduced, while on others it may occur within a few hours.

We need cfs_rq->avg_vruntime to drop near to the lower limit of s64
before the issue can potentially be reproduced. If other sched_entity
instances appear on the same cfs_rq, keeping its avg_vruntime stabilized
around a certain value, then it becomes difficult for the problem to
occur.

 > The below should work (unless you have CONFIG_SCHED_CORE=y), could you
 > please try? If it does work, I suppose I'll finally have to sit down
 > and work out the SCHED_CORE side of this.
 >
 > (Note the changelog doesn't talk about your problem, it was from a
 > previous issue and I just haven't bothered fixing it up)

As for the zero_vruntime patch, after applying it, we have not
observed any crashes over than 24 hours.

Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: wulibin163 <wulibin163@126.com>
---
  kernel/sched/fair.c                 | 106 ++++++++++++++++++++-
  kernel/sched/sched.h                |   3 +
  null_reproduction_test/Makefile     |   9 ++
  null_reproduction_test/fullcpu.c    |  12 +++
  null_reproduction_test/make.sh      |  17 ++++
  null_reproduction_test/test.sh      |  32 +++++++
  null_reproduction_test/test_sched.c | 141 ++++++++++++++++++++++++++++
  7 files changed, 319 insertions(+), 1 deletion(-)
  create mode 100644 null_reproduction_test/Makefile
  create mode 100644 null_reproduction_test/fullcpu.c
  create mode 100755 null_reproduction_test/make.sh
  create mode 100755 null_reproduction_test/test.sh
  create mode 100644 null_reproduction_test/test_sched.c

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb279..4e9fdfc732b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -58,6 +58,17 @@
  #include "stats.h"
  #include "autogroup.h"

+static int se_schedule_pid = 0; // the pid of task `test_sched_0` 
started in test_sched.c.
+module_param(se_schedule_pid, int, 0644);
+static int qzc_vlag_switch = 0; // switch to control the vlag for 
test_sched_0 in place_entity()
+module_param(qzc_vlag_switch, int, 0644);
+static int qzc_fixed_switch = 0; // switch to control whether to apply 
the old fix patch, not zero_vruntime patch
+module_param(qzc_fixed_switch, int, 0644);
+#define __FILENAME__ (__builtin_strrchr(__FILE__, '/') ? 
__builtin_strrchr(__FILE__, '/') + 1 : __FILE__)
+#define ENQUEUE_ENTITY_NONE 0
+#define ENQUEUE_ENTITY_BEGIN 1
+#define ENQUEUE_ENTITY_END 2
+
  /*
   * The initial- and re-scaling of tunables is configurable
   *
@@ -702,6 +713,16 @@ static void update_entity_lag(struct cfs_rq 
*cfs_rq, struct sched_entity *se)
      se->vlag = clamp(vlag, -limit, limit);
  }

+static s64 entity_lag(u64 avruntime, struct sched_entity *se)
+{
+    s64 vlag, limit;
+
+    vlag = avruntime - se->vruntime;
+    limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+
+    return clamp(vlag, -limit, limit);
+}
+
  /*
   * Entity is eligible once it received less service than it ought to have,
   * eg. lag >= 0.
@@ -3945,7 +3966,7 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
   * Recomputes the group entity based on the current state of its group
   * runqueue.
   */
-static void update_cfs_group(struct sched_entity *se)
+static void __update_cfs_group(struct sched_entity *se, int flag)
  {
      struct cfs_rq *gcfs_rq = group_cfs_rq(se);
      long shares;
@@ -3958,10 +3979,21 @@ static void update_cfs_group(struct sched_entity 
*se)
          return;

      shares = calc_group_shares(gcfs_rq);
+
+    if (flag == ENQUEUE_ENTITY_BEGIN)  // enqueue begin
+        shares = 111616;
+    else if (flag == ENQUEUE_ENTITY_END) // enqueue end
+        shares = 395264;
+
      if (unlikely(se->load.weight != shares))
          reweight_entity(cfs_rq_of(se), se, shares);
  }

+static void update_cfs_group(struct sched_entity *se)
+{
+       __update_cfs_group(se, ENQUEUE_ENTITY_NONE);
+}
+
  #else /* !CONFIG_FAIR_GROUP_SCHED: */
  static inline void update_cfs_group(struct sched_entity *se)
  {
@@ -5143,6 +5175,16 @@ place_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *se, int flags)
          struct sched_entity *curr = cfs_rq->curr;
          unsigned long load;

+        /*
+         * To make the avg_vruntime() and cfs_rq->avg_vruntime lower 
and lower:
+         * The original goal is to migrate a large number (countless) 
of test_sched_0 type tasks
+         * with very positive high vlag one by one to a specific cfs_rq.
+         * However, it is difficult to control from user space,
+         * so we directly simulate it here to achieve this.
+         */
+        if (qzc_vlag_switch != 0 && se_schedule_pid > 0 && 
entity_is_task(se) && (task_of(se)->pid == se_schedule_pid))
+             se->vlag = qzc_vlag_switch == 1 ? 
calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se) : qzc_vlag_switch;
+
          lag = se->vlag;

          /*
@@ -5240,6 +5282,19 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *se, int flags)
  {
      bool curr = cfs_rq->curr == se;

+    /*
+     * At the very beginning and end of the enqueue process for task 
`test_sched_0`,
+     * we want to adjust the weight/shares of cfs_rq->curr simultaneously,
+     * which is actually the task `fullcpu` from test.sh in most cases.
+     *
+     * However, it is quite challenging to control from user space,
+     * so we intend to simulate this behavior here instead.
+     */
+    if (se_schedule_pid > 0 && entity_is_task(se) && (task_of(se)->pid 
== se_schedule_pid)) {
+         if (cfs_rq->curr)
+            __update_cfs_group(cfs_rq->curr, ENQUEUE_ENTITY_BEGIN);
+    }
+
      /*
       * If we're the current task, we must renormalise before calling
       * update_curr().
@@ -5299,6 +5354,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *se, int flags)
          }
  #endif
      }
+
+    if (se_schedule_pid > 0 && entity_is_task(se) && (task_of(se)->pid 
== se_schedule_pid)) {
+        if (cfs_rq->curr)
+            __update_cfs_group(cfs_rq->curr, ENQUEUE_ENTITY_END);
+    }
  }

  static void __clear_buddies_next(struct sched_entity *se)
@@ -13323,6 +13383,17 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
  {
      cfs_rq->tasks_timeline = RB_ROOT_CACHED;
      cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+
+    /*
+     * We suppose the original intention of (u64)(-(1LL << 20)) was 
likely to
+     * force cfs_rq->min_vruntime to overflow as quickly as possible,
+     * thereby exposing related overflow issues early during the kernel 
initial phase.
+     *
+     * To accelerate the reproduction of these issues,
+     * we have temporarily modified the initial value of 
cfs_rq->min_vruntime.
+     */
+    cfs_rq->min_vruntime = (u64)(4596393947272479);
+
      raw_spin_lock_init(&cfs_rq->removed.lock);
  }

@@ -13728,3 +13799,36 @@ __init void init_sched_fair_class(void)
      zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
  #endif
  }
+
+u64 sched_debug_min_vruntime(struct cfs_rq *cfs_rq)
+{
+    return cfs_rq->min_vruntime;
+}
+EXPORT_SYMBOL(sched_debug_min_vruntime);
+
+void sched_debug_cfs_rq_info(struct cfs_rq *cfs_rq)
+{
+    u64 qzc_avruntime = avg_vruntime(cfs_rq);
+
+    printk("%s:%s:%d, 
cfs_rq=[%p]\tcfs_rq->nr_queued=[%d]\tcfs_rq->avg_vruntime=[%lld]\tcfs_rq->min_vruntime=[%llu]\tcfs_rq->avg_load=[%llu]\tavg_vruntime(cfs_rq)=[%llu]\n", 

+        __FILENAME__,__FUNCTION__, __LINE__,
+        cfs_rq, cfs_rq->nr_queued, cfs_rq->avg_vruntime, 
cfs_rq->min_vruntime, cfs_rq->avg_load, qzc_avruntime);
+
+    if (cfs_rq->curr) {
+        printk("%s:%s:%d, 
curr=[%p]\tpid=[%d]\ttgid=[%d]\tcurr->vruntime=[%llu]\tcurr->load.weight=[%lu]\tcurr->vlag=[%lld]\tcurr->slice=[%llu]\tcurr->deadline=[%llu]\tcurr->my_q=[%p]\treal_vlag=[%lld]\tvruntime_eligible=[%d]\n", 

+            __FILENAME__,__FUNCTION__, __LINE__,
+            cfs_rq->curr, entity_is_task(cfs_rq->curr) ? 
task_of(cfs_rq->curr)->pid : -1, entity_is_task(cfs_rq->curr) ? 
task_of(cfs_rq->curr)->tgid : -1,
+            cfs_rq->curr->vruntime, cfs_rq->curr->load.weight, 
cfs_rq->curr->vlag, cfs_rq->curr->slice, cfs_rq->curr->deadline, 
cfs_rq->curr->my_q, entity_lag(qzc_avruntime, cfs_rq->curr), 
vruntime_eligible(cfs_rq, cfs_rq->curr->vruntime));
+    }
+
+    struct rb_node *node = rb_first_cached(&cfs_rq->tasks_timeline);
+
+    for (; node; node = rb_next(node)) {
+        struct sched_entity *rb_se = __node_2_se(node);
+        printk("%s:%s:%d, 
rb_se=[%p]\tpid=[%d]\ttgid=[%d]\trb_se->vruntime=[%llu]\trb_se->load.weight=[%lu]\trb_se->vlag=[%lld]\trb_se->slice=[%llu]\trb_se->deadline=[%llu]\trb_se->my_q=[%p]\treal_vlag=[%lld]\tvruntime_eligible=[%d]\n", 

+            __FILENAME__,__FUNCTION__, __LINE__,
+            rb_se, entity_is_task(rb_se) ? task_of(rb_se)->pid : -1, 
entity_is_task(rb_se) ? task_of(rb_se)->tgid : -1,
+            rb_se->vruntime, rb_se->load.weight, rb_se->vlag, 
rb_se->slice, rb_se->deadline, rb_se->my_q, entity_lag(qzc_avruntime, 
rb_se), vruntime_eligible(cfs_rq, rb_se->vruntime));
+    }
+}
+EXPORT_SYMBOL(sched_debug_cfs_rq_info);
\ No newline at end of file
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adfb6e3409d7..50342e003b27 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3904,4 +3904,7 @@ void sched_enq_and_set_task(struct 
sched_enq_and_set_ctx *ctx);

  #include "ext.h"

+extern u64 sched_debug_min_vruntime(struct cfs_rq *cfs_rq);
+extern void sched_debug_cfs_rq_info(struct cfs_rq *cfs_rq);
+
  #endif /* _KERNEL_SCHED_SCHED_H */
diff --git a/null_reproduction_test/Makefile 
b/null_reproduction_test/Makefile
new file mode 100644
index 000000000000..48feb459e5ff
--- /dev/null
+++ b/null_reproduction_test/Makefile
@@ -0,0 +1,9 @@
+obj-m += test_sched.o
+KDIR := /lib/modules/$(shell uname -r)/build
+PWD := $(shell pwd)
+
+all:
+    $(MAKE) -C $(KDIR) M=$(PWD) modules
+
+clean:
+    $(MAKE) -C $(KDIR) M=$(PWD) clean
\ No newline at end of file
diff --git a/null_reproduction_test/fullcpu.c 
b/null_reproduction_test/fullcpu.c
new file mode 100644
index 000000000000..136c73671035
--- /dev/null
+++ b/null_reproduction_test/fullcpu.c
@@ -0,0 +1,12 @@
+#include <string.h>
+#include <unistd.h>
+
+int main()
+{
+    int a=9;
+    while(1) {
+            a*=9;
+    }
+
+    return 0;
+}
\ No newline at end of file
diff --git a/null_reproduction_test/make.sh b/null_reproduction_test/make.sh
new file mode 100755
index 000000000000..002385d17046
--- /dev/null
+++ b/null_reproduction_test/make.sh
@@ -0,0 +1,17 @@
+#!/bin/bash
+
+make clean
+
+cd ..
+
+make modules_prepare
+
+cd ./null_reproduction_test
+
+make -C ../ M=$(pwd)
+
+gcc fullcpu.c -o fullcpu
+
+echo "===================="
+echo 'please run test.sh'
+echo "===================="
\ No newline at end of file
diff --git a/null_reproduction_test/test.sh b/null_reproduction_test/test.sh
new file mode 100755
index 000000000000..d36de68c683b
--- /dev/null
+++ b/null_reproduction_test/test.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+
+test() {
+    cpu=$1
+    cgroup=test0
+
+    mkdir /sys/fs/cgroup/cpu/$cgroup/
+    mkdir /sys/fs/cgroup/memory/$cgroup/
+    echo 10000000 > /sys/fs/cgroup/memory/$cgroup/memory.limit_in_bytes
+
+    taskset -c $cpu ./fullcpu &
+    pid=$!
+
+    echo $pid > /sys/fs/cgroup/cpu/$cgroup/tasks
+    echo $pid > /sys/fs/cgroup/memory/$cgroup/tasks
+
+    let cpu1_count=0
+    for pid in $(ps -auxf | grep test_sched  | grep -v grep | grep -v 
test_sched_0 | grep -v test_sched_1 | awk '{print($2)}'); do
+            echo $pid > /sys/fs/cgroup/cpu/$cgroup/tasks
+    done
+}
+
+killall fullcpu
+rmmod test_sched
+insmod ./test_sched.ko bind_cpu=1 test_count=15
+
+pid0=$(ps -auxf | grep  'test_sched_0' | grep -v grep | awk '{print($2)}')
+echo $pid0 > /sys/module/fair/parameters/se_schedule_pid
+
+echo 1 > /sys/module/fair/parameters/qzc_vlag_switch
+
+test 1
diff --git a/null_reproduction_test/test_sched.c 
b/null_reproduction_test/test_sched.c
new file mode 100644
index 000000000000..7a33fa77c923
--- /dev/null
+++ b/null_reproduction_test/test_sched.c
@@ -0,0 +1,141 @@
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/sched.h>
+#include <linux/delay.h>
+#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/sched/task.h>
+
+static DECLARE_COMPLETION(comp);
+
+#define THREAD_NUM 100000
+static struct task_struct *schedule_threads[THREAD_NUM];
+static int bind_cpu = 0;
+module_param(bind_cpu, int, 0644);
+MODULE_PARM_DESC(bind_cpu, "CPU core to bind the thread to");
+
+static int test_count = 1;
+module_param(test_count, int, 0644);
+MODULE_PARM_DESC(test_count, "test thread count (default: 1)");
+
+static int sched_debug_cfs_rq_info_print_cnt = 0;
+
+static int thread_function(void *data);
+static void start_one_thread(int id, int cpu);
+
+static int __init schedule_driver_init(void)
+{
+    printk(KERN_INFO "Schedule driver: Initializing\n");
+
+    start_one_thread(0, bind_cpu);
+    start_one_thread(1, bind_cpu);
+    for (int i=2; i<test_count; i++)
+        start_one_thread(i, -1);
+
+    return 0;
+}
+
+struct thread_data {
+    int id;
+};
+
+static void start_one_thread(int id, int cpu)
+{
+    char name[255];
+    sprintf(name, "test_sched_%u/%d", id, cpu);
+
+    struct thread_data  *tdata = kmalloc(sizeof(struct thread_data), 
GFP_KERNEL);
+    tdata->id = id;
+
+    // create kthread but not run immediately
+    schedule_threads[id] = kthread_create(thread_function, tdata, name);
+    if (IS_ERR(schedule_threads[id])) {
+        schedule_threads[id] = 0;
+        printk("Failed to create %s, %ld\n", name, 
PTR_ERR(schedule_threads[id]));
+        return;
+    }
+
+    if (cpu > 0)
+        kthread_bind(schedule_threads[id], cpu);
+    // run the kthread
+    wake_up_process(schedule_threads[id]);
+
+    printk(KERN_INFO "create %s success\n", name);
+    return;
+}
+
+u64 sched_debug_min_vruntime(struct cfs_rq *cfs);
+void sched_debug_cfs_rq_info(struct cfs_rq *cfs_rq);
+
+static int thread_function(void *data)
+{
+    printk(KERN_INFO "Schedule thread: Started on CPU %d\n", 
smp_processor_id());
+    struct task_struct *task = current;
+
+    set_current_state(TASK_RUNNING);
+
+    struct thread_data  *tdata = data;
+    // test_sched_1 wait
+    if (tdata->id == 1) {
+        set_user_nice(task, 8);
+        wait_for_completion_interruptible(&comp);
+    }
+
+    while (!kthread_should_stop()) {
+        // test_sched_0 check the condition
+        if (tdata->id == 0) {
+            struct sched_entity *se = &task->se;
+            struct cfs_rq *cfs = se->cfs_rq;
+            u64 vruntime = se->vruntime;
+            u64 min_vruntime = sched_debug_min_vruntime(cfs);
+
+            if (sched_debug_cfs_rq_info_print_cnt % 10000 == 0) {
+                sched_debug_cfs_rq_info(cfs);
+            }
+            sched_debug_cfs_rq_info_print_cnt += 1;
+
+            if (-102743846405689 > (s64)(vruntime - min_vruntime)) {
+                int old_nice = task_nice(task);
+                set_user_nice(task, -20);
+
+                complete(&comp); // wake up test_sched_1
+                printk("vruntime: %llu, min_vruntime: %llu, renice: 
%d->%d\n",
+                    vruntime, min_vruntime, old_nice, -20);
+            }
+        } else if (tdata->id == 1) {
+            int a = 1;
+            for (int i=0; i<1000000; i++) {
+                a += tdata->id;
+            }
+        }
+
+        if (tdata->id == 1)
+            cond_resched();
+        else {
+            schedule_timeout_uninterruptible(1);
+        }
+    }
+
+    printk(KERN_INFO "Schedule thread: Exiting from CPU %d\n", 
smp_processor_id());
+    return 0;
+}
+
+static void __exit schedule_driver_exit(void)
+{
+    for (int i=0; i<test_count; i++) {
+        if (schedule_threads[i]) {
+            kthread_stop(schedule_threads[i]);
+            printk(KERN_INFO "Schedule driver: Thread stopped\n");
+        }
+    }
+}
+
+module_init(schedule_driver_init);
+module_exit(schedule_driver_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Your Name");
+MODULE_DESCRIPTION("A driver that creates a thread calling schedule() 
in a loop with CPU binding");
+MODULE_VERSION("1.0");
-- 
2.34.1

Re: [PATCH EEVDF NULL deref] sched/eevdf: Fix NULL deref when avg_vruntime nears overflow

Posted by Peter Zijlstra 3 months ago

On Wed, Nov 05, 2025 at 06:47:40PM +0800, Zicheng Qu wrote:
> This is the reproduction patch. We encountered a similar issue on a
> normally running kernel, and after analysis, we deliberately constructed
> a comparable scenario. Since it is difficult to control the conditions
> from user space, we proactively modified certain parts of the kernel
> code to simulate the issue (the code includes detailed explanations).

Thanks!

> As for the zero_vruntime patch, after applying it, we have not
> observed any crashes over than 24 hours.

Excellent, I'll go think about the SCHED_CORE case and make it a proper
patch then.

Re: [PATCH EEVDF NULL deref] sched/eevdf: Fix NULL deref when avg_vruntime nears overflow

Posted by Peter Zijlstra 3 months ago

On Wed, Nov 05, 2025 at 01:39:54PM +0100, Peter Zijlstra wrote:

> > As for the zero_vruntime patch, after applying it, we have not
> > observed any crashes over than 24 hours.
> 
> Excellent, I'll go think about the SCHED_CORE case and make it a proper
> patch then.

---
Subject: sched/core: Add comment explaining force-idle vruntime snapshots
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Nov 6 10:50:49 CET 2025

I always end up having to re-read these emails every time I look at
this code. And a future patch is going to change this story a little.
This means it is past time to stick them in a comment so it can be
modified and stay current.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200506143506.GH5298@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20200515103844.GG2978@hirez.programming.kicks-ass.net
---
 kernel/sched/fair.c |  181 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 181 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13015,6 +13015,187 @@ static inline void task_tick_core(struct
 }
 
 /*
+ * Consider any infeasible weight scenario. Take for instance two tasks,
+ * each bound to their respective sibling, one with weight 1 and one with
+ * weight 2. Then the lower weight task will run ahead of the higher weight
+ * task without bound.
+ *
+ * This utterly destroys the concept of a shared time base.
+ *
+ * Remember; all this is about a proportionally fair scheduling, where each
+ * tasks receives:
+ *
+ *              w_i
+ *   dt_i = ---------- dt                                     (1)
+ *          \Sum_j w_j
+ *
+ * which we do by tracking a virtual time, s_i:
+ *
+ *          1
+ *   s_i = --- d[t]_i                                         (2)
+ *         w_i
+ *
+ * Where d[t] is a delta of discrete time, while dt is an infinitesimal.
+ * The immediate corollary is that the ideal schedule S, where (2) to use
+ * an infinitesimal delta, is:
+ *
+ *           1
+ *   S = ---------- dt                                        (3)
+ *       \Sum_i w_i
+ *
+ * From which we can define the lag, or deviation from the ideal, as:
+ *
+ *   lag(i) = S - s_i                                         (4)
+ *
+ * And since the one and only purpose is to approximate S, we get that:
+ *
+ *   \Sum_i w_i lag(i) := 0                                   (5)
+ *
+ * If this were not so, we no longer converge to S, and we can no longer
+ * claim our scheduler has any of the properties we derive from S. This is
+ * exactly what you did above, you broke it!
+ *
+ *
+ * Let's continue for a while though; to see if there is anything useful to
+ * be learned. We can combine (1)-(3) or (4)-(5) and express S in s_i:
+ *
+ *       \Sum_i w_i s_i
+ *   S = --------------                                       (6)
+ *         \Sum_i w_i
+ *
+ * Which gives us a way to compute S, given our s_i. Now, if you've read
+ * our code, you know that we do not in fact do this, the reason for this
+ * is two-fold. Firstly, computing S in that way requires a 64bit division
+ * for every time we'd use it (see 12), and secondly, this only describes
+ * the steady-state, it doesn't handle dynamics.
+ *
+ * Anyway, in (6):  s_i -> x + (s_i - x), to get:
+ *
+ *           \Sum_i w_i (s_i - x)
+ *   S - x = --------------------                             (7)
+ *              \Sum_i w_i
+ *
+ * Which shows that S and s_i transform alike (which makes perfect sense
+ * given that S is basically the (weighted) average of s_i).
+ *
+ * Then:
+ *
+ *   x -> s_min := min{s_i}                                   (8)
+ *
+ * to obtain:
+ *
+ *               \Sum_i w_i (s_i - s_min)
+ *   S = s_min + ------------------------                     (9)
+ *                     \Sum_i w_i
+ *
+ * Which already looks familiar, and is the basis for our current
+ * approximation:
+ *
+ *   S ~= s_min                                              (10)
+ *
+ * Now, obviously, (10) is absolute crap :-), but it sorta works.
+ *
+ * So the thing to remember is that the above is strictly UP. It is
+ * possible to generalize to multiple runqueues -- however it gets really
+ * yuck when you have to add affinity support, as illustrated by our very
+ * first counter-example.
+ *
+ * Luckily I think we can avoid needing a full multi-queue variant for
+ * core-scheduling (or load-balancing). The crucial observation is that we
+ * only actually need this comparison in the presence of forced-idle; only
+ * then do we need to tell if the stalled rq has higher priority over the
+ * other.
+ *
+ * [XXX assumes SMT2; better consider the more general case, I suspect
+ * it'll work out because our comparison is always between 2 rqs and the
+ * answer is only interesting if one of them is forced-idle]
+ *
+ * And (under assumption of SMT2) when there is forced-idle, there is only
+ * a single queue, so everything works like normal.
+ *
+ * Let, for our runqueue 'k':
+ *
+ *   T_k = \Sum_i w_i s_i
+ *   W_k = \Sum_i w_i      ; for all i of k                  (11)
+ *
+ * Then we can write (6) like:
+ *
+ *         T_k
+ *   S_k = ---                                               (12)
+ *         W_k
+ *
+ * From which immediately follows that:
+ *
+ *           T_k + T_l
+ *   S_k+l = ---------                                       (13)
+ *           W_k + W_l
+ *
+ * On which we can define a combined lag:
+ *
+ *   lag_k+l(i) := S_k+l - s_i                               (14)
+ *
+ * And that gives us the tools to compare tasks across a combined runqueue.
+ *
+ *
+ * Combined this gives the following:
+ *
+ *  a) when a runqueue enters force-idle, sync it against it's sibling rq(s)
+ *     using (7); this only requires storing single 'time'-stamps.
+ *
+ *  b) when comparing tasks between 2 runqueues of which one is forced-idle,
+ *     compare the combined lag, per (14).
+ *
+ * Now, of course cgroups (I so hate them) make this more interesting in
+ * that a) seems to suggest we need to iterate all cgroup on a CPU at such
+ * boundaries, but I think we can avoid that. The force-idle is for the
+ * whole CPU, all it's rqs. So we can mark it in the root and lazily
+ * propagate downward on demand.
+ */
+
+/*
+ * So this sync is basically a relative reset of S to 0.
+ *
+ * So with 2 queues, when one goes idle, we drop them both to 0 and one
+ * then increases due to not being idle, and the idle one builds up lag to
+ * get re-elected. So far so simple, right?
+ *
+ * When there's 3, we can have the situation where 2 run and one is idle,
+ * we sync to 0 and let the idle one build up lag to get re-election. Now
+ * suppose another one also drops idle. At this point dropping all to 0
+ * again would destroy the built-up lag from the queue that was already
+ * idle, not good.
+ *
+ * So instead of syncing everything, we can:
+ *
+ *   less := !((s64)(s_a - s_b) <= 0)
+ *
+ *   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
+ *                             == v_a - (v_b - S_a + S_b)
+ *
+ * IOW, we can recast the (lag) comparison to a one-sided difference.
+ * So if then, instead of syncing the whole queue, sync the idle queue
+ * against the active queue with S_a + S_b at the point where we sync.
+ *
+ * (XXX consider the implication of living in a cyclic group: N / 2^n N)
+ *
+ * This gives us means of syncing single queues against the active queue,
+ * and for already idle queues to preserve their build-up lag.
+ *
+ * Of course, then we get the situation where there's 2 active and one
+ * going idle, who do we pick to sync against? Theory would have us sync
+ * against the combined S, but as we've already demonstrated, there is no
+ * such thing in infeasible weight scenarios.
+ *
+ * One thing I've considered; and this is where that core_active rudiment
+ * came from, is having active queues sync up between themselves after
+ * every tick. This limits the observed divergence due to the work
+ * conservancy.
+ *
+ * On top of that, we can improve upon things by moving away from our
+ * horrible (10) hack and moving to (9) and employing (13) here.
+ */
+
+/*
  * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
  */
 static void se_fi_update(const struct sched_entity *se, unsigned int fi_seq,

Re: [PATCH EEVDF NULL deref] sched/eevdf: Fix NULL deref when avg_vruntime nears overflow

Posted by Peter Zijlstra 3 months ago

On Thu, Nov 06, 2025 at 12:16:03PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 05, 2025 at 01:39:54PM +0100, Peter Zijlstra wrote:
> 
> > > As for the zero_vruntime patch, after applying it, we have not
> > > observed any crashes over than 24 hours.
> > 
> > Excellent, I'll go think about the SCHED_CORE case and make it a proper
> > patch then.
> 
> ---


Subject: sched/eevdf: Fix min_vruntime vs avg_vruntime
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed, 2 Apr 2025 20:07:34 +0200

Basically, from the constraint that the sum of lag is zero, you can
infer that the 0-lag point is the weighted average of the individual
vruntime, which is what we're trying to compute:

        \Sum w_i * v_i
  avg = --------------
           \Sum w_i

Now, since vruntime takes the whole u64 (worse, it wraps), this
multiplication term in the numerator is not something we can compute;
instead we do the min_vruntime (v0 henceforth) thing like:

  v_i = (v_i - v0) + v0

This does two things:
 - it keeps the key: (v_i - v0) 'small';
 - it creates a relative 0-point in the modular space.

If you do that subtitution and work it all out, you end up with:

        \Sum w_i * (v_i - v0)
  avg = --------------------- + v0
              \Sum w_i

Since you cannot very well track a ratio like that (and not suffer
terrible numerical problems) we simpy track the numerator and
denominator individually and only perform the division when strictly
needed.

Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator
lives in cfs_rq->avg_load.

The one extra 'funny' is that these numbers track the entities in the
tree, and current is typically outside of the tree, so avg_vruntime()
adds current when needed before doing the division.

(vruntime_eligible() elides the division by cross-wise multiplication)

Anyway, as mentioned above, we currently use the CFS era min_vruntime
for this purpose. However, this thing can only move forward, while the
above avg can in fact move backward (when a non-eligible task leaves,
the average becomes smaller), this can cause trouble when through
happenstance (or construction) these values drift far enough apart to
wreck the game.

Replace cfs_rq::min_vruntime with cfs_rq::zero_vruntime which is kept
near/at avg_vruntime, following its motion.

The down-side is that this requires computing the avg more often.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20250402180734.GX5880@noisy.programming.kicks-ass.net
Cc: stable@vger.kernel.org
---
 kernel/sched/debug.c |    8 +--
 kernel/sched/fair.c  |  114 +++++++++++----------------------------------------
 kernel/sched/sched.h |    4 -
 3 files changed, 31 insertions(+), 95 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -796,7 +796,7 @@ static void print_rq(struct seq_file *m,
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
 	struct sched_entity *last, *first, *root;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
@@ -819,15 +819,15 @@ void print_cfs_rq(struct seq_file *m, in
 	last = __pick_last_entity(cfs_rq);
 	if (last)
 		right_vruntime = last->vruntime;
-	min_vruntime = cfs_rq->min_vruntime;
+	zero_vruntime = cfs_rq->zero_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
 			SPLIT_NS(left_deadline));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
 			SPLIT_NS(left_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
-			SPLIT_NS(min_vruntime));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
+			SPLIT_NS(zero_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
 			SPLIT_NS(avg_vruntime(cfs_rq)));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -554,7 +554,7 @@ static inline bool entity_before(const s
 
 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+	return (s64)(se->vruntime - cfs_rq->zero_vruntime);
 }
 
 #define __node_2_se(node) \
@@ -606,13 +606,13 @@ static inline s64 entity_key(struct cfs_
  *
  * Which we track using:
  *
- *                    v0 := cfs_rq->min_vruntime
+ *                    v0 := cfs_rq->zero_vruntime
  * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
  *              \Sum w_i := cfs_rq->avg_load
  *
- * Since min_vruntime is a monotonic increasing variable that closely tracks
- * the per-task service, these deltas: (v_i - v), will be in the order of the
- * maximal (virtual) lag induced in the system due to quantisation.
+ * Since zero_vruntime closely tracks the per-task service, these
+ * deltas: (v_i - v), will be in the order of the maximal (virtual) lag
+ * induced in the system due to quantisation.
  *
  * Also, we use scale_load_down() to reduce the size.
  *
@@ -671,7 +671,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		avg = div_s64(avg, load);
 	}
 
-	return cfs_rq->min_vruntime + avg;
+	return cfs_rq->zero_vruntime + avg;
 }
 
 /*
@@ -732,7 +732,7 @@ static int vruntime_eligible(struct cfs_
 		load += weight;
 	}
 
-	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
+	return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -740,42 +740,14 @@ int entity_eligible(struct cfs_rq *cfs_r
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }
 
-static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+static void update_zero_vruntime(struct cfs_rq *cfs_rq)
 {
-	u64 min_vruntime = cfs_rq->min_vruntime;
-	/*
-	 * open coded max_vruntime() to allow updating avg_vruntime
-	 */
-	s64 delta = (s64)(vruntime - min_vruntime);
-	if (delta > 0) {
-		avg_vruntime_update(cfs_rq, delta);
-		min_vruntime = vruntime;
-	}
-	return min_vruntime;
-}
-
-static void update_min_vruntime(struct cfs_rq *cfs_rq)
-{
-	struct sched_entity *se = __pick_root_entity(cfs_rq);
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime);
 
-	if (curr) {
-		if (curr->on_rq)
-			vruntime = curr->vruntime;
-		else
-			curr = NULL;
-	}
-
-	if (se) {
-		if (!curr)
-			vruntime = se->min_vruntime;
-		else
-			vruntime = min_vruntime(vruntime, se->min_vruntime);
-	}
+	avg_vruntime_update(cfs_rq, delta);
 
-	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
+	cfs_rq->zero_vruntime = vruntime;
 }
 
 static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
@@ -848,6 +820,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 	se->min_vruntime = se->vruntime;
 	se->min_slice = se->slice;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
@@ -859,6 +832,7 @@ static void __dequeue_entity(struct cfs_
 	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				  &min_vruntime_cb);
 	avg_vruntime_sub(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 }
 
 struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
@@ -1226,7 +1200,6 @@ static void update_curr(struct cfs_rq *c
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	resched = update_deadline(cfs_rq, curr);
-	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
 		/*
@@ -3808,15 +3781,6 @@ static void reweight_entity(struct cfs_r
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
 		cfs_rq->nr_queued++;
-
-		/*
-		 * The entity's vruntime has been adjusted, so let's check
-		 * whether the rq-wide min_vruntime needs updated too. Since
-		 * the calculations above require stable min_vruntime rather
-		 * than up-to-date one, we do the update at the end of the
-		 * reweight process.
-		 */
-		update_min_vruntime(cfs_rq);
 	}
 }
 
@@ -5429,15 +5393,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	update_cfs_group(se);
 
-	/*
-	 * Now advance min_vruntime if @se was the entity holding it back,
-	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
-	 * put back on, and if we advance min_vruntime, we'll be placed back
-	 * further than we started -- i.e. we'll be penalized.
-	 */
-	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
-		update_min_vruntime(cfs_rq);
-
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);
 
@@ -9015,7 +8970,6 @@ static void yield_task_fair(struct rq *r
 	if (entity_eligible(cfs_rq, se)) {
 		se->vruntime = se->deadline;
 		se->deadline += calc_delta_fair(se->slice, se);
-		update_min_vruntime(cfs_rq);
 	}
 }
 
@@ -13078,23 +13032,6 @@ static inline void task_tick_core(struct
  * Which shows that S and s_i transform alike (which makes perfect sense
  * given that S is basically the (weighted) average of s_i).
  *
- * Then:
- *
- *   x -> s_min := min{s_i}                                   (8)
- *
- * to obtain:
- *
- *               \Sum_i w_i (s_i - s_min)
- *   S = s_min + ------------------------                     (9)
- *                     \Sum_i w_i
- *
- * Which already looks familiar, and is the basis for our current
- * approximation:
- *
- *   S ~= s_min                                              (10)
- *
- * Now, obviously, (10) is absolute crap :-), but it sorta works.
- *
  * So the thing to remember is that the above is strictly UP. It is
  * possible to generalize to multiple runqueues -- however it gets really
  * yuck when you have to add affinity support, as illustrated by our very
@@ -13116,23 +13053,23 @@ static inline void task_tick_core(struct
  * Let, for our runqueue 'k':
  *
  *   T_k = \Sum_i w_i s_i
- *   W_k = \Sum_i w_i      ; for all i of k                  (11)
+ *   W_k = \Sum_i w_i      ; for all i of k                  (8)
  *
  * Then we can write (6) like:
  *
  *         T_k
- *   S_k = ---                                               (12)
+ *   S_k = ---                                               (9)
  *         W_k
  *
  * From which immediately follows that:
  *
  *           T_k + T_l
- *   S_k+l = ---------                                       (13)
+ *   S_k+l = ---------                                       (10)
  *           W_k + W_l
  *
  * On which we can define a combined lag:
  *
- *   lag_k+l(i) := S_k+l - s_i                               (14)
+ *   lag_k+l(i) := S_k+l - s_i                               (11)
  *
  * And that gives us the tools to compare tasks across a combined runqueue.
  *
@@ -13143,7 +13080,7 @@ static inline void task_tick_core(struct
  *     using (7); this only requires storing single 'time'-stamps.
  *
  *  b) when comparing tasks between 2 runqueues of which one is forced-idle,
- *     compare the combined lag, per (14).
+ *     compare the combined lag, per (11).
  *
  * Now, of course cgroups (I so hate them) make this more interesting in
  * that a) seems to suggest we need to iterate all cgroup on a CPU at such
@@ -13191,12 +13128,11 @@ static inline void task_tick_core(struct
  * every tick. This limits the observed divergence due to the work
  * conservancy.
  *
- * On top of that, we can improve upon things by moving away from our
- * horrible (10) hack and moving to (9) and employing (13) here.
+ * On top of that, we can improve upon things by employing (10) here.
  */
 
 /*
- * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
+ * se_fi_update - Update the cfs_rq->zero_vruntime_fi in a CFS hierarchy if needed.
  */
 static void se_fi_update(const struct sched_entity *se, unsigned int fi_seq,
 			 bool forceidle)
@@ -13210,7 +13146,7 @@ static void se_fi_update(const struct sc
 			cfs_rq->forceidle_seq = fi_seq;
 		}
 
-		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
+		cfs_rq->zero_vruntime_fi = cfs_rq->zero_vruntime;
 	}
 }
 
@@ -13263,11 +13199,11 @@ bool cfs_prio_less(const struct task_str
 
 	/*
 	 * Find delta after normalizing se's vruntime with its cfs_rq's
-	 * min_vruntime_fi, which would have been updated in prior calls
+	 * zero_vruntime_fi, which would have been updated in prior calls
 	 * to se_fi_update().
 	 */
 	delta = (s64)(sea->vruntime - seb->vruntime) +
-		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
+		(s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi);
 
 	return delta > 0;
 }
@@ -13513,7 +13449,7 @@ static void set_next_task_fair(struct rq
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
-	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+	cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 }
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -681,10 +681,10 @@ struct cfs_rq {
 	s64			avg_vruntime;
 	u64			avg_load;
 
-	u64			min_vruntime;
+	u64			zero_vruntime;
 #ifdef CONFIG_SCHED_CORE
 	unsigned int		forceidle_seq;
-	u64			min_vruntime_fi;
+	u64			zero_vruntime_fi;
 #endif
 
 	struct rb_root_cached	tasks_timeline;

[tip: sched/core] sched/eevdf: Fix min_vruntime vs avg_vruntime

Posted by tip-bot2 for Peter Zijlstra 2 months, 3 weeks ago

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     79f3f9bedd149ea438aaeb0fb6a083637affe205
Gitweb:        https://git.kernel.org/tip/79f3f9bedd149ea438aaeb0fb6a083637affe205
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 02 Apr 2025 20:07:34 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 11 Nov 2025 12:33:38 +01:00

sched/eevdf: Fix min_vruntime vs avg_vruntime

Basically, from the constraint that the sum of lag is zero, you can
infer that the 0-lag point is the weighted average of the individual
vruntime, which is what we're trying to compute:

        \Sum w_i * v_i
  avg = --------------
           \Sum w_i

Now, since vruntime takes the whole u64 (worse, it wraps), this
multiplication term in the numerator is not something we can compute;
instead we do the min_vruntime (v0 henceforth) thing like:

  v_i = (v_i - v0) + v0

This does two things:
 - it keeps the key: (v_i - v0) 'small';
 - it creates a relative 0-point in the modular space.

If you do that subtitution and work it all out, you end up with:

        \Sum w_i * (v_i - v0)
  avg = --------------------- + v0
              \Sum w_i

Since you cannot very well track a ratio like that (and not suffer
terrible numerical problems) we simpy track the numerator and
denominator individually and only perform the division when strictly
needed.

Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator
lives in cfs_rq->avg_load.

The one extra 'funny' is that these numbers track the entities in the
tree, and current is typically outside of the tree, so avg_vruntime()
adds current when needed before doing the division.

(vruntime_eligible() elides the division by cross-wise multiplication)

Anyway, as mentioned above, we currently use the CFS era min_vruntime
for this purpose. However, this thing can only move forward, while the
above avg can in fact move backward (when a non-eligible task leaves,
the average becomes smaller), this can cause trouble when through
happenstance (or construction) these values drift far enough apart to
wreck the game.

Replace cfs_rq::min_vruntime with cfs_rq::zero_vruntime which is kept
near/at avg_vruntime, following its motion.

The down-side is that this requires computing the avg more often.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251106111741.GC4068168@noisy.programming.kicks-ass.net
Cc: stable@vger.kernel.org
---
 kernel/sched/debug.c |   8 +--
 kernel/sched/fair.c  | 114 +++++++++---------------------------------
 kernel/sched/sched.h |   4 +-
 3 files changed, 31 insertions(+), 95 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 02e16b7..41caa22 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -796,7 +796,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
 	struct sched_entity *last, *first, *root;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
@@ -819,15 +819,15 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	last = __pick_last_entity(cfs_rq);
 	if (last)
 		right_vruntime = last->vruntime;
-	min_vruntime = cfs_rq->min_vruntime;
+	zero_vruntime = cfs_rq->zero_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
 			SPLIT_NS(left_deadline));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
 			SPLIT_NS(left_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
-			SPLIT_NS(min_vruntime));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
+			SPLIT_NS(zero_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
 			SPLIT_NS(avg_vruntime(cfs_rq)));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4a11a83..8d971d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -554,7 +554,7 @@ static inline bool entity_before(const struct sched_entity *a,
 
 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+	return (s64)(se->vruntime - cfs_rq->zero_vruntime);
 }
 
 #define __node_2_se(node) \
@@ -606,13 +606,13 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  * Which we track using:
  *
- *                    v0 := cfs_rq->min_vruntime
+ *                    v0 := cfs_rq->zero_vruntime
  * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
  *              \Sum w_i := cfs_rq->avg_load
  *
- * Since min_vruntime is a monotonic increasing variable that closely tracks
- * the per-task service, these deltas: (v_i - v), will be in the order of the
- * maximal (virtual) lag induced in the system due to quantisation.
+ * Since zero_vruntime closely tracks the per-task service, these
+ * deltas: (v_i - v), will be in the order of the maximal (virtual) lag
+ * induced in the system due to quantisation.
  *
  * Also, we use scale_load_down() to reduce the size.
  *
@@ -671,7 +671,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		avg = div_s64(avg, load);
 	}
 
-	return cfs_rq->min_vruntime + avg;
+	return cfs_rq->zero_vruntime + avg;
 }
 
 /*
@@ -732,7 +732,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
+	return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -740,42 +740,14 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }
 
-static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+static void update_zero_vruntime(struct cfs_rq *cfs_rq)
 {
-	u64 min_vruntime = cfs_rq->min_vruntime;
-	/*
-	 * open coded max_vruntime() to allow updating avg_vruntime
-	 */
-	s64 delta = (s64)(vruntime - min_vruntime);
-	if (delta > 0) {
-		avg_vruntime_update(cfs_rq, delta);
-		min_vruntime = vruntime;
-	}
-	return min_vruntime;
-}
+	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime);
 
-static void update_min_vruntime(struct cfs_rq *cfs_rq)
-{
-	struct sched_entity *se = __pick_root_entity(cfs_rq);
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 vruntime = cfs_rq->min_vruntime;
-
-	if (curr) {
-		if (curr->on_rq)
-			vruntime = curr->vruntime;
-		else
-			curr = NULL;
-	}
+	avg_vruntime_update(cfs_rq, delta);
 
-	if (se) {
-		if (!curr)
-			vruntime = se->min_vruntime;
-		else
-			vruntime = min_vruntime(vruntime, se->min_vruntime);
-	}
-
-	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
+	cfs_rq->zero_vruntime = vruntime;
 }
 
 static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
@@ -848,6 +820,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 	se->min_vruntime = se->vruntime;
 	se->min_slice = se->slice;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
@@ -859,6 +832,7 @@ static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				  &min_vruntime_cb);
 	avg_vruntime_sub(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 }
 
 struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
@@ -1226,7 +1200,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	resched = update_deadline(cfs_rq, curr);
-	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
 		/*
@@ -3808,15 +3781,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
 		cfs_rq->nr_queued++;
-
-		/*
-		 * The entity's vruntime has been adjusted, so let's check
-		 * whether the rq-wide min_vruntime needs updated too. Since
-		 * the calculations above require stable min_vruntime rather
-		 * than up-to-date one, we do the update at the end of the
-		 * reweight process.
-		 */
-		update_min_vruntime(cfs_rq);
 	}
 }
 
@@ -5429,15 +5393,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_cfs_group(se);
 
-	/*
-	 * Now advance min_vruntime if @se was the entity holding it back,
-	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
-	 * put back on, and if we advance min_vruntime, we'll be placed back
-	 * further than we started -- i.e. we'll be penalized.
-	 */
-	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
-		update_min_vruntime(cfs_rq);
-
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);
 
@@ -9015,7 +8970,6 @@ static void yield_task_fair(struct rq *rq)
 	if (entity_eligible(cfs_rq, se)) {
 		se->vruntime = se->deadline;
 		se->deadline += calc_delta_fair(se->slice, se);
-		update_min_vruntime(cfs_rq);
 	}
 }
 
@@ -13078,23 +13032,6 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
  * Which shows that S and s_i transform alike (which makes perfect sense
  * given that S is basically the (weighted) average of s_i).
  *
- * Then:
- *
- *   x -> s_min := min{s_i}                                   (8)
- *
- * to obtain:
- *
- *               \Sum_i w_i (s_i - s_min)
- *   S = s_min + ------------------------                     (9)
- *                     \Sum_i w_i
- *
- * Which already looks familiar, and is the basis for our current
- * approximation:
- *
- *   S ~= s_min                                              (10)
- *
- * Now, obviously, (10) is absolute crap :-), but it sorta works.
- *
  * So the thing to remember is that the above is strictly UP. It is
  * possible to generalize to multiple runqueues -- however it gets really
  * yuck when you have to add affinity support, as illustrated by our very
@@ -13116,23 +13053,23 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
  * Let, for our runqueue 'k':
  *
  *   T_k = \Sum_i w_i s_i
- *   W_k = \Sum_i w_i      ; for all i of k                  (11)
+ *   W_k = \Sum_i w_i      ; for all i of k                  (8)
  *
  * Then we can write (6) like:
  *
  *         T_k
- *   S_k = ---                                               (12)
+ *   S_k = ---                                               (9)
  *         W_k
  *
  * From which immediately follows that:
  *
  *           T_k + T_l
- *   S_k+l = ---------                                       (13)
+ *   S_k+l = ---------                                       (10)
  *           W_k + W_l
  *
  * On which we can define a combined lag:
  *
- *   lag_k+l(i) := S_k+l - s_i                               (14)
+ *   lag_k+l(i) := S_k+l - s_i                               (11)
  *
  * And that gives us the tools to compare tasks across a combined runqueue.
  *
@@ -13143,7 +13080,7 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
  *     using (7); this only requires storing single 'time'-stamps.
  *
  *  b) when comparing tasks between 2 runqueues of which one is forced-idle,
- *     compare the combined lag, per (14).
+ *     compare the combined lag, per (11).
  *
  * Now, of course cgroups (I so hate them) make this more interesting in
  * that a) seems to suggest we need to iterate all cgroup on a CPU at such
@@ -13191,12 +13128,11 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
  * every tick. This limits the observed divergence due to the work
  * conservancy.
  *
- * On top of that, we can improve upon things by moving away from our
- * horrible (10) hack and moving to (9) and employing (13) here.
+ * On top of that, we can improve upon things by employing (10) here.
  */
 
 /*
- * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
+ * se_fi_update - Update the cfs_rq->zero_vruntime_fi in a CFS hierarchy if needed.
  */
 static void se_fi_update(const struct sched_entity *se, unsigned int fi_seq,
 			 bool forceidle)
@@ -13210,7 +13146,7 @@ static void se_fi_update(const struct sched_entity *se, unsigned int fi_seq,
 			cfs_rq->forceidle_seq = fi_seq;
 		}
 
-		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
+		cfs_rq->zero_vruntime_fi = cfs_rq->zero_vruntime;
 	}
 }
 
@@ -13263,11 +13199,11 @@ bool cfs_prio_less(const struct task_struct *a, const struct task_struct *b,
 
 	/*
 	 * Find delta after normalizing se's vruntime with its cfs_rq's
-	 * min_vruntime_fi, which would have been updated in prior calls
+	 * zero_vruntime_fi, which would have been updated in prior calls
 	 * to se_fi_update().
 	 */
 	delta = (s64)(sea->vruntime - seb->vruntime) +
-		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
+		(s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi);
 
 	return delta > 0;
 }
@@ -13513,7 +13449,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
-	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+	cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 82e74e8..5a3cf81 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -681,10 +681,10 @@ struct cfs_rq {
 	s64			avg_vruntime;
 	u64			avg_load;
 
-	u64			min_vruntime;
+	u64			zero_vruntime;
 #ifdef CONFIG_SCHED_CORE
 	unsigned int		forceidle_seq;
-	u64			min_vruntime_fi;
+	u64			zero_vruntime_fi;
 #endif
 
 	struct rb_root_cached	tasks_timeline;

[tip: sched/core] sched/core: Add comment explaining force-idle vruntime snapshots

Posted by tip-bot2 for Peter Zijlstra 2 months, 3 weeks ago

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9359d9785d85bb53f1ff1738a59aeeec4b878906
Gitweb:        https://git.kernel.org/tip/9359d9785d85bb53f1ff1738a59aeeec4b878906
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 06 Nov 2025 10:50:49 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 11 Nov 2025 12:33:37 +01:00

sched/core: Add comment explaining force-idle vruntime snapshots

I always end up having to re-read these emails every time I look at
this code. And a future patch is going to change this story a little.
This means it is past time to stick them in a comment so it can be
modified and stay current.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200506143506.GH5298@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20200515103844.GG2978@hirez.programming.kicks-ass.net
Link: https://patch.msgid.link/20251106111603.GB4068168@noisy.programming.kicks-ass.net
---
 kernel/sched/fair.c | 181 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 181 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1d8eb3..4a11a83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13015,6 +13015,187 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
 }
 
 /*
+ * Consider any infeasible weight scenario. Take for instance two tasks,
+ * each bound to their respective sibling, one with weight 1 and one with
+ * weight 2. Then the lower weight task will run ahead of the higher weight
+ * task without bound.
+ *
+ * This utterly destroys the concept of a shared time base.
+ *
+ * Remember; all this is about a proportionally fair scheduling, where each
+ * tasks receives:
+ *
+ *              w_i
+ *   dt_i = ---------- dt                                     (1)
+ *          \Sum_j w_j
+ *
+ * which we do by tracking a virtual time, s_i:
+ *
+ *          1
+ *   s_i = --- d[t]_i                                         (2)
+ *         w_i
+ *
+ * Where d[t] is a delta of discrete time, while dt is an infinitesimal.
+ * The immediate corollary is that the ideal schedule S, where (2) to use
+ * an infinitesimal delta, is:
+ *
+ *           1
+ *   S = ---------- dt                                        (3)
+ *       \Sum_i w_i
+ *
+ * From which we can define the lag, or deviation from the ideal, as:
+ *
+ *   lag(i) = S - s_i                                         (4)
+ *
+ * And since the one and only purpose is to approximate S, we get that:
+ *
+ *   \Sum_i w_i lag(i) := 0                                   (5)
+ *
+ * If this were not so, we no longer converge to S, and we can no longer
+ * claim our scheduler has any of the properties we derive from S. This is
+ * exactly what you did above, you broke it!
+ *
+ *
+ * Let's continue for a while though; to see if there is anything useful to
+ * be learned. We can combine (1)-(3) or (4)-(5) and express S in s_i:
+ *
+ *       \Sum_i w_i s_i
+ *   S = --------------                                       (6)
+ *         \Sum_i w_i
+ *
+ * Which gives us a way to compute S, given our s_i. Now, if you've read
+ * our code, you know that we do not in fact do this, the reason for this
+ * is two-fold. Firstly, computing S in that way requires a 64bit division
+ * for every time we'd use it (see 12), and secondly, this only describes
+ * the steady-state, it doesn't handle dynamics.
+ *
+ * Anyway, in (6):  s_i -> x + (s_i - x), to get:
+ *
+ *           \Sum_i w_i (s_i - x)
+ *   S - x = --------------------                             (7)
+ *              \Sum_i w_i
+ *
+ * Which shows that S and s_i transform alike (which makes perfect sense
+ * given that S is basically the (weighted) average of s_i).
+ *
+ * Then:
+ *
+ *   x -> s_min := min{s_i}                                   (8)
+ *
+ * to obtain:
+ *
+ *               \Sum_i w_i (s_i - s_min)
+ *   S = s_min + ------------------------                     (9)
+ *                     \Sum_i w_i
+ *
+ * Which already looks familiar, and is the basis for our current
+ * approximation:
+ *
+ *   S ~= s_min                                              (10)
+ *
+ * Now, obviously, (10) is absolute crap :-), but it sorta works.
+ *
+ * So the thing to remember is that the above is strictly UP. It is
+ * possible to generalize to multiple runqueues -- however it gets really
+ * yuck when you have to add affinity support, as illustrated by our very
+ * first counter-example.
+ *
+ * Luckily I think we can avoid needing a full multi-queue variant for
+ * core-scheduling (or load-balancing). The crucial observation is that we
+ * only actually need this comparison in the presence of forced-idle; only
+ * then do we need to tell if the stalled rq has higher priority over the
+ * other.
+ *
+ * [XXX assumes SMT2; better consider the more general case, I suspect
+ * it'll work out because our comparison is always between 2 rqs and the
+ * answer is only interesting if one of them is forced-idle]
+ *
+ * And (under assumption of SMT2) when there is forced-idle, there is only
+ * a single queue, so everything works like normal.
+ *
+ * Let, for our runqueue 'k':
+ *
+ *   T_k = \Sum_i w_i s_i
+ *   W_k = \Sum_i w_i      ; for all i of k                  (11)
+ *
+ * Then we can write (6) like:
+ *
+ *         T_k
+ *   S_k = ---                                               (12)
+ *         W_k
+ *
+ * From which immediately follows that:
+ *
+ *           T_k + T_l
+ *   S_k+l = ---------                                       (13)
+ *           W_k + W_l
+ *
+ * On which we can define a combined lag:
+ *
+ *   lag_k+l(i) := S_k+l - s_i                               (14)
+ *
+ * And that gives us the tools to compare tasks across a combined runqueue.
+ *
+ *
+ * Combined this gives the following:
+ *
+ *  a) when a runqueue enters force-idle, sync it against it's sibling rq(s)
+ *     using (7); this only requires storing single 'time'-stamps.
+ *
+ *  b) when comparing tasks between 2 runqueues of which one is forced-idle,
+ *     compare the combined lag, per (14).
+ *
+ * Now, of course cgroups (I so hate them) make this more interesting in
+ * that a) seems to suggest we need to iterate all cgroup on a CPU at such
+ * boundaries, but I think we can avoid that. The force-idle is for the
+ * whole CPU, all it's rqs. So we can mark it in the root and lazily
+ * propagate downward on demand.
+ */
+
+/*
+ * So this sync is basically a relative reset of S to 0.
+ *
+ * So with 2 queues, when one goes idle, we drop them both to 0 and one
+ * then increases due to not being idle, and the idle one builds up lag to
+ * get re-elected. So far so simple, right?
+ *
+ * When there's 3, we can have the situation where 2 run and one is idle,
+ * we sync to 0 and let the idle one build up lag to get re-election. Now
+ * suppose another one also drops idle. At this point dropping all to 0
+ * again would destroy the built-up lag from the queue that was already
+ * idle, not good.
+ *
+ * So instead of syncing everything, we can:
+ *
+ *   less := !((s64)(s_a - s_b) <= 0)
+ *
+ *   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
+ *                             == v_a - (v_b - S_a + S_b)
+ *
+ * IOW, we can recast the (lag) comparison to a one-sided difference.
+ * So if then, instead of syncing the whole queue, sync the idle queue
+ * against the active queue with S_a + S_b at the point where we sync.
+ *
+ * (XXX consider the implication of living in a cyclic group: N / 2^n N)
+ *
+ * This gives us means of syncing single queues against the active queue,
+ * and for already idle queues to preserve their build-up lag.
+ *
+ * Of course, then we get the situation where there's 2 active and one
+ * going idle, who do we pick to sync against? Theory would have us sync
+ * against the combined S, but as we've already demonstrated, there is no
+ * such thing in infeasible weight scenarios.
+ *
+ * One thing I've considered; and this is where that core_active rudiment
+ * came from, is having active queues sync up between themselves after
+ * every tick. This limits the observed divergence due to the work
+ * conservancy.
+ *
+ * On top of that, we can improve upon things by moving away from our
+ * horrible (10) hack and moving to (9) and employing (13) here.
+ */
+
+/*
  * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
  */
 static void se_fi_update(const struct sched_entity *se, unsigned int fi_seq,