[PATCH] sched: fair: make V move forward only

wangtao posted 1 patch 2 months, 1 week ago
V is the weighted average of entities. Adding tasks with positive lag or
removing tasks with negative lag may cause V to move backward. This will
result in unfair task scheduling, causing previously eligible tasks to
become ineligible, shorter runtimes, and more task switches.

For example, when adding tasks a, x, and b, where a and b have zero lag
and x has positive lag, task b (added later) might be scheduled before
task a.

Making V move forward only resolves such issues and simplifies the code
for adding tasks with positive lag.

Hackbench tests show that with this patch execution time is significantly
reduced due to fewer task switches.

-------------------------------------------------
hackbench test              base    patch   opt
-------------------------------------------------
process 1 group:            0.141   0.100   -29.3%
process 4 group:            0.375   0.295   -21.2%
process 16 group:           1.495   1.204   -19.5%
thread 1 group:             0.090   0.068   -25.1%
thread 4 group:             0.244   0.211   -13.4%
thread 16 group:            0.860   0.795    -7.6%
pipe process 1 group:       0.124   0.090   -27.8%
pipe process 4 group:       0.340   0.289   -15.2%
pipe process 16 group:      1.401   1.144   -18.3%
pipe thread 1 group:        0.081   0.071   -11.7%
pipe thread 4 group:        0.241   0.181   -24.7%
pipe thread 16 group:       0.787   0.706   -10.2%

Signed-off-by: wangtao <tao.wangtao@honor.com>
---
 kernel/sched/fair.c  | 16 ++++++++++++----
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..889ee8d4c9bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -671,7 +671,11 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		avg = div_s64(avg, load);
 	}
 
-	return cfs_rq->min_vruntime + avg;
+	avg += cfs_rq->min_vruntime;
+	if ((s64)(cfs_rq->forward_avg_vruntime - avg) < 0)
+		cfs_rq->forward_avg_vruntime = avg;
+
+	return cfs_rq->forward_avg_vruntime;
 }
 
 /*
@@ -725,6 +729,9 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 	s64 avg = cfs_rq->avg_vruntime;
 	long load = cfs_rq->avg_load;
 
+	if ((s64)(cfs_rq->forward_avg_vruntime - vruntime) >= 0)
+		return 1;
+
 	if (curr && curr->on_rq) {
 		unsigned long weight = scale_load_down(curr->load.weight);
 
@@ -5139,12 +5146,13 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *
 	 * EEVDF: placement strategy #1 / #2
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag)
+		lag = se->vlag;
+	/* positive lag does not evaporate with forward_avg_vruntime */
+	if (lag < 0) {
 		struct sched_entity *curr = cfs_rq->curr;
 		unsigned long load;
 
-		lag = se->vlag;
-
 		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adfb6e3409d7..2691d5e8a0ab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -681,6 +681,7 @@ struct cfs_rq {
 
 	s64			avg_vruntime;
 	u64			avg_load;
+	u64			forward_avg_vruntime;
 
 	u64			min_vruntime;
 #ifdef CONFIG_SCHED_CORE
-- 
2.17.1
Re: [PATCH] sched: fair: make V move forward only
Posted by Peter Zijlstra 2 months, 1 week ago
On Fri, Nov 28, 2025 at 04:11:18PM +0800, wangtao wrote:
> V is the weighted average of entities. Adding tasks with positive lag or
> removing tasks with negative lag may cause V to move backward. This will
> result in unfair task scheduling,

Have you actually read the paper? Why do you think this breaks fairness?

> causing previously eligible tasks to become ineligible, shorter
> runtimes, and more task switches.

None of that is a fairness issue. Those are issues related to when,
rather than how much time is given.

> Making V move forward only resolves such issues and simplifies the code
> for adding tasks with positive lag.

It breaks a metric ton of math. Which you don't provide updates for.

Yes, the paper is light on dynamic behaviour, but please don't disregard
the math like this. Either stay inside the constraints laid out, or
provide coherent alternatives. Notably EEVDF is in the same class of
scheduling functions as WF2Q and both provide better lag bounds than the
simpler WFQ class of schedulers.

The 'zero-lag point is the weighted average of the entities' is a fairly
core tenet of EEVDF. Mucking with this *will* mess with the lag bounds.

The delayed dequeue feature tries to address some of these concerns by
keeping non-eligible (negative lag) tasks on the runqueue until such
time that they become eligible (approximated by getting picked again) at
which point they get removed (and any positive lag gets truncated, as if
they were removed at zero-lag). As a consequence you will have much less
removal of negative lag, additionally such tasks will be eligible the
moment they come back.

Also, there is the small matter that your patch simply does not apply.
RE: [PATCH] sched: fair: make V move forward only
Posted by wangtao 2 months, 1 week ago
> -----Original Message-----
> From: Peter Zijlstra <peterz@infradead.org>
> Sent: Friday, November 28, 2025 5:29 PM
> To: wangtao <tao.wangtao@honor.com>
> Cc: mingo@redhat.com; juri.lelli@redhat.com; vincent.guittot@linaro.org;
> dietmar.eggemann@arm.com; rostedt@goodmis.org; bsegall@google.com;
> mgorman@suse.de; vschneid@redhat.com; linux-kernel@vger.kernel.org;
> liulu.liu@honor.com; bintian.wang@honor.com
> Subject: Re: [PATCH] sched: fair: make V move forward only
> 
> On Fri, Nov 28, 2025 at 04:11:18PM +0800, wangtao wrote:
> > V is the weighted average of entities. Adding tasks with positive lag
> > or removing tasks with negative lag may cause V to move backward. This
> > will result in unfair task scheduling,
> 
> Have you actually read the paper? Why do you think this breaks fairness?
> 
> > causing previously eligible tasks to become ineligible, shorter
> > runtimes, and more task switches.
> 
> None of that is a fairness issue. Those are issues related to when, rather than
> how much time is given.
> 
Maybe my earlier explanation was unclear, so here is a detailed example.

The run queue is empty. V is V_0. We add three tasks A, X, B in order.
lag_A = 0, lag_X > 0, lag_B = 0.
To simplify, assume the time gaps of the three additions are so small that
we can treat them as zero.

  Add task A: v_A = V_0. After adding A, V does not change: V_A = V_0.
  Add task X: v_X = V_A - lag_X < V_0. After adding X, V_X < V_0.
  Add task B: v_B = V_X < V_0 = v_A. After adding B, V_B = V_X < V_0.

X runs first. After X runs, because v_B < v_A, we also get
deadline_B < deadline_A. So the later-added task B runs before A, which
is not fair.

But in reality the three additions have time gaps. Suppose A runs for a
short time before adding X. Let V_A' be V at that moment, and let v_A' be
A's vruntime. We have v_A' = V_A' > V_0.
v_X = V_A' - lag_X < V_A'. After putting X into the queue, V becomes
smaller: V_X < V_A'. Since v_A' > V_X, A becomes ineligible. Even if A is
within protect_slice, it will switch to X immediately, increasing context
switches.

Here is a trace where several tasks sleep for a while and then run briefly:
v_30981 = 16015784959115 = V, schedule 30981
  <idle>-0 (...) place_entity ... pid 30981 vruntime 16015784959115 vlag 0 lag 0
  sched_switch ... next_pid=30981 ...
After adding 30983, since vlag_30983 > 0:
v_30983 = 16015785028217 < V_30981' = v_30981' = 16015785080365
  place_entity ... pid 30983 vruntime 16015785028217 vlag 26074 lag 52148
After adding 30982, even though vlag_30982 < 0, because vlag_30983 is
large: v_30982 = 16015785061209 < V_30981' = 16015785080365
  place_entity ... pid 30982 vruntime 16015785061209 vlag -670 lag -3793
Switch to 30983:
  sched_switch ... next_pid=30983 ...
Then task 30982 with negative vlag runs before 30981:
  sched_switch ... next_pid=30982 ...
  sched_switch ... next_pid=30981 ...
  sched_switch ... next_pid=0 ...

> > Making V move forward only resolves such issues and simplifies the
> > code for adding tasks with positive lag.
> 
> It breaks a metric ton of math. Which you don't provide updates for.
> 
> Yes, the paper is light on dynamic behaviour, but please don't disregard the
> math like this. Either stay inside the constraints laid out, or provide coherent
> alternatives. Notably EEVDF is in the same class of scheduling functions as
> WF2Q and both provide better lag bounds than the simpler WFQ class of
> schedulers.
> 
> The 'zero-lag point is the weighted average of the entities' is a fairly core
> tenet of EEVDF. Mucking with this *will* mess with the lag bounds.
> 
You are right. The reasoning should be based on math. I will submit a new
patch based on the derivation below.

When deriving V, we previously assumed \Sum lag_i = 0, but in many cases
\Sum lag_i is not zero.
The simplest case is when the run queue is empty and we add a task with
non-zero lag; then \Sum lag_i is clearly not zero.

Our goal is \Sum lag_i = 0. So when task i finishes running and is removed
from the queue, its lag is:

  lag_i = S - s_i = w_i * (V - v_i)

After some time, when task i is added back, the other tasks' v have
changed, and V has changed. Let A be V at that moment:

  \Sum lag_i = \Sum w_i * (A - v_i)

We can compute A from v_i and lag_i:

  A = \Sum (w_i * v_i + lag_i) / \Sum w_i = \Sum (w_i * v_i + lag_i) / W

Use A to compute v for the newly added task k. After adding k:

  A' = (\Sum (w_i * v_i + lag_i) + w_k * v_k + lag_k) / (W + w_k)

To preserve lag_k, set v_k = A' - lag_k / w_k, then:

  A' = (\Sum (w_i * v_i + lag_i) + w_k * A') / (W + w_k)
  A' * (W + w_k) = \Sum (w_i * v_i + lag_i) + w_k * A'
  A' = \Sum (w_i * v_i + lag_i) / W = A

This shows that adding task k does not change A.
Similarly, removing task k does not change A.

It is easy to get:

  A = V + (\Sum lag_i) / W

\Sum lag_i stays around 0 when tasks are added or removed.

The roles of V and A are:
  1) Use V to judge whether task is eligible.
  2) Use V to compute lag when a task is removed.
  3) Use A to compute v when a task is added.

> The delayed dequeue feature tries to address some of these concerns by
> keeping non-eligible (negative lag) tasks on the runqueue until such time
> that they become eligible (approximated by getting picked again) at which
> point they get removed (and any positive lag gets truncated, as if they were
> removed at zero-lag). As a consequence you will have much less removal of
> negative lag, additionally such tasks will be eligible the moment they come
> back.
> 
> Also, there is the small matter that your patch simply does not apply.

Thanks,
Tao
Re: [PATCH] sched: fair: make V move forward only
Posted by Peter Zijlstra 2 months, 1 week ago
On Tue, Dec 02, 2025 at 12:43:15PM +0000, wangtao wrote:
> 
> > -----Original Message-----
> > From: Peter Zijlstra <peterz@infradead.org>
> > Sent: Friday, November 28, 2025 5:29 PM
> > To: wangtao <tao.wangtao@honor.com>
> > Cc: mingo@redhat.com; juri.lelli@redhat.com; vincent.guittot@linaro.org;
> > dietmar.eggemann@arm.com; rostedt@goodmis.org; bsegall@google.com;
> > mgorman@suse.de; vschneid@redhat.com; linux-kernel@vger.kernel.org;
> > liulu.liu@honor.com; bintian.wang@honor.com
> > Subject: Re: [PATCH] sched: fair: make V move forward only
> > 
> > On Fri, Nov 28, 2025 at 04:11:18PM +0800, wangtao wrote:
> > > V is the weighted average of entities. Adding tasks with positive lag
> > > or removing tasks with negative lag may cause V to move backward. This
> > > will result in unfair task scheduling,
> > 
> > Have you actually read the paper? Why do you think this breaks fairness?
> > 
> > > causing previously eligible tasks to become ineligible, shorter
> > > runtimes, and more task switches.
> > 
> > None of that is a fairness issue. Those are issues related to when, rather than
> > how much time is given.
> > 
> Maybe my earlier explanation was unclear, so here is a detailed example.

Not unclear, still not sure how it's a fairness issue. Yes it increases
context switches, but fairness is about the amount of time distributed,
not about when time is given.

It is entirely reasonable for a task that wakes up to run now if it has
positive lag.

> > > Making V move forward only resolves such issues and simplifies the
> > > code for adding tasks with positive lag.
> > 
> > It breaks a metric ton of math. Which you don't provide updates for.
> > 
> > Yes, the paper is light on dynamic behaviour, but please don't disregard the
> > math like this. Either stay inside the constraints laid out, or provide coherent
> > alternatives. Notably EEVDF is in the same class of scheduling functions as
> > WF2Q and both provide better lag bounds than the simpler WFQ class of
> > schedulers.
> > 
> > The 'zero-lag point is the weighted average of the entities' is a fairly core
> > tenet of EEVDF. Mucking with this *will* mess with the lag bounds.
> > 
> You are right. The reasoning should be based on math. I will submit a new
> patch based on the derivation below.
> 
> When deriving V, we previously assumed \Sum lag_i = 0, but in many cases
> \Sum lag_i is not zero.

Sure, which is where the V adjustments come from.

> The simplest case is when the run queue is empty and we add a task with
> non-zero lag; then \Sum lag_i is clearly not zero.

Well, all of this is only valid under contention. If there is only a
single task, there is no lag. Note how we ignore lag when placing the
first entry.

Still the point remains, you can leave/join with non-zero lag, and in
those cases we have to adjust V.

> Our goal is \Sum lag_i = 0. 

This is equivalent to the statement that the zero-lag point is the
weighted average of vruntime. These two things are interchangeable.

> So when task i finishes running and is removed
> from the queue, its lag is:
> 
>   lag_i = S - s_i = w_i * (V - v_i)

Yes, but taking it out also affects V.

> After some time, when task i is added back,

Note how you state we're adding _i_ back. [*]

> the other tasks' v have
> changed, and V has changed. Let A be V at that moment:
> 
>   \Sum lag_i = \Sum w_i * (A - v_i)

You need to introduce more indices, i was our removed task, but now
you're re-using i to sum over the remaining tasks?

And if indeed you did mean:

	\Sum_{j != i} lag_j = \Sum_{j != i} w_j * (A - v_j)

                                   \Sum v_j * w_j
Then yes, because A, or rather V = --------------
                                      \Sum w_j

Making both left- and right-hand expressions 0.

> We can compute A from v_i and lag_i:
> 
>   A = \Sum (w_i * v_i + lag_i) / \Sum w_i = \Sum (w_i * v_i + lag_i) / W

You're doing:

lag_j = w_j * (A - v_j)

lag_j 
----- = A - v_j
w_j

lag_j
----- + v_j = A
w_j

lag_j   v_j * w_j
----- + --------- = A
w_j     w_j

lag_j + v_j * w_j
----------------- = A
w_j

Which is creating a new clock that absorbs a non-zero lag sum?

I don't think there's a bound on the difference between A and V. You can
extract unbounded lag from the system. We recently had someone showcase
exactly that, they managed to wrap V backwards far enough to make the
old min_vruntime thing wrap the s64 space and things went sideways real
fast.

> Use A to compute v for the newly added task k.

But you were adding task i, per [*]

> After adding k:
> 
>   A' = (\Sum (w_i * v_i + lag_i) + w_k * v_k + lag_k) / (W + w_k)
> 
> To preserve lag_k, set v_k = A' - lag_k / w_k, then:
> 
>   A' = (\Sum (w_i * v_i + lag_i) + w_k * A') / (W + w_k)
>   A' * (W + w_k) = \Sum (w_i * v_i + lag_i) + w_k * A'
>   A' = \Sum (w_i * v_i + lag_i) / W = A
> 
> This shows that adding task k does not change A.
> Similarly, removing task k does not change A.
> 
> It is easy to get:
> 
>   A = V + (\Sum lag_i) / W
> 
> \Sum lag_i stays around 0 when tasks are added or removed.
> 
> The roles of V and A are:
>   1) Use V to judge whether task is eligible.
>   2) Use V to compute lag when a task is removed.
>   3) Use A to compute v when a task is added.

So I don't think A can work. As mentioned, they can drift unbounded, and
per that they would also break the lag bounds as outlined in lemmas 3
and onwards.

Now, there is a clue in the WF2Q paper, that it is possible to change
the V function. This is mentioned in the future work section of the
paper. I've never gotten around to figuring out which paper that is and
getting a copy.
RE: [PATCH] sched: fair: make V move forward only
Posted by wangtao 2 months ago
> 
> Not unclear, still not sure how its a fairness issue. Yes it increases context
> switches, but fairness is about the amount of time distributed, not about
> when time is given.
> 
> It is entirely reasonable for a task that wakes up to run now if it has positive
> lag.
> 
Tasks with larger lag should run first. If a task with smaller lag runs
earlier, those time slices are distributed unfairly; at the least it is
not reasonable.

It may be easier to explain this by separating
inqueue_lag (lag), join_lag (jlag), and leave_lag (llag).
For all tasks in the queue, \Sum lag_i is 0, but \Sum jlag_i is not always 0.

Suppose the current V = V0, all weights are 1, and we add four tasks
with jlag values 0, 10, 80, and -12. Considering preserve-lag handling
with vlag_i = (W + w_i) * vlag_i' / W:

-------------------------------
event  | jlag | v      | W | V
-------------------------------
add T1 |    0 | V0     | 1 | V0
add T2 |   20 | V0-20  | 2 | V0-10
add T3 |  120 | V0-130 | 3 | V0-50
add T4 |  -16 | V0-34  | 4 | V0-46

Because V becomes smaller after adding T2 and T3, even though
lag_T4 < 0, we still have v_T3 < v_T4 < v_T2 < v_T1.
So the schedule order is T3, T4, T2, T1.
A similar issue exists even without preserve-lag.
If tasks are added in order T2, T3, T1, T4, the schedule order becomes
T3, T1, T4, T2, which shows instability.

> 
> Which is creating a new clock that absorbs a non-zero lag sum?
> 
> I don't think there's a bound on the difference between A and V. You can
> extract unbounded lag from the system. We recently had someone
> showcase exactly that, they managed to wrap V backwards far enough to
> make the old min_vruntime thing wrap the s64 space and things went
> sideways real fast.
>

Earlier I did not clearly separate jlag/lag/llag. Intuitively one may think
  sum_jlag = \Sum jlag_j - \Sum llag_l.
But \Sum llag_l may approach 0, while \Sum jlag_j may have no clear
bound. For example, if all tasks with jlag > 0 go to cfs_rq0 and all with
jlag < 0 go to cfs_rq1, then \Sum jlag_j of cfs_rq0 is unbounded.
 
From deriving the leave process, we find that jlag_l and llag_l are not always equal.

We define A to stay unchanged when tasks join or leave, and we have:
  v = V - vlag = A - vjlag
  J = A - V
  J * W = sum_jlag = (\Sum jlag_j - \Sum jlag_l)

1) A task leaves and later re-joins. We know its previous llag.
   Set jlag_j = llag. Then v_j = A - vjlag_j keeps A unchanged:
   A_j = (A * W + v_j * w_j + vjlag_j * w_j) / (W + w_j) = A

2) For any task i running in the queue, v_i and V change.
   Then vlag_i = V - v_i also changes, but vjlag_i does not change.

3) When a task leaves, from v_l we get vllag_l = V - v_l
   and vjlag_l = A - v_l.
   vjlag_l = vllag_l + A - V = vllag_l + J
   Because J may not be 0, vjlag_l and vllag_l are not always equal.

Now we analyze the boundary of sum_jlag:

When a task is added:
  sum_jlag_j = sum_jlag + jlag_j
  Here jlag_j is the llag from the last leave.
  Its upper bound is the same as the lag limit q.
  Adding n tasks means sum_jlag does not exceed q * n.

When a task leaves:
  sum_jlag_l = sum_jlag - jlag_l = sum_jlag - vjlag_l * w_l
  sum_jlag_l = sum_jlag - (vllag_l + J) * w_l
  sum_jlag_l = sum_jlag - llag_l - sum_jlag * w_l / W
  sum_jlag_l = sum_jlag * (W - w_l) / W - llag_l
This means sum_jlag is reduced proportionally by weight.
When the last task leaves, sum_jlag reduces to -llag of that task.

If n tasks are added and m tasks leave, the upper bound of sum_jlag
is still no more than q * (n - m).

Using v_j = A - vjlag_j makes scheduling more stable when tasks join
or leave. It is equivalent to preserving vlag = vjlag - J.

Using the same example with jlag values 0, 10, 80, -12:

------------------------------------------
event  | jlag | v      | W | V       | J
------------------------------------------
add T1 |    0 | V0     | 1 | V0      | 0
add T2 |   10 | V0-10  | 2 | V0-5    | 5
add T3 |   80 | V0-80  | 3 | V0-30   | 30
add T4 |  -12 | V0+12  | 4 | V0-19.5 | 19.5

As long as the four tasks are added at the same time, no matter the order,
the schedule order will be T3, T2, T1, T4.

Thanks,
Tao