The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 6d71a9c6160479899ee744d2c6d6602a191deb1f
Gitweb: https://git.kernel.org/tip/6d71a9c6160479899ee744d2c6d6602a191deb1f
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 09 Jan 2025 11:59:59 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 09 Jan 2025 12:55:27 +01:00
sched/fair: Fix EEVDF entity placement bug causing scheduling lag
I noticed this in my traces today:
turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
{ weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }
Which is a weight transition: 1048576 -> 2 -> 1048576.
One would expect the lag to shoot out *AND* come back, notably:
-1123*1048576/2 = -588775424
-588775424*2/1048576 = -1123
Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.
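As a standalone illustration (not part of the patch), the expected round trip
can be checked with plain integer arithmetic, using the vlag and weights from
the trace above:

/*
 * Illustration only: the expected lag round trip for the weight
 * transition 1048576 -> 2 -> 1048576 seen in the trace above.
 * vlag is kept as V - v, so a reweight should scale it by
 * old_weight / new_weight (the kernel uses div_s64() for this).
 */
#include <stdio.h>

static long long scale_vlag(long long vlag, long long old_w, long long new_w)
{
	return vlag * old_w / new_w;
}

int main(void)
{
	long long vlag = -1123;

	vlag = scale_vlag(vlag, 1048576, 2);	/* -588775424 */
	printf("after 1048576 -> 2: %lld\n", vlag);

	vlag = scale_vlag(vlag, 2, 1048576);	/* back to -1123 */
	printf("after 2 -> 1048576: %lld\n", vlag);
	return 0;
}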
This made me have a very hard look at reweight_entity(), and
specifically the ->on_rq case, which is more prominent with
DELAY_DEQUEUE.
And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation of the new vruntime from that lag is
broken: it does not consider the logic set out in place_entity().
With the below patch, I now see things like:
migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
{ weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
{ weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
{ weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
{ weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
Which isn't perfect yet, but much closer.
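My reading of the diff below, as a rough userspace model (plain integers
instead of kernel types, and leaving out place_entity()'s averaging
compensation for brevity): a reweight now captures the lag and a deadline
relative to vruntime, scales both by old_weight/new_weight, and lets the
placement on re-enqueue rebuild vruntime and the absolute deadline:

/*
 * Rough model of the reweight path after this patch (illustration only).
 */
#include <stdio.h>

struct ent {
	long long weight;
	long long vruntime;
	long long deadline;	/* absolute while queued */
	long long vlag;		/* V - vruntime, kept across (de)queue */
};

static void reweight(struct ent *se, long long V, long long new_weight)
{
	/* "dequeue": capture lag and a deadline relative to vruntime */
	se->vlag = V - se->vruntime;
	long long rel_deadline = se->deadline - se->vruntime;

	/* scale both by old_weight / new_weight */
	se->vlag = se->vlag * se->weight / new_weight;
	rel_deadline = rel_deadline * se->weight / new_weight;
	se->weight = new_weight;

	/* "enqueue": placement re-derives vruntime, then the deadline */
	se->vruntime = V - se->vlag;
	se->deadline = se->vruntime + rel_deadline;
}

int main(void)
{
	struct ent se = { .weight = 1048576, .vruntime = 999000,
			  .deadline = 1002000, .vlag = 0 };
	long long V = 1000000;

	reweight(&se, V, 2);		/* weight 1048576 -> 2 */
	reweight(&se, V, 1048576);	/* and back again */
	printf("vruntime=%lld deadline=%lld (lag=%lld)\n",
	       se.vruntime, se.deadline, V - se.vruntime);
	return 0;
}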
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net
---
kernel/sched/fair.c | 145 +++++--------------------------------------
1 file changed, 18 insertions(+), 127 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38..eeed8e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -689,21 +689,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
*
* XXX could add max_slice to the augmented data to track this.
*/
-static s64 entity_lag(u64 avruntime, struct sched_entity *se)
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
s64 vlag, limit;
- vlag = avruntime - se->vruntime;
- limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
-
- return clamp(vlag, -limit, limit);
-}
-
-static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
SCHED_WARN_ON(!se->on_rq);
- se->vlag = entity_lag(avg_vruntime(cfs_rq), se);
+ vlag = avg_vruntime(cfs_rq) - se->vruntime;
+ limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+
+ se->vlag = clamp(vlag, -limit, limit);
}
/*
@@ -3774,137 +3769,32 @@ static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
#endif
-static void reweight_eevdf(struct sched_entity *se, u64 avruntime,
- unsigned long weight)
-{
- unsigned long old_weight = se->load.weight;
- s64 vlag, vslice;
-
- /*
- * VRUNTIME
- * --------
- *
- * COROLLARY #1: The virtual runtime of the entity needs to be
- * adjusted if re-weight at !0-lag point.
- *
- * Proof: For contradiction assume this is not true, so we can
- * re-weight without changing vruntime at !0-lag point.
- *
- * Weight VRuntime Avg-VRuntime
- * before w v V
- * after w' v' V'
- *
- * Since lag needs to be preserved through re-weight:
- *
- * lag = (V - v)*w = (V'- v')*w', where v = v'
- * ==> V' = (V - v)*w/w' + v (1)
- *
- * Let W be the total weight of the entities before reweight,
- * since V' is the new weighted average of entities:
- *
- * V' = (WV + w'v - wv) / (W + w' - w) (2)
- *
- * by using (1) & (2) we obtain:
- *
- * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
- * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
- * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
- * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
- *
- * Since we are doing at !0-lag point which means V != v, we
- * can simplify (3):
- *
- * ==> W / (W + w' - w) = w / w'
- * ==> Ww' = Ww + ww' - ww
- * ==> W * (w' - w) = w * (w' - w)
- * ==> W = w (re-weight indicates w' != w)
- *
- * So the cfs_rq contains only one entity, hence vruntime of
- * the entity @v should always equal to the cfs_rq's weighted
- * average vruntime @V, which means we will always re-weight
- * at 0-lag point, thus breach assumption. Proof completed.
- *
- *
- * COROLLARY #2: Re-weight does NOT affect weighted average
- * vruntime of all the entities.
- *
- * Proof: According to corollary #1, Eq. (1) should be:
- *
- * (V - v)*w = (V' - v')*w'
- * ==> v' = V' - (V - v)*w/w' (4)
- *
- * According to the weighted average formula, we have:
- *
- * V' = (WV - wv + w'v') / (W - w + w')
- * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
- * = (WV - wv + w'V' - Vw + wv) / (W - w + w')
- * = (WV + w'V' - Vw) / (W - w + w')
- *
- * ==> V'*(W - w + w') = WV + w'V' - Vw
- * ==> V' * (W - w) = (W - w) * V (5)
- *
- * If the entity is the only one in the cfs_rq, then reweight
- * always occurs at 0-lag point, so V won't change. Or else
- * there are other entities, hence W != w, then Eq. (5) turns
- * into V' = V. So V won't change in either case, proof done.
- *
- *
- * So according to corollary #1 & #2, the effect of re-weight
- * on vruntime should be:
- *
- * v' = V' - (V - v) * w / w' (4)
- * = V - (V - v) * w / w'
- * = V - vl * w / w'
- * = V - vl'
- */
- if (avruntime != se->vruntime) {
- vlag = entity_lag(avruntime, se);
- vlag = div_s64(vlag * old_weight, weight);
- se->vruntime = avruntime - vlag;
- }
-
- /*
- * DEADLINE
- * --------
- *
- * When the weight changes, the virtual time slope changes and
- * we should adjust the relative virtual deadline accordingly.
- *
- * d' = v' + (d - v)*w/w'
- * = V' - (V - v)*w/w' + (d - v)*w/w'
- * = V - (V - v)*w/w' + (d - v)*w/w'
- * = V + (d - V)*w/w'
- */
- vslice = (s64)(se->deadline - avruntime);
- vslice = div_s64(vslice * old_weight, weight);
- se->deadline = avruntime + vslice;
-}
+static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags);
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
{
bool curr = cfs_rq->curr == se;
- u64 avruntime;
if (se->on_rq) {
/* commit outstanding execution time */
update_curr(cfs_rq);
- avruntime = avg_vruntime(cfs_rq);
+ update_entity_lag(cfs_rq, se);
+ se->deadline -= se->vruntime;
+ se->rel_deadline = 1;
if (!curr)
__dequeue_entity(cfs_rq, se);
update_load_sub(&cfs_rq->load, se->load.weight);
}
dequeue_load_avg(cfs_rq, se);
- if (se->on_rq) {
- reweight_eevdf(se, avruntime, weight);
- } else {
- /*
- * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
- * we need to scale se->vlag when w_i changes.
- */
- se->vlag = div_s64(se->vlag * se->load.weight, weight);
- }
+ /*
+ * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
+ * we need to scale se->vlag when w_i changes.
+ */
+ se->vlag = div_s64(se->vlag * se->load.weight, weight);
+ if (se->rel_deadline)
+ se->deadline = div_s64(se->deadline * se->load.weight, weight);
update_load_set(&se->load, weight);
@@ -3919,6 +3809,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
update_load_add(&cfs_rq->load, se->load.weight);
+ place_entity(cfs_rq, se, 0);
if (!curr)
__enqueue_entity(cfs_rq, se);
@@ -5359,7 +5250,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
se->vruntime = vruntime - lag;
- if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
+ if (se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
return;
Hi Peter,
after this change, we are seeing big latencies when trying to execute a
simple command via SSH on a Fedora 41 s390x remote system that is under
stress.
I was able to bisect the problem to this commit.
The problem is easy to reproduce with stress-ng running on the otherwise
idle remote system and concurrent SSH connection attempts from a local
system to the remote one.
stress-ng (on remote system)
----------------------------
$ cpus=$(nproc)
$ stress-ng --cpu $((cpus * 2)) --matrix 50 --mq 50 --aggressive --brk 2
--stack 2 --bigheap 2 --userfaultfd 0 --perf -t 5m
SSH connect attempts (from local to remote system)
--------------------------------------------------
$ ssh_options=(
-o UserKnownHostsFile=/dev/null
-o StrictHostKeyChecking=no
-o LogLevel=ERROR
-o ConnectTimeout=10
-o TCPKeepAlive=yes
-o ServerAliveInterval=10
-o PreferredAuthentications=publickey
-o PubkeyAuthentication=yes
-o BatchMode=yes
-o ForwardX11=no
-A
)
$ while true; do time ssh "${ssh_options[@]}" root@remote-system true; sleep 2; done
========
My tests
========
commit v6.12
------------
$ while true; do time ssh "${ssh_options[@]}" root@remote-system true; sleep 2; done
ssh "${ssh_options[@]}" ciuser@a8345039 true 0.01s user 0.00s system 1% cpu 0.919 total
ssh "${ssh_options[@]}" ciuser@a8345039 true 0.00s user 0.00s system 9% cpu 0.068 total
ssh "${ssh_options[@]}" ciuser@a8345039 true 0.00s user 0.00s system 8% cpu 0.069 total
ssh "${ssh_options[@]}" ciuser@a8345039 true 0.00s user 0.00s system 6% cpu 0.092 total
ssh "${ssh_options[@]}" ciuser@a8345039 true 0.00s user 0.00s system 6% cpu 0.097 total
ssh "${ssh_options[@]}" ciuser@a8345039 true 0.00s user 0.00s system 5% cpu 0.109 total
ssh "${ssh_options[@]}" ciuser@a8345039 true 0.00s user 0.00s system 7% cpu 0.083 total
ssh "${ssh_options[@]}" ciuser@a8345039 true 0.00s user 0.00s system 7% cpu 0.079 total
ssh "${ssh_options[@]}" ciuser@a8345039 true 0.00s user 0.00s system 11% cpu 0.054 total
commit 6d71a9c6160479899ee744d2c6d6602a191deb1f
-----------------------------------------------
$ while true; do time ssh "${ssh_options[@]}" root@remote-system true; sleep 2; done
ssh "${ssh_options[@]}" ciuser@a8345034 true 0.01s user 0.00s system 0% cpu 33.379 total
ssh "${ssh_options[@]}" ciuser@a8345034 true 0.00s user 0.00s system 0% cpu 1.206 total
ssh "${ssh_options[@]}" ciuser@a8345034 true 0.00s user 0.00s system 0% cpu 2.388 total
ssh "${ssh_options[@]}" ciuser@a8345034 true 0.00s user 0.00s system 9% cpu 0.055 total
ssh "${ssh_options[@]}" ciuser@a8345034 true 0.00s user 0.00s system 0% cpu 2.376 total
ssh "${ssh_options[@]}" ciuser@a8345034 true 0.00s user 0.00s system 2% cpu 0.243 total
ssh "${ssh_options[@]}" ciuser@a8345034 true 0.00s user 0.00s system 11% cpu 0.049 total
ssh "${ssh_options[@]}" ciuser@a8345034 true 0.00s user 0.00s system 0% cpu 2.563 total
ssh "${ssh_options[@]}" ciuser@a8345034 true 0.00s user 0.00s system 8% cpu 0.065 total
Thank you
Regards
Alex
On 2025.04.17 02:57 Alexander Egorenkov wrote:
> Hi Peter,
>
> after this change, we are seeing big latencies when trying to execute a
> simple command via SSH on a Fedora 41 s390x remote system that is under
> stress.
>
> I was able to bisect the problem to this commit.
>
> The problem is easy to reproduce with stress-ng running on the otherwise
> idle remote system and concurrent SSH connection attempts from a local
> system to the remote one.
>
> stress-ng (on remote system)
> ----------------------------
>
> $ cpus=$(nproc)
> $ stress-ng --cpu $((cpus * 2)) --matrix 50 --mq 50 --aggressive --brk 2
>   --stack 2 --bigheap 2 --userfaultfd 0 --perf -t 5m

That is a very, very stressful test. It crashes within a few seconds on my
test computer with a "Segmentation fault (core dumped)" message.

If I back it off to this:

$ stress-ng --cpu 24 --matrix 50 --mq 50 --aggressive --brk 2 --stack 2 --bigheap 2 -t 300m

it runs, but still makes a great many entries in /var/log/kern.log as the
OOM killer runs, etc. I am suggesting it is not a reasonable test workload.

Anyway, I used turbostat the same way I was using it back in January for
this work, and did observe longer-than-requested intervals. I took 1427
samples and got 10 where the interval was more than 1 second longer than
requested. The worst was 7.5 seconds longer than requested.

I rechecked the 100% workload used in January (12x "yes > /dev/null") and
it was fine: 3551 samples and the actual interval was never more than 10
milliseconds longer than requested.

Kernel: 6.15-rc2
Turbostat version: 2025.04.06
Turbostat sample interval: 2 seconds
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (12 CPUs, 6 cores)

... Doug
Hi all,

> That is a very, very stressful test. It crashes within a few seconds on my
> test computer with a "Segmentation fault (core dumped)" message.

Yes, this is an artificial test I came up with to demonstrate the problem
we see with another, more realistic test which I can hardly use here for
the sake of demonstration. But it reveals the exact same problem we have
with our CI tests on s390x test systems.

Let me explain briefly how it happens.

Basically, we have a test system on which we execute a test suite, and we
simultaneously monitor this system from another system via simple SSH
logins (invoked approximately every 15 seconds) to check whether the test
system is still online; we dump it automatically if it remains
unresponsive for 5 minutes straight. We limit every such SSH login to 10
seconds, just to make our monitoring robust, because we have had
situations where SSH hung for a long time due to various problems with
networking, the test system itself, etc.

And since the commit "sched/fair: Fix EEVDF entity placement bug causing
scheduling lag" we regularly see SSH logins (limited to 10s) failing for
5 minutes straight; not a single SSH login succeeds. This happens
regularly with test suites that compile software with GCC and use all
CPUs at 100%. Before the commit, an SSH login took under 1 second. I
cannot judge whether the problem is really in this commit, or whether it
is just an accumulated effect of multiple ones.

FYI: one such system where it happens regularly has 7 cores (5.2 GHz,
SMT-2, 14 CPUs), 8G of main memory and 20G of swap.

Thanks
Regards
Alex
Hi Alexander,

Thank you for your reply.

Note that I have adjusted the address list for this email, because I don't
know if bots can get emails, and Peter was not on the "To" line and might
not have noticed this thread.

@Peter: off-list I will forward you the other emails, in case you missed
them. I apologise if you did see them but haven't had time to get to them
or whatever. Also note that I know nothing about the scheduler and was
only on the original email because I had a "Reported-by" tag.

On 2025.04.24 00:57 Alexander Egorenkov wrote:
> Hi all,
>
[Doug wrote]
>> That is a very, very stressful test. It crashes within a few seconds on
>> my test computer with a "Segmentation fault (core dumped)" message.
>
> Yes, this is an artificial test I came up with to demonstrate the problem
> we see with another, more realistic test which I can hardly use here for
> the sake of demonstration. But it reveals the exact same problem we have
> with our CI tests on s390x test systems.
>
> Let me explain briefly how it happens.
>
> Basically, we have a test system on which we execute a test suite, and we
> simultaneously monitor this system from another system via simple SSH
> logins (invoked approximately every 15 seconds) to check whether the test
> system is still online; we dump it automatically if it remains
> unresponsive for 5 minutes straight. We limit every such SSH login to 10
> seconds, just to make our monitoring robust, because we have had
> situations where SSH hung for a long time due to various problems with
> networking, the test system itself, etc.
>
> And since the commit "sched/fair: Fix EEVDF entity placement bug causing
> scheduling lag" we regularly see SSH logins (limited to 10s) failing for
> 5 minutes straight; not a single SSH login succeeds. This happens
> regularly with test suites that compile software with GCC and use all
> CPUs at 100%. Before the commit, an SSH login took under 1 second. I
> cannot judge whether the problem is really in this commit, or whether it
> is just an accumulated effect of multiple ones.
>
> FYI: one such system where it happens regularly has 7 cores (5.2 GHz,
> SMT-2, 14 CPUs), 8G of main memory and 20G of swap.
>
> Thanks
> Regards
> Alex

Thanks for the explanation.

I have recreated your situation with a workflow that, while it stresses
the CPUs, doesn't make any entries in /var/log/kern.log or /var/log/syslog.
Under the same conditions, I have confirmed that the SSH login lag doesn't
occur with kernel 6.12, but does with kernel 6.13.

My workflow is stuff I have used for many years and wrote myself.
Basically, I create a huge queue of running tasks, with each doing a
little work and then sleeping for a short period (a rough sketch of this
kind of workload follows below). I have two methods of achieving a
similar overall workload, and one shows the issue and one does not. I can
also create a huge queue by just increasing the number of "yes" tasks to
a ridiculous number, but that does not show your SSH login lag issue.

Anyway, for the workflow that does show your issue, I had a load average
of about 19,500 (20,000 tasks) and SSH login times ranged from 10 to 38
seconds, with an average of about 13 seconds. SSH login times using
kernel 6.12 were negligible.

... Doug
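Doug's actual scripts are not included in the thread; a hedged sketch of the
kind of workload he describes (a very large run queue of tasks, each doing a
little work and then sleeping briefly) could look like this:

/*
 * Illustration only (not Doug's script): spawn many tasks that each do
 * a short burst of work and then sleep, to build up a very large run
 * queue without exhausting memory.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

static void worker(void)
{
	volatile unsigned long x = 0;
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */

	for (;;) {
		for (int i = 0; i < 100000; i++)	/* a little work */
			x += i;
		nanosleep(&ts, NULL);			/* then a short sleep */
	}
}

int main(int argc, char **argv)
{
	int ntasks = argc > 1 ? atoi(argv[1]) : 20000;

	for (int i = 0; i < ntasks; i++) {
		pid_t pid = fork();
		if (pid == 0)
			worker();	/* child never returns */
		if (pid < 0) {
			perror("fork");
			break;
		}
	}
	pause();	/* parent waits; kill the process group to stop */
	return 0;
}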