[PATCH] sched/fair: Fix vruntime drift by preventing double lag scaling during reweight
Posted by Zicheng Qu 1 month, 2 weeks ago
In reweight_entity(), when reweighting a currently running entity (se ==
cfs_rq->curr), the entity remains on the runqueue context without
undergoing a full dequeue/enqueue cycle. This means avg_vruntime()
remains constant throughout the reweight operation.

However, the current implementation calls place_entity(..., 0) at the
end of reweight_entity(). Under EEVDF, place_entity() is designed to
handle entities entering the runqueue and calculates the virtual lag
(vlag) to account for the change in the weighted average vruntime (V)
using the formula:

	vlag' = vlag * (W + w_i) / W

Where 'W' is the current aggregate weight (including
cfs_rq->curr->load.weight) and 'w_i' is the weight of the entity being
enqueued (in this case, the se is exactly the cfs_rq->curr).
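
For reference, the scaling in question in place_entity() looks roughly
like this (abridged; 'load' starts from cfs_rq->avg_load, the
scaled-down weight sum of the tree entities):

	load = cfs_rq->avg_load;
	if (curr && curr->on_rq)
		load += scale_load_down(curr->load.weight);

	lag *= load + scale_load_down(se->load.weight);
	if (WARN_ON_ONCE(!load))
		load = 1;
	lag = div_s64(lag, load);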

This leads to a "double scaling" logic for running entities:
1. reweight_entity() already rescales se->vlag based on the new weight
   ratio.
2. place_entity() then mistakenly applies the (W + w_i)/W scaling again,
   treating the reweight as a fresh enqueue into a new total weight
pool.

This can cause the entity's vlag to be amplified (if positive) or
suppressed (if negative) incorrectly during the reweight process.
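
As an illustration with made-up numbers: suppose the tree entities'
scaled-down weights sum to avg_load = 2048 and curr's scaled-down
weight is 1024, so W = 3072. Reweighting the running entity then
applies

	vlag' = vlag * (3072 + 1024) / 3072 = vlag * 4/3

inflating a positive lag by a third (and deepening a negative one) on
every reweight, even though no enqueue took place.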

In environments with frequent cgroup throttle/unthrottle operations,
this math error manifests as a vruntime drift.

A hungtask was observed as below:
crash> runq -c 0 -g
CPU 0
  CURRENT: PID: 330440  TASK: ffff00004cd61540  COMMAND: "stress-ng"
  ROOT_TASK_GROUP: ffff8001025fa4c0  RT_RQ: ffff0000fff42500
	 [no tasks queued]
  ROOT_TASK_GROUP: ffff8001025fa4c0  CFS_RQ: ffff0000fff422c0
	 TASK_GROUP: ffff0000c130fc00  CFS_RQ: ffff00009125a400  <test_cg>	cfs_bandwidth: period=100000000, quota=18446744073709551615, gse: 0xffff000091258c00, vruntime=127285708384434, deadline=127285714880550, vlag=11721467, weight=338965, my_q=ffff00009125a400, cfs_rq: avg_vruntime=0, zero_vruntime=2029704519792, avg_load=0, nr_running=1
		TASK_GROUP: ffff0000d7cc8800  CFS_RQ: ffff0000c8f86800  <test_test329274_1>	cfs_bandwidth: period=14000000, quota=14000000, gse: 0xffff0000c8f86400, vruntime=2034894470719, deadline=2034898697770, vlag=0, weight=215291, my_q=ffff0000c8f86800, cfs_rq: avg_vruntime=-422528991, zero_vruntime=8444226681954, avg_load=54, nr_running=19
		   [110] PID: 330440  TASK: ffff00004cd61540  COMMAND: "stress-ng" [CURRENT]    vruntime=8444367524951, deadline=8444932411139, vlag=8444932411139, weight=3072, last_arrival=4002964107010, last_queued=0, exec_start=3872860294100, sum_exec_runtime=22252021900
		   ...
		   [110] PID: 330291  TASK: ffff0000c02c9540  COMMAND: "stress-ng"	vruntime=8444229273009, deadline=8444946073008, vlag=-2701415, weight=3072, last_arrival=4002964076840, last_queued=4002964550990, exec_start=3872859839290, sum_exec_runtime=22310951770
	 [100] PID: 97     TASK: ffff0000c2432a00  COMMAND: "kworker/0:1H"	vruntime=127285720095197, deadline=127285720119423, vlag=48453, weight=90891264, last_arrival=3846600432710, last_queued=3846600721010, exec_start=3743307237970, sum_exec_runtime=413405210
	 [120] PID: 15     TASK: ffff0000c0368080  COMMAND: "ksoftirqd/0"	vruntime=127285722433404, deadline=127285724533404, vlag=0, weight=1048576, last_arrival=3506755665780, last_queued=3506852159390, exec_start=3461615726670, sum_exec_runtime=16341041340
	 [120] PID: 50173  TASK: ffff0000741d8080  COMMAND: "kworker/0:0"	vruntime=127285722960040, deadline=127285725060040, vlag=-414755, weight=1048576, last_arrival=3506828139580, last_queued=3506972354700, exec_start=3461676584440, sum_exec_runtime=84414080
	 [120] PID: 58662  TASK: ffff000091180080  COMMAND: "kworker/0:2"	vruntime=127285723428168, deadline=127285725528168, vlag=3049158, weight=1048576, last_arrival=3505689085070, last_queued=3506848131990, exec_start=3460592328510, sum_exec_runtime=89193000

TASK 1 (systemd) is waiting for cgroup_mutex.
TASK 329296 (sh) holds cgroup_mutex and is waiting for cpus_read_lock.
TASK 50173 (kworker/0:0) holds the cpus_read_lock, but fails to be
scheduled.
test_cg and TASK 97 may have suppressed TASK 50173, causing
it to not be scheduled for a long time, thus failing to release locks in
a timely manner and ultimately causing a hungtask issue.

Fix by adding ENQUEUE_REWEIGHT_CURR flag and skipping vlag recalculation
in place_entity() when reweighting the current running entity. For
non-current entities, the existing logic remains as dequeue/enqueue
changes avg_vruntime().

Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c  | 11 ++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..3be42729049e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3787,7 +3787,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, curr ? ENQUEUE_REWEIGHT_CURR : 0);
 		update_load_add(&cfs_rq->load, se->load.weight);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
@@ -5123,6 +5123,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 		lag = se->vlag;
 
+		/*
+		 * ENQUEUE_REWEIGHT_CURR:
+		 * current running se (cfs_rq->curr) should skip vlag recalculation,
+		 * because avg_vruntime(...) hasn't changed.
+		 */
+		if (flags & ENQUEUE_REWEIGHT_CURR)
+			goto skip_lag_scale;
+
 		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
@@ -5185,6 +5193,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		lag = div_s64(lag, load);
 	}
 
+skip_lag_scale:
 	se->vruntime = vruntime - lag;
 
 	if (se->rel_deadline) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..e3a43f94dd2f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2412,6 +2412,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATED	0x00040000
 #define ENQUEUE_INITIAL		0x00080000
 #define ENQUEUE_RQ_SELECTED	0x00100000
+#define ENQUEUE_REWEIGHT_CURR	0x00200000
 
 #define RETRY_TASK		((void *)-1UL)
 
-- 
2.34.1
Re: [PATCH] sched/fair: Fix vruntime drift by preventing double lag scaling during reweight
Posted by Peter Zijlstra 1 week, 6 days ago
On Fri, Dec 26, 2025 at 12:17:31AM +0000, Zicheng Qu wrote:
> In reweight_entity(), when reweighting a currently running entity (se ==
> cfs_rq->curr), the entity remains on the runqueue context without
> undergoing a full dequeue/enqueue cycle.

I am horribly confused by this statement. In case current is on_rq
(most often the case) then it is dequeued just as much as any other
on_rq task being reweighted.

[edit: ... after much puzzling ...

Ooh, are you perhaps alluding to the fact that avg_vruntime() always
adds cfs_rq->curr in?]

> This means avg_vruntime()
> remains constant throughout the reweight operation.

Now this might have a point; but I don't see how it would be specific to
current in any way.

Consider the tree entities (vruntime,weight):

 (-40,10)
 ( 30,10)
 (-20,10)

This gives:

         -40*10 + 30*10 + -20*10   -300
 w_avg = ----------------------- = ---- = -10
                    30              30

Now, let's randomly pick (30,10) and reweight it to 5, then we have:

  vlag  = w_avg - vruntime = -10 - 30 = -40
  vlag' = vlag * weight / weight' = -40*10/5 = -80

Then placing that at -10 would give:

 (-40,10)
 ( 70,5 )
 (-20,10)

         -40*10 + 70*5 + -20*10   -250
 w_avg = ---------------------- = ---- = -10
                    25             25

However, if we pick any other entity, let's say (-40,10), and do the same:

 30*10/5 = 60

placing at -10 gives, (-70,5), giving the w_avg:

         -70*5 + 30*10 + -20*10   -250
 w_avg = ---------------------- = ---- = -10
                    25             25


Or rather, we can say that w_avg is/should-be invariant under the
reweight transform. Current or no current, it doesn't matter.
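
Spelling that out once, with S = \Sum v_j*w_j = w_avg*W: reweighting
entity i from w to w' and placing it at v' = w_avg - (w_avg - v)*w/w'
gives

 S' = S - v*w + v'*w'
    = S - v*w + (w_avg*w' - (w_avg - v)*w)
    = S + w_avg*(w' - w)
    = w_avg*(W + w' - w) = w_avg*W'

so S'/W' = w_avg, whichever entity is picked.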

Which is more or less what we had prior to: 6d71a9c61604 ("sched/fair:
Fix EEVDF entity placement bug causing scheduling lag")

> However, the current implementation calls place_entity(..., 0) at the
> end of reweight_entity(). Under EEVDF, place_entity() is designed to
> handle entities entering the runqueue and calculates the virtual lag
> (vlag) to account for the change in the weighted average vruntime (V)
> using the formula:
> 
> 	vlag' = vlag * (W + w_i) / W
>
> Where 'W' is the current aggregate weight (including
> cfs_rq->curr->load.weight) and 'w_i' is the weight of the entity being
> enqueued (in this case, the se is exactly the cfs_rq->curr).
> 
> This leads to a "double scaling" logic for running entities:
> 1. reweight_entity() already rescales se->vlag based on the new weight
>    ratio.
> 2. place_entity() then mistakenly applies the (W + w_i)/W scaling again,
>    treating the reweight as a fresh enqueue into a new total weight
> pool.
> 
> This can cause the entity's vlag to be amplified (if positive) or
> suppressed (if negative) incorrectly during the reweight process.

Again, nothing specific to current here.

[edit: except for the fact that avg_vruntime() actually does include
current when re-adding current -- but that still doesn't explain why it
would be a good idea to have two different ways of doing reweight]

This all suggests we should revert 6d71a9c61604 ("sched/fair: Fix EEVDF
entity placement bug causing scheduling lag"), except we first need an
explanation for the problem described there.

At the time I observed reweight cycles 1048576 -> 2 -> 1048576 inflating
the lag, which obviously should not be.

I might have made a mistake, but we should ensure we're not
re-introducing that problem, so let me see if I can figure out where
that came from if not from this.

Now, poking at all this, I did find some numerical stability issues, but
rather than blowing up like I saw for 6d71a9c61604, these cause lag to
evaporate (also not good).

Observe the below proglet; when using w_avg_t (for truncate, like
sum_w_vruntime with use of scale_load_down()):

vruntime: 40960 (-51200) -> 40960 (-51200)
avg: -10240 -- -10240
vruntime: 40960 (-51200) -> 26843535360 (-26843545600)
avg: -35840 -- -35840
vruntime: 26843535360 (-26843571200) -> 15360 (-51200)
avg: -18773 -- -18773
vruntime: 15360 (-34133) -> 17895503531 (-17895522304)
avg: -35840 -- -35840
vruntime: 17895503531 (-17895539371) -> -1707 (-34133)
avg: -24462 -- -24462
vruntime: -1707 (-22755) -> 11930148978 (-11930173440)
avg: -35840 -- -35840
vruntime: 11930148978 (-11930184818) -> -13085 (-22755)
avg: -28255 -- -28255


Where as, if we run it with w_avg_n:

vruntime: 40960 (-51200) -> 40960 (-51200)
avg: -10240 -- -10240
vruntime: 40960 (-51200) -> 26843535360 (-26843545600)
avg: -10240 -- -35840
vruntime: 26843535360 (-26843545600) -> 40960 (-51200)
avg: -10240 -- -10240
vruntime: 40960 (-51200) -> 26843535360 (-26843545600)
avg: -10240 -- -35840
vruntime: 26843535360 (-26843545600) -> 40960 (-51200)
avg: -10240 -- -10240
vruntime: 40960 (-51200) -> 26843535360 (-26843545600)
avg: -10240 -- -35840
vruntime: 26843535360 (-26843545600) -> 40960 (-51200)
avg: -10240 -- -10240

Specifically, the problem is that the weight of 2 is below the
representable value, causing accuracy issues for avg_vruntime() which
then propagate.

Luckily, we recently replaced min_vruntime with zero_vruntime, which
tracks avg_vruntime far better, resulting in smaller 'keys', which in
turn -- as it happens -- allows us to get rid of that
scale_load_down().


Could you please try queue.git/sched/reweight ?

I've only confirmed it boots on my machine. I'll prod a little more at
it tomorrow.


---
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

#define min(a,b) \
	({ __typeof__ (a) _a = (a); \
	   __typeof__ (b) _b = (b); \
	   _a < _b ? _a : _b; })

struct e { int64_t vr; int64_t w; };

const int scale = 1024;

struct e es[] = {
	{  40*scale, 1024*1024 },
	{ -20*scale, 1024*1024 },
	{ -50*scale, 1024*1024 },
};

const int n = sizeof(es)/sizeof(es[0]);

int64_t w_avg_n(void)
{
	int64_t ws = 0, w = 0;
	for (int i=0; i<n; i++) {
		w += es[i].w;
		ws += es[i].vr * es[i].w;
	}
	return ws/w;
}

int64_t w_avg_t(void)
{
	int64_t ws = 0, w = 0;
	for (int i=0; i<n; i++) {
		int64_t t = min(2, es[i].w >> 10);
		w += t;
		ws += es[i].vr * t;
	}
	return ws/w;
}

typedef int64_t (*avg_f)(void);

avg_f w_avg = &w_avg_t; // w_avg_n for good, w_avg_t for broken

void reweight(int i, int64_t w)
{
	int64_t a = w_avg();
	int64_t t, vl = a - es[i].vr;
	t = (vl * es[i].w) / w;
	printf("vruntime: %Ld (%Ld) -> %Ld (%Ld)\n", es[i].vr, vl, a-t, t);
	es[i].vr = a - t;
	es[i].w = w;
}

int main(int argc, char **argv)
{
	int i = 0;
	
	if (argc > 1)
		i = atoi(argv[1]);

	reweight(i, 1024*1024);
	printf("avg: %Ld -- %Ld\n", w_avg(), w_avg_t());
	for (int j=0; j<3; j++) {
		reweight(i, 2);
		printf("avg: %Ld -- %Ld\n", w_avg(), w_avg_t());
		reweight(i, 1024*1024);
		printf("avg: %Ld -- %Ld\n", w_avg(), w_avg_t());
	}
	return 0;
}
Re: [PATCH] sched/fair: Fix vruntime drift by preventing double lag scaling during reweight
Posted by K Prateek Nayak 1 month ago
Hello Zicheng,

On 12/26/2025 5:47 AM, Zicheng Qu wrote:
> In reweight_entity(), when reweighting a currently running entity (se ==
> cfs_rq->curr), the entity remains on the runqueue context without
> undergoing a full dequeue/enqueue cycle. This means avg_vruntime()
> remains constant throughout the reweight operation.
> 
> However, the current implementation calls place_entity(..., 0) at the
> end of reweight_entity(). Under EEVDF, place_entity() is designed to
> handle entities entering the runqueue and calculates the virtual lag
> (vlag) to account for the change in the weighted average vruntime (V)
> using the formula:
> 
> 	vlag' = vlag * (W + w_i) / W
> 
> Where 'W' is the current aggregate weight (including
> cfs_rq->curr->load.weight) and 'w_i' is the weight of the entity being
> enqueued (in this case, the se is exactly the cfs_rq->curr).
> 
> This leads to a "double scaling" logic for running entities:
> 1. reweight_entity() already rescales se->vlag based on the new weight
>    ratio.
> 2. place_entity() then mistakenly applies the (W + w_i)/W scaling again,
>    treating the reweight as a fresh enqueue into a new total weight
> pool.
> 
> This can cause the entity's vlag to be amplified (if positive) or
> suppressed (if negative) incorrectly during the reweight process.
> 
> In environments with frequent cgroup throttle/unthrottle operations,
> this math error manifests as a vruntime drift.
> 
> A hungtask was observed as below:
> crash> runq -c 0 -g
> CPU 0
>   CURRENT: PID: 330440  TASK: ffff00004cd61540  COMMAND: "stress-ng"
>   ROOT_TASK_GROUP: ffff8001025fa4c0  RT_RQ: ffff0000fff42500
> 	 [no tasks queued]
>   ROOT_TASK_GROUP: ffff8001025fa4c0  CFS_RQ: ffff0000fff422c0
> 	 TASK_GROUP: ffff0000c130fc00  CFS_RQ: ffff00009125a400  <test_cg>	cfs_bandwidth: period=100000000, quota=18446744073709551615, gse: 0xffff000091258c00, vruntime=127285708384434, deadline=127285714880550, vlag=11721467, weight=338965, my_q=ffff00009125a400, cfs_rq: avg_vruntime=0, zero_vruntime=2029704519792, avg_load=0, nr_running=1
> 		TASK_GROUP: ffff0000d7cc8800  CFS_RQ: ffff0000c8f86800  <test_test329274_1>	cfs_bandwidth: period=14000000, quota=14000000, gse: 0xffff0000c8f86400, vruntime=2034894470719, deadline=2034898697770, vlag=0, weight=215291, my_q=ffff0000c8f86800, cfs_rq: avg_vruntime=-422528991, zero_vruntime=8444226681954, avg_load=54, nr_running=19
> 		   [110] PID: 330440  TASK: ffff00004cd61540  COMMAND: "stress-ng" [CURRENT]    vruntime=8444367524951, deadline=8444932411139, vlag=8444932411139, weight=3072, last_arrival=4002964107010, last_queued=0, exec_start=3872860294100, sum_exec_runtime=22252021900
> 		   ...
> 		   [110] PID: 330291  TASK: ffff0000c02c9540  COMMAND: "stress-ng"	vruntime=8444229273009, deadline=8444946073008, vlag=-2701415, weight=3072, last_arrival=4002964076840, last_queued=4002964550990, exec_start=3872859839290, sum_exec_runtime=22310951770
> 	 [100] PID: 97     TASK: ffff0000c2432a00  COMMAND: "kworker/0:1H"	vruntime=127285720095197, deadline=127285720119423, vlag=48453, weight=90891264, last_arrival=3846600432710, last_queued=3846600721010, exec_start=3743307237970, sum_exec_runtime=413405210
> 	 [120] PID: 15     TASK: ffff0000c0368080  COMMAND: "ksoftirqd/0"	vruntime=127285722433404, deadline=127285724533404, vlag=0, weight=1048576, last_arrival=3506755665780, last_queued=3506852159390, exec_start=3461615726670, sum_exec_runtime=16341041340
> 	 [120] PID: 50173  TASK: ffff0000741d8080  COMMAND: "kworker/0:0"	vruntime=127285722960040, deadline=127285725060040, vlag=-414755, weight=1048576, last_arrival=3506828139580, last_queued=3506972354700, exec_start=3461676584440, sum_exec_runtime=84414080
> 	 [120] PID: 58662  TASK: ffff000091180080  COMMAND: "kworker/0:2"	vruntime=127285723428168, deadline=127285725528168, vlag=3049158, weight=1048576, last_arrival=3505689085070, last_queued=3506848131990, exec_start=3460592328510, sum_exec_runtime=89193000
> 
> TASK 1 (systemd) is waiting for cgroup_mutex.
> TASK 329296 (sh) holds cgroup_mutex and is waiting for cpus_read_lock.
> TASK 50173 (kworker/0:0) holds the cpus_read_lock, but fails to be
> scheduled.
> test_cg and TASK 97 may have suppressed TASK 50173, causing
> it to not be scheduled for a long time, thus failing to release locks in
> a timely manner and ultimately causing a hungtask issue.
> 
> Fix by adding ENQUEUE_REWEIGHT_CURR flag and skipping vlag recalculation
> in place_entity() when reweighting the current running entity. For
> non-current entities, the existing logic remains as dequeue/enqueue
> changes avg_vruntime().
> 
> Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
> ---
>  kernel/sched/fair.c  | 11 ++++++++++-
>  kernel/sched/sched.h |  1 +
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da46c3164537..3be42729049e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3787,7 +3787,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>  
>  	enqueue_load_avg(cfs_rq, se);
>  	if (se->on_rq) {
> -		place_entity(cfs_rq, se, 0);
> +		place_entity(cfs_rq, se, curr ? ENQUEUE_REWEIGHT_CURR : 0);
>  		update_load_add(&cfs_rq->load, se->load.weight);
>  		if (!curr)
>  			__enqueue_entity(cfs_rq, se);
> @@ -5123,6 +5123,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  
>  		lag = se->vlag;
>  
> +		/*
> +		 * ENQUEUE_REWEIGHT_CURR:
> +		 * current running se (cfs_rq->curr) should skip vlag recalculation,
> +		 * because avg_vruntime(...) hasn't changed.
> +		 */
> +		if (flags & ENQUEUE_REWEIGHT_CURR)
> +			goto skip_lag_scale;

If I'm not mistaken, the problem is that we'll see "curr->on_rq" and
then do:

    if (curr && curr->on_rq)
        load += scale_load_down(curr->load.weight);

    lag *= load + scale_load_down(se->load.weight);


which shouldn't be the case since we are accounting "se" twice when
it is also the "curr" and avg_vruntime() would have also accounted it
already since "curr->on_rq" and then we do everything twice for "se".

I'm wondering if instead of adding a flag, we can do:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7377f9117501..7b4a7f4f2efa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3792,8 +3792,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
 	bool curr = cfs_rq->curr == se;
+	bool queued = !!se->on_rq;
 
-	if (se->on_rq) {
+	if (queued) {
 		/* commit outstanding execution time */
 		update_curr(cfs_rq);
 		update_entity_lag(cfs_rq, se);
@@ -3803,6 +3804,12 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		if (!curr)
 			__dequeue_entity(cfs_rq, se);
 		update_load_sub(&cfs_rq->load, se->load.weight);
+		/*
+		 * Indicate that se is off the cfs_rq for place_entity()
+		 * to correctly scale the weight especially when curr is
+		 * being placed back.
+		 */
+		se->on_rq = 0;
 	}
 	dequeue_load_avg(cfs_rq, se);
 
@@ -3823,12 +3830,14 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	} while (0);
 
 	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq) {
+	if (queued) {
 		place_entity(cfs_rq, se, 0);
 		update_load_add(&cfs_rq->load, se->load.weight);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
 		cfs_rq->nr_queued++;
+		/* Entity has been enqueued back. */
+		se->on_rq = 1;
 	}
 }
 
---

This matches what we do for curr in enqueue_entity() where we know
"cfs_rq->curr == se" but "se->on_rq == 0". Thoughts?

On a side note, I was looking at requeue_delayed_entity() and was
wondering if something like this makes sense there since it also does a
place_entity() but then an entity can never be "cfs_rq->curr" and be
delayed when we drop the rq_lock:

1) If se is ineligible, there must be another queued entity and if it is
   runnable, pick_task_fair() will pick the runnable entity and do an
   equivalent of (put_prev/set_next)_entity() to switch the
   "cfs_rq->curr" to the runnable hierarchy before dropping the rq_lock.

2) If everything is delayed, pick_next_entity() will dequeue them all
   completely before dropping the rq_lock for idle balancing.

FWIW, I've been running with:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7377f9117501..550bddfb2cc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6843,6 +6852,7 @@ requeue_delayed_entity(struct sched_entity *se)
 	 */
 	WARN_ON_ONCE(!se->sched_delayed);
 	WARN_ON_ONCE(!se->on_rq);
+	WARN_ON_ONCE(cfs_rq->curr == se);
 
 	if (sched_feat(DELAY_ZERO)) {
 		update_entity_lag(cfs_rq, se);
---

and I haven't seen any splats (yet!) :-)

Peter, thoughts?

> +
>  		/*
>  		 * If we want to place a task and preserve lag, we have to
>  		 * consider the effect of the new entity on the weighted
> @@ -5185,6 +5193,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  		lag = div_s64(lag, load);
>  	}
>  
> +skip_lag_scale:
>  	se->vruntime = vruntime - lag;
>  
>  	if (se->rel_deadline) {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d30cca6870f5..e3a43f94dd2f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2412,6 +2412,7 @@ extern const u32		sched_prio_to_wmult[40];
>  #define ENQUEUE_MIGRATED	0x00040000
>  #define ENQUEUE_INITIAL		0x00080000
>  #define ENQUEUE_RQ_SELECTED	0x00100000
> +#define ENQUEUE_REWEIGHT_CURR	0x00200000
>  
>  #define RETRY_TASK		((void *)-1UL)
>  

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched/fair: Fix vruntime drift by preventing double lag scaling during reweight
Posted by Zicheng Qu 1 month ago
Hi Prateek,

On 1/9/2026 12:50 PM, K Prateek Nayak wrote:
> If I'm not mistaken, the problem is that we'll see "curr->on_rq" and
> then do:
>
>      if (curr && curr->on_rq)
>          load += scale_load_down(curr->load.weight);
>
>      lag *= load + scale_load_down(se->load.weight);
>
>
> which shouldn't be the case since we are accounting "se" twice when
> it is also the "curr" and avg_vruntime() would have also accounted it
> already since "curr->on_rq" and then we do everything twice for "se".
Thanks for the analysis. I agree your concern is reasonable, but I
think the issue here is slightly different from "accounting se twice":
it is a semantic mismatch in how place_entity() is used.

place_entity() is meant to compensate lag for entities being inserted
into the runqueue, accounting for the effect of a new entity on the
weighted average vruntime. That assumption holds when an se is joining
the rq. However, when se == cfs_rq->curr, the entity never left the
runqueue and avg_vruntime() has not changed, so applying enqueue-style
lag scaling is not appropriate.
> I'm wondering if instead of adding a flag, we can do:
Yes, I totally agree that adding a new flag is unnecessary. We
can handle this directly in place_entity() by skipping lag scaling in
case of `se == cfs_rq->curr`, for example:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..1b279bf43f38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5123,6 +5123,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 		lag = se->vlag;
 
+		/*
+		 * place_entity() compensates lag for entities being inserted into the
+		 * runqueue. When se == cfs_rq->curr, the entity never left the rq and
+		 * avg_vruntime() did not change, so enqueue-style lag scaling does not
+		 * apply.
+		 */
+		if (se == cfs_rq->curr)
+			goto skip_lag_scale;
+
 		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
@@ -5185,6 +5194,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		lag = div_s64(lag, load);
 	}
 
+skip_lag_scale:
 	se->vruntime = vruntime - lag;
 
 	if (se->rel_deadline) {

Best regards,
Zicheng
Re: [PATCH] sched/fair: Fix vruntime drift by preventing double lag scaling during reweight
Posted by K Prateek Nayak 1 month ago
Hello Zicheng,

On 1/9/2026 2:10 PM, Zicheng Qu wrote:
> Hi Prateek,
> 
> On 1/9/2026 12:50 PM, K Prateek Nayak wrote:
>> If I'm not mistaken, the problem is that we'll see "curr->on_rq" and
>> then do:
>>
>>      if (curr && curr->on_rq)
>>          load += scale_load_down(curr->load.weight);
>>
>>      lag *= load + scale_load_down(se->load.weight);
>>
>>
>> which shouldn't be the case since we are accounting "se" twice when
>> it is also the "curr" and avg_vruntime() would have also accounted it
>> already since "curr->on_rq" and then we do everything twice for "se".
> Thanks for the analysis. I agree your concern is reasonable, but I
> think the issue here is slightly different from "accounting se twice":
> it is a semantic mismatch in how place_entity() is used.
> 
> place_entity() is meant to compensate lag for entities being inserted
> into the runqueue, accounting for the effect of a new entity on the
> weighted average vruntime. That assumption holds when an se is joining
> the rq. However, when se == cfs_rq->curr, the entity never left the
> runqueue and avg_vruntime() has not changed, so applying enqueue-style
> lag scaling is not appropriate.

I believe the intention is to discount the contribution of the
task and then re-account it after the reweight. I don't think
se being the "curr" makes it any different except for the fact that
its vruntime and load contribution isn't reflected in the sum and
is added in by avg_vruntime().
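
For reference, the bit of avg_vruntime() in question, abridged:

	struct sched_entity *curr = cfs_rq->curr;
	s64 avg = cfs_rq->avg_vruntime;
	long load = cfs_rq->avg_load;

	if (curr && curr->on_rq) {
		unsigned long weight = scale_load_down(curr->load.weight);

		avg += entity_key(cfs_rq, curr) * weight;
		load += weight;
	}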

>> I'm wondering if instead of adding a flag, we can do:
> Yes, I totally agree that adding a new flag is unnecessary. We
> can handle this directly in place_entity() by skipping lag scaling in
> case of `se == cfs_rq->curr`, for example:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da46c3164537..1b279bf43f38 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5123,6 +5123,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> 
>                 lag = se->vlag;
> 
> +              /*
> +               * place_entity() compensates lag for entities being inserted into the
> +               * runqueue. When se == cfs_rq->curr, the entity never left the rq and
> +               * avg_vruntime() did not change, so enqueue-style lag scaling does not
> +               * apply.
> +               */
> +              if (se == cfs_rq->curr)
> +                      goto skip_lag_scale;

This affects the place_entity() from enqueue_task() where se is dequeued
(se->on_rq == 0) but it is still the curr - can happen when rq drops
lock for newidle balance and a concurrent wakeup is queuing the task.

You need to check for "se == cfs_rq->curr && se->on_rq" here and then
this bit should be good.

Let me stare more at the avg_vruntime() since "curr->on_rq" would add
its vruntime too in that calculation.

> +
>                 /*
>                  * If we want to place a task and preserve lag, we have to
>                  * consider the effect of the new entity on the weighted
> @@ -5185,6 +5194,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>                 lag = div_s64(lag, load);
>         }
> 
> +skip_lag_scale:
>         se->vruntime = vruntime - lag;
> 
>         if (se->rel_deadline) {
> 
> Best regards,
> Zicheng

-- 
Thanks and Regards,
Prateek

Re: [PATCH] sched/fair: Fix vruntime drift by preventing double lag scaling during reweight
Posted by Zicheng Qu 3 weeks, 6 days ago
On 1/9/2026 6:21 PM, K Prateek Nayak wrote:

> Hello Zicheng,
>
> On 1/9/2026 2:10 PM, Zicheng Qu wrote:
>> Hi Prateek,
>>
>> On 1/9/2026 12:50 PM, K Prateek Nayak wrote:
>>> If I'm not mistaken, the problem is that we'll see "curr->on_rq" and
>>> then do:
>>>
>>>       if (curr && curr->on_rq)
>>>           load += scale_load_down(curr->load.weight);
>>>
>>>       lag *= load + scale_load_down(se->load.weight);
>>>
>>>
>>> which shouldn't be the case since we are accounting "se" twice when
>>> it is also the "curr" and avg_vruntime() would have also accounted it
>>> already since "curr->on_rq" and then we do everything twice for "se".
>> Thanks for the analysis. I agree your concern is reasonable, but I
>> think the issue here is slightly different from "accounting se twice":
>> it is a semantic mismatch in how place_entity() is used.
>>
>> place_entity() is meant to compensate lag for entities being inserted
>> into the runqueue, accounting for the effect of a new entity on the
>> weighted average vruntime. That assumption holds when an se is joining
>> the rq. However, when se == cfs_rq->curr, the entity never left the
>> runqueue and avg_vruntime() has not changed, so applying enqueue-style
>> lag scaling is not appropriate.
> I believe the intention is to discount the contribution of the
> task and then re-account it after the reweight. I don't think
> se being the "curr" makes it any different except for the fact that
> its vruntime and load contribution isn't reflected in the sum and
> is added in by avg_vruntime().
>
>>> I'm wondering if instead of adding a flag, we can do:
>> Yes, I totally agree that adding a new flag is unnecessary. We
>> can handle this directly in place_entity() by skipping lag scaling in
>> case of `se == cfs_rq->curr`, for example:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index da46c3164537..1b279bf43f38 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5123,6 +5123,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>
>>                  lag = se->vlag;
>>
>> +              /*
>> +               * place_entity() compensates lag for entities being inserted into the
>> +               * runqueue. When se == cfs_rq->curr, the entity never left the rq and
>> +               * avg_vruntime() did not change, so enqueue-style lag scaling does not
>> +               * apply.
>> +               */
>> +              if (se == cfs_rq->curr)
>> +                      goto skip_lag_scale;
> This affects the place_entity() from enqueue_task() where se is dequeued
> (se->on_rq == 0) but it is still the curr - can happen when rq drops
> lock for newidle balance and a concurrent wakeup is queuing the task.
>
> You need to check for "se == cfs_rq->curr && se->on_rq" here and then
> this bit should be good.
>
> Let me stare more at the avg_vruntime() since "curr->on_rq" would add
> its vruntime too in that calculation.
>
>> +
>>                  /*
>>                   * If we want to place a task and preserve lag, we have to
>>                   * consider the effect of the new entity on the weighted
>> @@ -5185,6 +5194,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>                  lag = div_s64(lag, load);
>>          }
>>
>> +skip_lag_scale:
>>          se->vruntime = vruntime - lag;
>>
>>          if (se->rel_deadline) {
>>
>> Best regards,
>> Zicheng

Hi Prateek,

Thanks for the careful review and for pointing out the concurrent
enqueue case where se can still be curr while se->on_rq == 0. You're
right that skipping lag scaling purely based on se == cfs_rq->curr would
incorrectly affect the enqueue_task() path under newidle balance, and
the condition needs to be refined.

I agree that the correct distinction is whether the entity actually left
the runqueue. Updating the condition to:

     se == cfs_rq->curr && se->on_rq

accurately limits the skip to the reweight case, while preserving the
intended enqueue semantics when the entity was dequeued and reinserted.

On the avg_vruntime() / invariance point, the reasoning I'm relying on
is that a "pure" dequeue → enqueue cycle (with no execution progress and
no other state changes) should not change an entity's relative position
or the avg_vruntime() of the cfs_rq.

Let V denote avg_vruntime() of the cfs_rq and let l denote se->vlag,
where l = V - v (v is se->vruntime). Consider what happens if an entity
is removed and then reinserted without vlag compensating for the change
in V.

   * Case A: l > 0 (entity is lagging; it "pulls" the weighted average
     backwards)
     - Before dequeue: v1 = V1 - l, where l > 0
     - Dequeue removes this lagging entity, causing the weighted average
       to advance: V2 > V1
     - If we re-enqueue with the lag value l unchanged (i.e.,
       place_entity() sets v3 = V2 - l):
         v3 = V2 - l = V2 - (V1 - v1) = v1 + (V2 - V1)
       Since V2 > V1, we get v3 > v1
     - This means the entity's vruntime has artificially advanced despite
       no execution
     - After reinsertion, the new average V3 will satisfy V1 < V3 < V2
       (as we've reintroduced the lagging entity), but the entity's
       relative position has shifted forward
     - The enqueue-style lag scaling compensates by adjusting l to
       maintain the invariant that v3 = v1 and V3 = V1, preserving the
       entity's fair position

   * Case B: l < 0 (the reasoning is symmetric)

This is distinct from the reweight case: with `se == cfs_rq->curr &&
se->on_rq`, the entity never actually leaves the cfs_rq, so V does not
change "due to its absence". In that situation, applying the same
enqueue-style lag scaling modifies curr's vruntime without any dequeue/
enqueue event to compensate for, introducing artificial drift.

This drift has a direct fairness impact under EEVDF. If curr->vlag > 0,
the extra scaling inflates its positive lag and biases subsequent picks
in its favor; if curr->vlag < 0, the lag becomes more negative and the
entity becomes less likely to be scheduled. If an se's weight is
temporarily adjusted and later restored, repeated place_entity()
invocations accumulate the drift. Even ignoring the "if (curr &&
curr->on_rq)" load adjustment, the `lag *= load +
scale_load_down(se->load.weight)` step by itself is enough to move
curr's vruntime in the running case.
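
A toy loop with made-up numbers shows how quickly the spurious
enqueue-style factor compounds if nothing cancels it:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
	int64_t lag = 1000000;	/* initial positive vlag */
	int64_t W = 3072;	/* total scaled-down load, curr included */
	int64_t w = 1024;	/* curr's scaled-down weight */

	/* one spurious (W + w) / W scaling per reweight of curr */
	for (int i = 0; i < 5; i++) {
		lag = lag * (W + w) / W;
		printf("after reweight %d: vlag = %" PRId64 "\n", i + 1, lag);
	}
	return 0;
}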

I'll update the patch accordingly with the
refined condition and updated comment if you think this logic is
reasonable.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..51482186fd31 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5123,6 +5123,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 		lag = se->vlag;
 
+		/*
+		 * place_entity() compensates lag for entities being inserted into the
+		 * runqueue. When se == cfs_rq->curr, the entity never left the rq and
+		 * avg_vruntime() did not change, so enqueue-style lag scaling does not
+		 * apply.
+		 */
+		if (se == cfs_rq->curr && se->on_rq)
+			goto skip_lag_scale;
+
 		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
@@ -5185,6 +5194,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		lag = div_s64(lag, load);
 	}
 
+skip_lag_scale:
 	se->vruntime = vruntime - lag;
 
 	if (se->rel_deadline) {

Thanks again for the careful review.

Best regards,
Zicheng

Re: [PATCH] sched/fair: Fix vruntime drift by preventing double lag scaling during reweight
Posted by Zicheng Qu 1 month ago
Hi,

Just a gentle ping.

I can reproduce the same issue on the mainline 6.19.0-rc4 as well.
The observed behavior matches the problem model described in my previous patch description.

Sharing this mainline dmesg in case it's useful.

[ 1217.519433] INFO: task systemd:1 blocked for more than 606 seconds.
[ 1217.526904]       Not tainted 6.19.0-rc4-qzc-test-hungtask-reweight_entity+ #5
[ 1217.535242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 1217.544177] task:systemd         state:D stack:0     pid:1  tgid:1   ppid:0      task_flags:0x400100 flags:0x00000000
[ 1217.556312] Call trace:
[ 1217.559665]  __switch_to+0xdc/0x108 (T)
[ 1217.564401]  __schedule+0x288/0x650
[ 1217.568786]  schedule+0x30/0xa8
[ 1217.572821]  schedule_preempt_disabled+0x18/0x30
[ 1217.578324]  __mutex_lock.constprop.0+0x2fc/0xc20
[ 1217.583906]  __mutex_lock_slowpath+0x1c/0x30
[ 1217.589054]  mutex_lock+0x50/0x68
[ 1217.593247]  cgroup_kn_lock_live+0x60/0x158
[ 1217.598302]  cgroup_mkdir+0x44/0x218
[ 1217.602744]  kernfs_iop_mkdir+0x6c/0xc8
[ 1217.607444]  vfs_mkdir+0x218/0x318
[ 1217.611711]  do_mkdirat+0x198/0x200
[ 1217.616056]  __arm64_sys_mkdirat+0x38/0x58
[ 1217.621007]  invoke_syscall+0x50/0x120
[ 1217.625610]  el0_svc_common.constprop.0+0x48/0xf0
[ 1217.631157]  do_el0_svc+0x24/0x38
[ 1217.635365]  el0_svc+0x34/0x170
[ 1217.639348]  el0t_64_sync_handler+0xa0/0xe8
[ 1217.644371]  el0t_64_sync+0x190/0x198
[ 1217.649244] INFO: task systemd:1 is blocked on a mutex likely owned by task cgexec:105632.
[ 1217.658479] INFO: task kworker/0:1:11 blocked for more than 606 seconds.
[ 1217.666204]       Not tainted 6.19.0-rc4-qzc-test-hungtask-reweight_entity+ #5
[ 1217.674429] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 1217.683211] task:kworker/0:1     state:D stack:0     pid:11 tgid:11  ppid:2      task_flags:0x4208060 flags:0x00000010
[ 1217.695285] Workqueue: events vmstat_shepherd
[ 1217.700480] Call trace:
[ 1217.703767]  __switch_to+0xdc/0x108 (T)
[ 1217.708438]  __schedule+0x288/0x650
[ 1217.712768]  schedule+0x30/0xa8
[ 1217.716750]  percpu_rwsem_wait+0xdc/0x208
[ 1217.721592]  __percpu_down_read+0x64/0x110

Thanks,
Zicheng

On 12/26/2025 8:17 AM, Zicheng Qu wrote:
> In reweight_entity(), when reweighting a currently running entity (se ==
> cfs_rq->curr), the entity remains on the runqueue context without
> undergoing a full dequeue/enqueue cycle. This means avg_vruntime()
> remains constant throughout the reweight operation.
>
> However, the current implementation calls place_entity(..., 0) at the
> end of reweight_entity(). Under EEVDF, place_entity() is designed to
> handle entities entering the runqueue and calculates the virtual lag
> (vlag) to account for the change in the weighted average vruntime (V)
> using the formula:
>
> 	vlag' = vlag * (W + w_i) / W
>
> Where 'W' is the current aggregate weight (including
> cfs_rq->curr->load.weight) and 'w_i' is the weight of the entity being
> enqueued (in this case, the se is exactly the cfs_rq->curr).
>
> This leads to a "double scaling" logic for running entities:
> 1. reweight_entity() already rescales se->vlag based on the new weight
>     ratio.
> 2. place_entity() then mistakenly applies the (W + w_i)/W scaling again,
>     treating the reweight as a fresh enqueue into a new total weight
> pool.
>
> This can cause the entity's vlag to be amplified (if positive) or
> suppressed (if negative) incorrectly during the reweight process.
>
> In environments with frequent cgroup throttle/unthrottle operations,
> this math error manifests as a vruntime drift.
>
> A hungtask was observed as below:
> crash> runq -c 0 -g
> CPU 0
>    CURRENT: PID: 330440  TASK: ffff00004cd61540  COMMAND: "stress-ng"
>    ROOT_TASK_GROUP: ffff8001025fa4c0  RT_RQ: ffff0000fff42500
> 	 [no tasks queued]
>    ROOT_TASK_GROUP: ffff8001025fa4c0  CFS_RQ: ffff0000fff422c0
> 	 TASK_GROUP: ffff0000c130fc00  CFS_RQ: ffff00009125a400  <test_cg>	cfs_bandwidth: period=100000000, quota=18446744073709551615, gse: 0xffff000091258c00, vruntime=127285708384434, deadline=127285714880550, vlag=11721467, weight=338965, my_q=ffff00009125a400, cfs_rq: avg_vruntime=0, zero_vruntime=2029704519792, avg_load=0, nr_running=1
> 		TASK_GROUP: ffff0000d7cc8800  CFS_RQ: ffff0000c8f86800  <test_test329274_1>	cfs_bandwidth: period=14000000, quota=14000000, gse: 0xffff0000c8f86400, vruntime=2034894470719, deadline=2034898697770, vlag=0, weight=215291, my_q=ffff0000c8f86800, cfs_rq: avg_vruntime=-422528991, zero_vruntime=8444226681954, avg_load=54, nr_running=19
> 		   [110] PID: 330440  TASK: ffff00004cd61540  COMMAND: "stress-ng" [CURRENT]    vruntime=8444367524951, deadline=8444932411139, vlag=8444932411139, weight=3072, last_arrival=4002964107010, last_queued=0, exec_start=3872860294100, sum_exec_runtime=22252021900
> 		   ...
> 		   [110] PID: 330291  TASK: ffff0000c02c9540  COMMAND: "stress-ng"	vruntime=8444229273009, deadline=8444946073008, vlag=-2701415, weight=3072, last_arrival=4002964076840, last_queued=4002964550990, exec_start=3872859839290, sum_exec_runtime=22310951770
> 	 [100] PID: 97     TASK: ffff0000c2432a00  COMMAND: "kworker/0:1H"	vruntime=127285720095197, deadline=127285720119423, vlag=48453, weight=90891264, last_arrival=3846600432710, last_queued=3846600721010, exec_start=3743307237970, sum_exec_runtime=413405210
> 	 [120] PID: 15     TASK: ffff0000c0368080  COMMAND: "ksoftirqd/0"	vruntime=127285722433404, deadline=127285724533404, vlag=0, weight=1048576, last_arrival=3506755665780, last_queued=3506852159390, exec_start=3461615726670, sum_exec_runtime=16341041340
> 	 [120] PID: 50173  TASK: ffff0000741d8080  COMMAND: "kworker/0:0"	vruntime=127285722960040, deadline=127285725060040, vlag=-414755, weight=1048576, last_arrival=3506828139580, last_queued=3506972354700, exec_start=3461676584440, sum_exec_runtime=84414080
> 	 [120] PID: 58662  TASK: ffff000091180080  COMMAND: "kworker/0:2"	vruntime=127285723428168, deadline=127285725528168, vlag=3049158, weight=1048576, last_arrival=3505689085070, last_queued=3506848131990, exec_start=3460592328510, sum_exec_runtime=89193000
>
> TASK 1 (systemd) is waiting for cgroup_mutex.
> TASK 329296 (sh) holds cgroup_mutex and is waiting for cpus_read_lock.
> TASK 50173 (kworker/0:0) holds the cpus_read_lock, but fails to be
> scheduled.
> test_cg and TASK 97 may have suppressed TASK 50173, causing
> it to not be scheduled for a long time, thus failing to release locks in
> a timely manner and ultimately causing a hungtask issue.
>
> Fix by adding ENQUEUE_REWEIGHT_CURR flag and skipping vlag recalculation
> in place_entity() when reweighting the current running entity. For
> non-current entities, the existing logic remains as dequeue/enqueue
> changes avg_vruntime().
>
> Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
> ---
>   kernel/sched/fair.c  | 11 ++++++++++-
>   kernel/sched/sched.h |  1 +
>   2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da46c3164537..3be42729049e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3787,7 +3787,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>   
>   	enqueue_load_avg(cfs_rq, se);
>   	if (se->on_rq) {
> -		place_entity(cfs_rq, se, 0);
> +		place_entity(cfs_rq, se, curr ? ENQUEUE_REWEIGHT_CURR : 0);
>   		update_load_add(&cfs_rq->load, se->load.weight);
>   		if (!curr)
>   			__enqueue_entity(cfs_rq, se);
> @@ -5123,6 +5123,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>   
>   		lag = se->vlag;
>   
> +		/*
> +		 * ENQUEUE_REWEIGHT_CURR:
> +		 * current running se (cfs_rq->curr) should skip vlag recalculation,
> +		 * because avg_vruntime(...) hasn't changed.
> +		 */
> +		if (flags & ENQUEUE_REWEIGHT_CURR)
> +			goto skip_lag_scale;
> +
>   		/*
>   		 * If we want to place a task and preserve lag, we have to
>   		 * consider the effect of the new entity on the weighted
> @@ -5185,6 +5193,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>   		lag = div_s64(lag, load);
>   	}
>   
> +skip_lag_scale:
>   	se->vruntime = vruntime - lag;
>   
>   	if (se->rel_deadline) {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d30cca6870f5..e3a43f94dd2f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2412,6 +2412,7 @@ extern const u32		sched_prio_to_wmult[40];
>   #define ENQUEUE_MIGRATED	0x00040000
>   #define ENQUEUE_INITIAL		0x00080000
>   #define ENQUEUE_RQ_SELECTED	0x00100000
> +#define ENQUEUE_REWEIGHT_CURR	0x00200000
>   
>   #define RETRY_TASK		((void *)-1UL)
>