From: Zicheng Qu
Subject: [PATCH] sched/fair: Fix vruntime drift by preventing double lag scaling during reweight
Date: Fri, 26 Dec 2025 00:17:31 +0000
Message-ID: <20251226001731.3730586-1-quzicheng@huawei.com>
X-Mailer: git-send-email 2.34.1
X-Mailing-List: linux-kernel@vger.kernel.org

In reweight_entity(), when reweighting the currently running entity
(se == cfs_rq->curr), the entity stays on the runqueue and does not go
through a full dequeue/enqueue cycle. This means avg_vruntime() remains
constant throughout the reweight operation.

However, the current implementation calls place_entity(..., 0) at the
end of reweight_entity(). Under EEVDF, place_entity() is designed to
handle entities entering the runqueue and calculates the virtual lag
(vlag) to account for the change in the weighted average vruntime (V)
using the formula:

	vlag' = vlag * (W + w_i) / W

where 'W' is the current aggregate weight (including
cfs_rq->curr->load.weight) and 'w_i' is the weight of the entity being
enqueued (in this case, the se is exactly cfs_rq->curr).

This leads to "double scaling" of the lag for running entities:

1. reweight_entity() already rescales se->vlag based on the new weight
   ratio.
2. place_entity() then mistakenly applies the (W + w_i)/W scaling again,
   treating the reweight as a fresh enqueue into a new total weight pool.

This can incorrectly amplify the entity's vlag (if positive) or suppress
it (if negative) during the reweight process. In environments with
frequent cgroup throttle/unthrottle operations, this math error
manifests as a vruntime drift.
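
To make the two steps above concrete, here is a minimal userspace sketch in C. It is not the kernel code: the helper names sketch_rescale_vlag() and sketch_place_scale() are hypothetical, the old/new weight ratio used for step 1 is an assumption about how the rescale is done, and all numbers are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of step 1: rescale the lag when the weight changes.
 * Assumption: the lag is scaled by old_weight / new_weight.
 */
int64_t sketch_rescale_vlag(int64_t vlag, int64_t old_w, int64_t new_w)
{
	assert(new_w > 0);
	return vlag * old_w / new_w;
}

/*
 * Hypothetical model of step 2: the enqueue-time adjustment from the
 * formula in the text, vlag' = vlag * (W + w_i) / W.
 */
int64_t sketch_place_scale(int64_t vlag, int64_t W, int64_t w_i)
{
	assert(W > 0);
	return vlag * (W + w_i) / W;
}

/*
 * Apply both steps back to back, as happens for a running entity when
 * the enqueue-time adjustment is not skipped.
 */
int64_t sketch_double_scaled(int64_t vlag, int64_t old_w, int64_t new_w,
			     int64_t W)
{
	int64_t once = sketch_rescale_vlag(vlag, old_w, new_w);

	return sketch_place_scale(once, W, new_w);
}
```

With vlag = 1000, old weight 1024, new weight 2048 and W = 4096, step 1 alone gives 500, while applying step 2 on top gives 750. The extra factor (W + w_i)/W is the part the text describes as applied by mistake.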

A hungtask was observed as below:

crash> runq -c 0 -g
CPU 0
  CURRENT: PID: 330440 TASK: ffff00004cd61540 COMMAND: "stress-ng"
  ROOT_TASK_GROUP: ffff8001025fa4c0 RT_RQ: ffff0000fff42500
    [no tasks queued]
  ROOT_TASK_GROUP: ffff8001025fa4c0 CFS_RQ: ffff0000fff422c0
    TASK_GROUP: ffff0000c130fc00 CFS_RQ: ffff00009125a400 cfs_bandwidth: period=100000000, quota=18446744073709551615, gse: 0xffff000091258c00, vruntime=127285708384434, deadline=127285714880550, vlag=11721467, weight=338965, my_q=ffff00009125a400, cfs_rq: avg_vruntime=0, zero_vruntime=2029704519792, avg_load=0, nr_running=1
      TASK_GROUP: ffff0000d7cc8800 CFS_RQ: ffff0000c8f86800 cfs_bandwidth: period=14000000, quota=14000000, gse: 0xffff0000c8f86400, vruntime=2034894470719, deadline=2034898697770, vlag=0, weight=215291, my_q=ffff0000c8f86800, cfs_rq: avg_vruntime=-422528991, zero_vruntime=8444226681954, avg_load=54, nr_running=19
        [110] PID: 330440 TASK: ffff00004cd61540 COMMAND: "stress-ng" [CURRENT] vruntime=8444367524951, deadline=8444932411139, vlag=8444932411139, weight=3072, last_arrival=4002964107010, last_queued=0, exec_start=3872860294100, sum_exec_runtime=22252021900
        ...
        [110] PID: 330291 TASK: ffff0000c02c9540 COMMAND: "stress-ng" vruntime=8444229273009, deadline=8444946073008, vlag=-2701415, weight=3072, last_arrival=4002964076840, last_queued=4002964550990, exec_start=3872859839290, sum_exec_runtime=22310951770
    [100] PID: 97 TASK: ffff0000c2432a00 COMMAND: "kworker/0:1H" vruntime=127285720095197, deadline=127285720119423, vlag=48453, weight=90891264, last_arrival=3846600432710, last_queued=3846600721010, exec_start=3743307237970, sum_exec_runtime=413405210
    [120] PID: 15 TASK: ffff0000c0368080 COMMAND: "ksoftirqd/0" vruntime=127285722433404, deadline=127285724533404, vlag=0, weight=1048576, last_arrival=3506755665780, last_queued=3506852159390, exec_start=3461615726670, sum_exec_runtime=16341041340
    [120] PID: 50173 TASK: ffff0000741d8080 COMMAND: "kworker/0:0" vruntime=127285722960040, deadline=127285725060040, vlag=-414755, weight=1048576, last_arrival=3506828139580, last_queued=3506972354700, exec_start=3461676584440, sum_exec_runtime=84414080
    [120] PID: 58662 TASK: ffff000091180080 COMMAND: "kworker/0:2" vruntime=127285723428168, deadline=127285725528168, vlag=3049158, weight=1048576, last_arrival=3505689085070, last_queued=3506848131990, exec_start=3460592328510, sum_exec_runtime=89193000

TASK 1 (systemd) is waiting for cgroup_mutex.
TASK 329296 (sh) holds cgroup_mutex and is waiting for cpus_read_lock.
TASK 50173 (kworker/0:0) holds cpus_read_lock but fails to be scheduled.
test_cg and TASK 97 may have suppressed TASK 50173, so it was not
scheduled for a long time, could not release its locks in a timely
manner, and ultimately caused a hungtask issue.

Fix this by adding an ENQUEUE_REWEIGHT_CURR flag and skipping the vlag
recalculation in place_entity() when reweighting the currently running
entity. For non-current entities, the existing logic is kept, because
the dequeue/enqueue cycle does change avg_vruntime().
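
The shape of the fix can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the kernel implementation: struct sketch_se, sketch_place() and SKETCH_REWEIGHT_CURR are hypothetical stand-ins for the sched_entity, place_entity() and ENQUEUE_REWEIGHT_CURR names, and the placement is reduced to "vruntime = V - lag".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the ENQUEUE_REWEIGHT_CURR flag. */
#define SKETCH_REWEIGHT_CURR 0x1

/* Hypothetical, heavily reduced stand-in for a scheduling entity. */
struct sketch_se {
	int64_t vruntime;
	int64_t vlag;
	int64_t weight;
};

/*
 * Simplified placement: scale the lag by (load + weight) / load unless
 * the caller signals that the running entity is only being reweighted,
 * in which case the already rescaled lag is used unchanged.
 */
void sketch_place(struct sketch_se *se, int64_t avg_vruntime,
		  int64_t load, int flags)
{
	int64_t lag = se->vlag;

	assert(load > 0);
	if (!(flags & SKETCH_REWEIGHT_CURR))
		lag = lag * (load + se->weight) / load;

	se->vruntime = avg_vruntime - lag;
}
```

With vlag = 500, weight = 2048, load = 4096 and an average vruntime of 10000 (all made-up values), a call without the flag places the entity at 9250, while a call with the flag places it at 9500, so the lag is not scaled a second time.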

Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Signed-off-by: Zicheng Qu
---
 kernel/sched/fair.c  | 11 ++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..3be42729049e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3787,7 +3787,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, curr ? ENQUEUE_REWEIGHT_CURR : 0);
 		update_load_add(&cfs_rq->load, se->load.weight);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
@@ -5123,6 +5123,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 		lag = se->vlag;
 
+		/*
+		 * ENQUEUE_REWEIGHT_CURR:
+		 * current running se (cfs_rq->curr) should skip vlag recalculation,
+		 * because avg_vruntime(...) hasn't changed.
+		 */
+		if (flags & ENQUEUE_REWEIGHT_CURR)
+			goto skip_lag_scale;
+
 		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
@@ -5185,6 +5193,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		lag = div_s64(lag, load);
 	}
 
+skip_lag_scale:
 	se->vruntime = vruntime - lag;
 
 	if (se->rel_deadline) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..e3a43f94dd2f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2412,6 +2412,7 @@ extern const u32 sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATED	0x00040000
 #define ENQUEUE_INITIAL		0x00080000
 #define ENQUEUE_RQ_SELECTED	0x00100000
+#define ENQUEUE_REWEIGHT_CURR	0x00200000
 
 #define RETRY_TASK		((void *)-1UL)
 
-- 
2.34.1