From: wangtao
To: linux-kernel@vger.kernel.org
Subject: [PATCH v5] sched/fair: Add fair placement lag
Date: Mon, 22 Dec 2025 17:57:00 +0800
Message-ID: <20251222095700.6598-1-tao.wangtao@honor.com>

Introduce a fair placement lag mechanism that tracks join/leave lag in
sum_jlag and its weighted average J, keeping the virtual time V aligned
with the EEVDF fluid schedule.

Assume the current virtual time V is V0, all weights are 1, and three
tasks join with lags 0, 36 and -12. Consider the current preserve_lag
handling:

	vlag_i = (W + w_i) * vlag_i' / W
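For a quick cross-check, the three joins can be replayed in a small
userspace C program (a sketch only: V0 is taken as 0, all weights are 1,
and the v = V - vlag placement and the vlag_in values are illustrative
simplifications, not the kernel code). It reproduces the table below:

	#include <stdio.h>

	/*
	 * Replay preserve_lag joins: for a non-empty queue,
	 * vlag = (W + w) * vlag' / W and v = V - vlag (w = 1);
	 * the first join into an empty queue keeps its lag as-is
	 * (it is 0 here anyway). V is the mean of all v_i.
	 */
	int main(void)
	{
		double V = 0.0, W = 0.0, sum_v = 0.0;		/* V0 == 0 */
		double vlag_in[3] = { 0.0, 36.0, -12.0 };	/* T1..T3 */

		for (int i = 0; i < 3; i++) {
			double vlag = W > 0 ? (W + 1.0) * vlag_in[i] / W
					    : vlag_in[i];
			double v = V - vlag;

			W += 1.0;
			sum_v += v;
			V = sum_v / W;
			printf("T%d join: vlag=%6.1f v=V0%+5.0f V=V0%+5.0f\n",
			       i + 1, vlag, v, V);
		}
		return 0;
	}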
The following example illustrates the problem:

	event   | vlag'| vlag |   v   | W |   V
	------------------------------------------
	T1 join |   0  |   0  |  V0   | 1 |  V0
	T2 join |  36  |  72  | V0-72 | 2 | V0-36
	T3 join | -12  | -18  | V0-18 | 3 | V0-30

Because V moves backwards after T2 joins, even though lag_T3 < lag_T1 = 0,
we still have v_T2 < v_T3 < v_T1. Therefore the scheduling order is
T2, T3, T1, i.e. T3 with a negative lag is scheduled before T1 with zero
lag. A similar issue exists even without preserve_lag: if the tasks are
added in the order T3, T2, T1, the scheduling order becomes T2, T1, T3,
so the result depends on the join order, i.e. placement is unstable.

We track the cumulative join/leave lag compensation (jlag) in sum_jlag,
and its weighted average over all runnable tasks in J. We choose jlag on
joins and leaves so that V + J remains invariant, improving stability and
fairness. J can be shown to be bounded by a constant.

Example run with fair placement lag:

	time | event      |  W |sum_jlag|  J  |  V+J   |   v    |   V    | lag
	-----------------------------------------------------------------------
	t    | T1 join-   |  0 |    0   |   0 |   V0   |   NA   |   V0   |   0
	t    | T1 join+   |> 1 |>   0   |>  0 |   V0   |>  V0   |>  V0   |>  0
	t    | T2 join-   |  1 |    0   |   0 |   V0   |   NA   |   V0   |  36
	t    | T2 join+   |> 2 |>  36   |> 18 |   V0   |> V0-36 |> V0-18 |> 18
	t    | T3 join-   |  2 |   36   |  18 |   V0   |   NA   |  V0-18 | -12
	t    | T3 join+   |> 3 |>  24   |>  8 |   V0   |> V0+12 |>  V0-8 |> -20
	t    | pick T2    |  3 |   24   |   8 |   V0   |  V0-36 |   V0-8 |> 28
	t+24 | T2 leave-  |  3 |   24   |   8 |>  V0+8 |> V0-12 |>  V0   |> 12
	t+24 | T2 leave+  |> 2 |>   4   |>  2 |   V0+8 |>  NA   |>  V0+6 |  12
	t+24 | pick T1    |  2 |    4   |   2 |   V0+8 |   V0   |   V0+6 |>  6
	t+40 | T1 leave-  |  2 |    4   |   2 |> V0+16 |> V0+16 |> V0+14 |> -2
	t+40 | T1 leave+  |> 1 |>   4   |>  4 |  V0+16 |>  NA   |> V0+12 |  -2
	t+40 | T2 rejoin- |  1 |    4   |   4 |  V0+16 |   NA   |  V0+12 |  12
	t+40 | T2 rejoin+ |> 2 |>  16   |>  8 |  V0+16 |>  V0+4 |>  V0+8 |>  4
	t+40 | pick T2    |  2 |   16   |   8 |  V0+16 |   V0+4 |   V0+8 |>  4
	t+52 | T2 leave-  |  2 |   16   |   8 |> V0+22 |> V0+16 |> V0+14 |> -2
	t+52 | T2 leave+  |> 1 |>  10   |> 10 |  V0+22 |>  NA   |> V0+12 |  -2
	t+52 | pick T3    |  1 |   10   |  10 |  V0+22 |  V0+12 |  V0+12 |>  0
	t+60 | T3 leave-  |  1 |   10   |  10 |> V0+30 |> V0+20 |> V0+20 |>  0
	t+60 | T3 leave+  |> 0 |>   0   |>  0*| V0+20* |   NA   |> V0+20 |   0

Notes:
  '>'  indicates that the value needs to be recomputed.
  'NA' means the value is not defined for that event.
  '*'  the queue becomes empty, J = sum_jlag / W is undefined (W = 0),
       and we set J = 0 by convention.
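The bookkeeping behind this table can be cross-checked with a small
userspace model (a sketch under simplifying assumptions: unit weights,
floating point instead of the kernel's s64 fixed-point arithmetic; the
struct rq model and the join()/leave() helpers are invented for the
illustration, not kernel API). It replays the three joins and the first
leave:

	#include <stdio.h>

	struct rq { double W, sum_v, sum_jlag; };

	static double V(struct rq *rq) { return rq->sum_v / rq->W; }
	static double J(struct rq *rq)
	{
		return rq->W > 0 ? rq->sum_jlag / rq->W : 0.0;
	}

	/* Join with w = 1: jlag_j = last_leave_lag, v_j = J + V - jlag_j. */
	static double join(struct rq *rq, double last_leave_lag)
	{
		double v = rq->W > 0 ? J(rq) + V(rq) - last_leave_lag
				     : -last_leave_lag;	/* V0 == 0 */

		rq->sum_jlag += last_leave_lag;
		rq->W += 1.0;
		rq->sum_v += v;
		return v;
	}

	/* Leave with w = 1: jlag_l = J * w + lag, where lag = V - v. */
	static void leave(struct rq *rq, double v)
	{
		rq->sum_jlag -= J(rq) + (V(rq) - v);
		rq->W -= 1.0;
		rq->sum_v -= v;
		if (rq->W == 0)
			rq->sum_jlag = 0.0;	/* empty: J = 0 by convention */
	}

	int main(void)
	{
		struct rq rq = { 0.0, 0.0, 0.0 };
		double v2;

		join(&rq, 0);		/* T1 */
		v2 = join(&rq, 36);	/* T2 */
		join(&rq, -12);		/* T3 */
		/* T2 runs for 24: its v, and hence sum_v, advance by 24. */
		v2 += 24.0;
		rq.sum_v += 24.0;
		leave(&rq, v2);
		printf("V=V0%+.0f J=%.0f V+J=V0%+.0f\n",
		       V(&rq), J(&rq), V(&rq) + J(&rq));	/* V0+6, 2, V0+8 */
		return 0;
	}

Each intermediate state matches the corresponding table row: V + J is
unchanged by every join (V0) and by the leave itself (V0+8); it only
advances while T2 runs.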
Hackbench tests show that this patch reduces execution time (times in
seconds, lower is better):

	hackbench test          base    fair_place_lag  change
	-------------------------------------------------------
	process 1 group:        0.119       0.107       -10.6%
	process 10 group:       0.869       0.767       -11.7%
	process 20 group:       1.840       1.579       -14.1%
	thread 1 group:         0.089       0.074       -17.2%
	thread 10 group:        0.555       0.480       -13.5%
	thread 20 group:        1.099       1.022        -7.0%
	pipe process 1 group:   0.126       0.084       -33.5%
	pipe process 10 group:  0.810       0.673       -17.0%
	pipe process 20 group:  1.625       1.314       -19.1%
	pipe thread 1 group:    0.092       0.077       -16.7%
	pipe thread 10 group:   0.503       0.465        -7.6%
	pipe thread 20 group:   0.947       0.906        -4.4%

Signed-off-by: wangtao
---
 kernel/sched/debug.c    |   2 +
 kernel/sched/fair.c     | 194 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/features.h |   5 ++
 kernel/sched/sched.h    |   2 +
 4 files changed, 201 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..a4fd11c5ca2d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -830,6 +830,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			SPLIT_NS(zero_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
 			SPLIT_NS(avg_vruntime(cfs_rq)));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vjlag",
+			SPLIT_NS(avg_vjlag(cfs_rq)));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
 			SPLIT_NS(right_vruntime));
 	spread = right_vruntime - left_vruntime;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..a775b4361b1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -593,7 +593,160 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  *	V +-= lag_i / W
  *
- * Also see the comment in place_entity() that deals with this. ]]
+ * If a joining or leaving task j has non-zero lag, V will change. We
+ * define sum_jlag as the sum of lag compensations at task joins/leaves,
+ * and J as the weighted average:
+ *
+ *	J = sum_jlag / W
+ *
+ * By construction, we choose jlag at joins/leaves so that V + J is
+ * invariant while there are tasks in the queue.
+ *
+ * --------------------------------------------------------------------
+ * 1) Task l leaving
+ *
+ * Before task l leaves we use the suffix "_l"; after it leaves we use a
+ * prime ('). Given v_l or lag_l, we can compute jlag_l.
+ *
+ *	W' = W_l - w_l
+ *	V' = (V_l * W_l - v_l * w_l) / W'
+ *	   = (V_l * W' + V_l * w_l - v_l * w_l) / W'
+ *	   = V_l + (V_l - v_l) * w_l / W'
+ *	   = V_l + lag_l / W'		// lag_l = (V_l - v_l) * w_l
+ *
+ * For J:
+ *
+ *	sum_jlag' = sum_jlag_l - jlag_l = J_l * W_l - jlag_l
+ *	J' = sum_jlag' / W'
+ *	   = (J_l * W_l - jlag_l) / W'
+ *	   = (J_l * W' + J_l * w_l - jlag_l) / W'
+ *	   = J_l + (J_l * w_l - jlag_l) / W'
+ *
+ * Enforcing V' + J' = V_l + J_l gives:
+ *
+ *	jlag_l = (J_l + V_l - v_l) * w_l = J_l * w_l + lag_l
+ *
+ * --------------------------------------------------------------------
+ * 2) Task j joining
+ *
+ * Before task j joins we use V, W, J; after joining we use the suffix
+ * "_j":
+ *
+ *	W_j = W + w_j
+ *	V_j = (V * W + v_j * w_j) / W_j
+ *	    = (V * W_j - V * w_j + v_j * w_j) / W_j
+ *	    = V - (V - v_j) * w_j / W_j
+ *
+ * For J:
+ *
+ *	sum_jlag_j = sum_jlag + jlag_j = J * W + jlag_j
+ *	J_j = sum_jlag_j / W_j
+ *	    = (J * W + jlag_j) / W_j
+ *	    = (J * W_j - J * w_j + jlag_j) / W_j
+ *	    = J - (J * w_j - jlag_j) / W_j
+ *
+ * Enforcing V_j + J_j = V + J gives:
+ *
+ *	jlag_j = (J + V - v_j) * w_j
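+ *
+ * As a sanity check, plug the "Example run" numbers from the changelog
+ * into both results (all weights 1; an illustration, not part of the
+ * code): T3 joins with jlag_j = -12 at V = V0-18, J = 18, so
+ * v_j = J + V - jlag_j / w_j = V0+12; T2 leaves at t+24 with V_l = V0,
+ * J_l = 8, lag_l = 12, so jlag_l = 8 + 12 = 20, sum_jlag' = 24 - 20 = 4,
+ * J' = 4 / 2 = 2 and V' = V0 + 12 / 2 = V0+6, keeping
+ * V' + J' = V0+8 = V_l + J_l across both operations.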
+ *
+ * When task j joins, v_j, lag_j and jlag_j are not yet defined. Given
+ * V, J and last_leave_lag_j for j, if we set
+ *
+ *	jlag_j = last_leave_lag_j
+ *
+ * then
+ *
+ *	v_j = J + V - jlag_j / w_j
+ *	    = V - (last_leave_lag_j / w_j - J)
+ *
+ * We only use jlag_j to compute v_j; it does not affect EEVDF's
+ * eligibility decision. Therefore we could also choose jlag_j to be
+ * last_leave_lag_j / 2, 2 * last_leave_lag_j, or any value with a
+ * uniform constant bound.
+ *
+ * --------------------------------------------------------------------
+ * 3) Meaning of sum_jlag and J
+ *
+ * sum_jlag tracks the aggregate join/leave service lag ("jlag")
+ * relative to the EEVDF fluid schedule, i.e. service that future
+ * scheduling must preferentially repay or reclaim:
+ *
+ *  - sum_jlag > 0: positive jlag (we owe service; joined behind fluid);
+ *  - sum_jlag = 0: no outstanding jlag;
+ *  - sum_jlag < 0: negative jlag (we gave extra service; joined ahead).
+ *
+ * sum_jlag is thus the outstanding join/leave lag, repaid as tasks
+ * leave, which keeps V + J aligned with the fluid schedule.
+ *
+ * --------------------------------------------------------------------
+ * 4) Recursive relations for sum_jlag
+ *
+ *	sum_jlag(nj + 1, nl) = sum_jlag(nj, nl) + last_leave_lag_j
+ *	sum_jlag(nj, nl + 1) = sum_jlag(nj, nl) * (1 - w_l / W) - lag_l
+ *	J(nj, nl) = sum_jlag(nj, nl) / W
+ *
+ * Here nj is the number of joins and nl the number of leaves, with
+ *
+ *	nj >= nl
+ *	sum_jlag(0, 0) = 0
+ *
+ * --------------------------------------------------------------------
+ * 5) Boundedness of J
+ *
+ * Let nr = nj - nl be the number of tasks currently in the queue. If
+ * nr = 1, the only task l leaves with lag_l = 0 and w_l = W, so
+ * sum_jlag becomes 0 and J = 0.
+ *
+ * For nr > 1, assume for all tasks i in the queue:
+ *
+ *	1 < W_MIN <= w_i < W <= W_MAX
+ *	nr * W_MIN <= W <= W_MAX
+ *	0 < W_MIN / W_MAX <= 1 - w_i / W <= a < 1
+ *
+ * and lag bounds
+ *
+ *	|last_leave_lag_j| <= q, |lag_l| <= q
+ *
+ * for some global constant q. A join (nj-increase) then gives
+ *
+ *	|sum_jlag| <= nr * q, W >= nr * W_MIN  =>  |J| <= q / W_MIN
+ *
+ * For a leave (nl-increase), the recurrence
+ *
+ *	sum_jlag(nj, nl + 1) = sum_jlag(nj, nl) * (1 - w_l / W) - lag_l
+ *
+ * has the standard form
+ *
+ *	x_{k+1} = a_k * x_k + d_k
+ *
+ * with
+ *
+ *	a_k = 1 - w_l / W in (0, 1), |d_k| <= q
+ *
+ * In such a system, with contraction factor |a_k| < 1 and bounded
+ * disturbance (BIBS/ISS-type), x_k and J = x_k / W remain uniformly
+ * bounded. In particular, there exists a constant C1, independent of
+ * nj, nl and nr, such that after k leaves
+ *
+ *	|sum_jlag(nj, nl)| = |sum_jlag(nl0 + nr, nl0 + k)|
+ *	                  <= a^k * nr * q + (1 - a^k) * q / (1 - a)
+ *
+ * The number of remaining tasks is nr - k, so
+ *
+ *	W >= (nr - k) * W_MIN
+ *
+ * Therefore
+ *
+ *	|J| = |sum_jlag(nj, nl) / W|
+ *	   <= q * (nr * a^k + (1 - a^k) / (1 - a)) / ((nr - k) * W_MIN)
+ *
+ * Hence
+ *
+ *	|J| <= (q / W_MIN) * C1
+ *
+ * where C1 can be chosen as
+ *
+ *	C1 = 1 + 1 / (1 - a) + 1 / (e * ln(1 / a))
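+ *
+ * (For a concrete feel, as an illustration only since the constant is
+ *  not used by the code: if every queued task holds at least a tenth of
+ *  the total weight, then a = 9/10 and
+ *  C1 = 1 + 10 + 1/(e * ln(10/9)) ~= 14.5, giving |J| <= 14.5 * q / W_MIN.)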
+ *
+ * Thus J has a uniform constant upper bound. ]]
  *
  * However, since v_i is u64, and the multiplication could easily overflow
  * transform it into a relative form that uses smaller quantities:
@@ -638,6 +791,36 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	cfs_rq->avg_load -= weight;
 }
 
+/* Average join/leave lag debt J = sum_jlag / W for this cfs_rq. */
+s64 avg_vjlag(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	long load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq)
+		load += scale_load_down(curr->load.weight);
+
+	return load ? div_s64(cfs_rq->sum_jlag, load) : 0;
+}
+
+/* Account the join lag contribution of a newly enqueued entity. */
+static void sum_jlag_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight = scale_load_down(se->load.weight);
+	s64 jlag_join = se->vlag * weight;	/* preserve vlag: vlag - J_j */
+
+	cfs_rq->sum_jlag += jlag_join;
+}
+
+/* Account the leave lag correction when an entity dequeues. */
+static void sum_jlag_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight = scale_load_down(se->load.weight);
+	s64 jlag_leave = (se->vlag + avg_vjlag(cfs_rq)) * weight;
+
+	cfs_rq->sum_jlag -= jlag_leave;
+}
+
 static inline
 void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
 {
@@ -5116,8 +5299,13 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * other tasks.
 	 *
 	 * EEVDF: placement strategy #1 / #2
+	 *
+	 * FAIR_PLACE_LAG uses avg_vjlag to keep placement fair and stable
+	 * even when tasks that join or leave have non-zero lag.
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
+	if (sched_feat(FAIR_PLACE_LAG))
+		lag = se->vlag - avg_vjlag(cfs_rq);
+	else if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
 		unsigned long load;
 
@@ -5260,6 +5448,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
+	sum_jlag_add(cfs_rq, se);
 	if (!curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
@@ -5397,6 +5586,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		se->rel_deadline = 1;
 	}
 
+	sum_jlag_sub(cfs_rq, se);
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 980d92bab8ab..81c72b572630 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -5,6 +5,11 @@
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */
 SCHED_FEAT(PLACE_LAG, true)
+/*
+ * Adjust EEVDF placement strategy #1 using avg_vjlag to keep scheduling
+ * fair and stable even when tasks that join or leave have non-zero lag.
+ */
+SCHED_FEAT(FAIR_PLACE_LAG, false)
 /*
  * Give new tasks half a slice to ease into the competition.
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..8f7eb75f9ab5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -680,6 +680,7 @@ struct cfs_rq {
 
 	s64			avg_vruntime;
 	u64			avg_load;
+	s64			sum_jlag;
 
 	u64			zero_vruntime;
 #ifdef CONFIG_SCHED_CORE
@@ -3891,6 +3892,7 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 #endif /* !CONFIG_SCHED_MM_CID */
 
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+extern s64 avg_vjlag(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 static inline
 void move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq, struct task_struct *task)
-- 
2.17.1