From nobody Thu Apr 2 20:26:35 2026
Message-ID: <20260219080624.942813440@infradead.org>
User-Agent: quilt/0.68
Date: Thu, 19 Feb 2026 08:58:45 +0100
From: Peter Zijlstra <peterz@infradead.org>
To: mingo@kernel.org
Cc: peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org,
 wangtao554@huawei.com, quzicheng@huawei.com, kprateek.nayak@amd.com,
 dsmythies@telus.net, shubhang@os.amperecomputing.com
Subject: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
References: <20260219075840.162631716@infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Due to the zero_vruntime patch, the deltas are now a lot smaller, and
measurements with kernel-build and hackbench runs show about 45 bits
used. This ensures avg_vruntime() tracks the full weight range,
reducing numerical artifacts in reweight and the like.

Also, let's keep the paranoid debug code around for now.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
---
 kernel/sched/debug.c    | 14 ++++++-
 kernel/sched/fair.c     | 91 ++++++++++++++++++++++++++++++++++++++----------
 kernel/sched/features.h |  2 +
 kernel/sched/sched.h    |  3 +
 4 files changed, 90 insertions(+), 20 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -8,6 +8,7 @@
  */
 #include 
 #include 
+#include 
 #include "sched.h"
 
 /*
@@ -901,10 +902,13 @@ static void print_rq(struct seq_file *m,
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+	s64 left_vruntime = -1, right_vruntime = -1, left_deadline = -1, spread;
+	s64 zero_vruntime = -1, sum_w_vruntime = -1;
 	struct sched_entity *last, *first, *root;
 	struct rq *rq = cpu_rq(cpu);
+	unsigned int sum_shift;
 	unsigned long flags;
+	u64 sum_weight;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "\n");
@@ -925,6 +929,9 @@ void print_cfs_rq(struct seq_file *m, in
 	if (last)
 		right_vruntime = last->vruntime;
 	zero_vruntime = cfs_rq->zero_vruntime;
+	sum_w_vruntime = cfs_rq->sum_w_vruntime;
+	sum_weight = cfs_rq->sum_weight;
+	sum_shift = cfs_rq->sum_shift;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
@@ -933,6 +940,11 @@ void print_cfs_rq(struct seq_file *m, in
 			SPLIT_NS(left_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
 			SPLIT_NS(zero_vruntime));
+	SEQ_printf(m, "  .%-30s: %Ld (%d bits)\n", "sum_w_vruntime",
+			sum_w_vruntime, ilog2(abs(sum_w_vruntime)));
+	SEQ_printf(m, "  .%-30s: %Lu\n", "sum_weight",
+			sum_weight);
+	SEQ_printf(m, "  .%-30s: %u\n", "sum_shift", sum_shift);
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
 			SPLIT_NS(avg_vruntime(cfs_rq)));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -665,15 +665,20 @@ static inline s64 entity_key(struct cfs_
  * Since zero_vruntime closely tracks the per-task service, these
  * deltas: (v_i - v0), will be in the order of the maximal (virtual) lag
  * induced in the system due to quantisation.
- *
- * Also, we use scale_load_down() to reduce the size.
- *
- * As measured, the max (key * weight) value was ~44 bits for a kernel build.
  */
-static void
-sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline unsigned long avg_vruntime_weight(struct cfs_rq *cfs_rq, unsigned long w)
+{
+#ifdef CONFIG_64BIT
+	if (cfs_rq->sum_shift)
+		w = max(2UL, w >> cfs_rq->sum_shift);
+#endif
+	return w;
+}
+
+static inline void
+__sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long weight = scale_load_down(se->load.weight);
+	unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->sum_w_vruntime += key * weight;
@@ -681,9 +686,59 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq
 }
 
 static void
+sum_w_vruntime_add_paranoid(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight;
+	s64 key, tmp;
+
+again:
+	weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+	key = entity_key(cfs_rq, se);
+
+	if (check_mul_overflow(key, weight, &key))
+		goto overflow;
+
+	if (check_add_overflow(cfs_rq->sum_w_vruntime, key, &tmp))
+		goto overflow;
+
+	cfs_rq->sum_w_vruntime = tmp;
+	cfs_rq->sum_weight += weight;
+	return;
+
+overflow:
+	/*
+	 * There's gotta be a limit -- if we're still failing at this point
+	 * there's really nothing much to be done about things.
+	 */
+	BUG_ON(cfs_rq->sum_shift >= 10);
+	cfs_rq->sum_shift++;
+
+	/*
+	 * Note: \Sum (k_i * (w_i >> 1)) != (\Sum (k_i * w_i)) >> 1
+	 */
+	cfs_rq->sum_w_vruntime = 0;
+	cfs_rq->sum_weight = 0;
+
+	for (struct rb_node *node = cfs_rq->tasks_timeline.rb_leftmost;
+	     node; node = rb_next(node))
+		__sum_w_vruntime_add(cfs_rq, __node_2_se(node));
+
+	goto again;
+}
+
+static void
+sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if (sched_feat(PARANOID_AVG))
+		return sum_w_vruntime_add_paranoid(cfs_rq, se);
+
+	__sum_w_vruntime_add(cfs_rq, se);
+}
+
+static void
 sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long weight = scale_load_down(se->load.weight);
+	unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->sum_w_vruntime -= key * weight;
@@ -725,7 +780,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		s64 runtime = cfs_rq->sum_w_vruntime;
 
 		if (curr) {
-			unsigned long w = scale_load_down(curr->load.weight);
+			unsigned long w = avg_vruntime_weight(cfs_rq, curr->load.weight);
 
 			runtime += entity_key(cfs_rq, curr) * w;
 			weight += w;
@@ -735,7 +790,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		if (runtime < 0)
 			runtime -= (weight - 1);
 
-		delta = div_s64(runtime, weight);
+		delta = div64_long(runtime, weight);
 	} else if (curr) {
 		/*
 		 * When there is but one element, it is the average.
@@ -801,7 +856,7 @@ static int vruntime_eligible(struct cfs_
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
-		unsigned long weight = scale_load_down(curr->load.weight);
+		unsigned long weight = avg_vruntime_weight(cfs_rq, curr->load.weight);
 
 		avg += entity_key(cfs_rq, curr) * weight;
 		load += weight;
@@ -3871,12 +3926,12 @@ static void reweight_entity(struct cfs_r
 	 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
 	 * we need to scale se->vlag when w_i changes.
 	 */
-	se->vlag = div_s64(se->vlag * se->load.weight, weight);
+	se->vlag = div64_long(se->vlag * se->load.weight, weight);
 	if (se->rel_deadline)
-		se->deadline = div_s64(se->deadline * se->load.weight, weight);
+		se->deadline = div64_long(se->deadline * se->load.weight, weight);
 
 	if (rel_vprot)
-		vprot = div_s64(vprot * se->load.weight, weight);
+		vprot = div64_long(vprot * se->load.weight, weight);
 
 	update_load_set(&se->load, weight);
 
@@ -5180,7 +5235,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	 */
 	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
-		unsigned long load;
+		long load;
 
 		lag = se->vlag;
 
@@ -5238,12 +5293,12 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		 */
 		load = cfs_rq->sum_weight;
 		if (curr && curr->on_rq)
-			load += scale_load_down(curr->load.weight);
+			load += avg_vruntime_weight(cfs_rq, curr->load.weight);
 
-		lag *= load + scale_load_down(se->load.weight);
+		lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
 		if (WARN_ON_ONCE(!load))
 			load = 1;
-		lag = div_s64(lag, load);
+		lag = div64_long(lag, load);
 	}
 
 	se->vruntime = vruntime - lag;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -58,6 +58,8 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
 SCHED_FEAT(DELAY_DEQUEUE, true)
 SCHED_FEAT(DELAY_ZERO, true)
 
+SCHED_FEAT(PARANOID_AVG, false)
+
 /*
  * Allow wakeup-time preemption of the current task:
  */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -684,8 +684,9 @@ struct cfs_rq {
 
 	s64			sum_w_vruntime;
	u64			sum_weight;
-
 	u64			zero_vruntime;
+	unsigned int		sum_shift;
+
 #ifdef CONFIG_SCHED_CORE
 	unsigned int		forceidle_seq;
 	u64			zero_vruntime_fi;