Date: Tue, 26 May 2026 21:31:52 -0400
From: Rik van Riel
To: Ingo Molnar
Cc: Peter Zijlstra, Juri Lelli, Jakub Kicinski, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Valentin Schneider,
    linux-kernel@vger.kernel.org
Subject: [PATCH] sched/fair: use rq_clock_task() in update_tg_load_avg() rate-limit
Message-ID: <20260526213152.445ca27c@fangorn>

update_tg_load_avg() is called once per leaf cfs_rq from the
__update_blocked_fair() walk that runs inside the NOHZ idle-balance
softirq, and again from update_load_avg() with UPDATE_TG.
Its first operation after the trivial early-outs is unconditionally:

	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;

Jakub ran into a system where nohz_idle_balance() was taking 75% of a
CPU (which is handling network traffic and doing many irq_exit_cpu
calls), with 35% of that CPU spent in update_load_avg, and 17% of the
CPU in sched_clock_cpu(), reading the TSC.

There are some optimizations upstream already to reduce that overhead,
but it also looks like those rdtsc calls may not be necessary at all,
giving another easy win.

Switch the rate-limit to read rq_clock_task(rq_of(cfs_rq)) instead.
This does two things:

1. Eliminates the rdtsc. rq->clock_task is already updated by the
   enclosing update_rq_clock(rq), sits in a hot cacheline, and reads
   as a single load.

2. Aligns the rate-limit clock with the clock the rate-limited data is
   computed on. cfs_rq->avg.load_avg (the value being published to
   tg->load_avg) is computed by
   update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq) where
   cfs_rq_clock_pelt() is derived from rq->clock_pelt, which is
   derived from rq->clock_task. PELT intentionally excludes
   IRQ-handling time and steal time from its decay (commit
   23127296889f "sched/fair: Update scale invariance of PELT"), so
   cfs_rq->avg.load_avg already evolves in clock_task time. The
   rate-limit was the only outlier in the propagation chain using a
   different clock (wall time, via sched_clock_cpu).

Under normal load and on idle CPUs (where clock_task advances at
wall-clock rate) behaviour is unchanged. Under heavy IRQ load
clock_task advances slower than wall time, so the rate-limit fires
less often -- consistent with the fact that the underlying
cfs_rq->avg.load_avg is also changing slower under the same
conditions. The publish cadence tracks the signal.

Note: update_tg_load_avg() propagates the *already-decayed*
cfs_rq->avg.load_avg to tg->load_avg; it does not drive decay.
Decay happens in update_cfs_rq_load_avg() on the PELT clock regardless
of what clock the rate-limit uses, so this change cannot lose decay
information. The rate-limit governs how often we publish, not how fast
load decays.

All callers of update_tg_load_avg() and clear_tg_load_avg() hold
rq->lock and have called update_rq_clock(rq) within microseconds:

  caller                                   pre-state
  __update_blocked_fair                    encloser did update_rq_clock(rq)
  update_load_avg's three UPDATE_TG sites  under rq->lock after enqueue/dequeue/update_curr
  attach_/detach_entity_cfs_rq             preceded by update_load_avg(...)
  clear_tg_load_avg via offline path       rq_clock_start_loop_update(rq) upfront

so rq->clock_task is fresh at every call. Since cfs_rqs are per-CPU
per-task_group, cfs_rq->last_update_tg_load_avg is always compared
against the same rq's clock; no cross-rq drift.

The same hoisting pattern was recently applied to find_new_ilb() in
commit 76504bce4ee6 ("sched/fair: Get this cpu once in find_new_ilb()").

Signed-off-by: Rik van Riel
Assisted-by: Claude (Anthropic)
Reviewed-by: Aaron Lu
---
 kernel/sched/fair.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 863de57a8a2c..096bcb00fa62 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4429,8 +4429,15 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	/*
 	 * For migration heavy workloads, access to tg->load_avg can be
 	 * unbound. Limit the update rate to at most once per ms.
-	 */
-	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	 *
+	 * The enclosing PELT update paths always hold rq->lock and have
+	 * called update_rq_clock(rq) within microseconds, so rq->clock_task
+	 * is fresh. Use it instead of sched_clock_cpu() to avoid an rdtsc
+	 * (plus pipeline serialisation) per call -- this function is invoked
+	 * once per leaf cfs_rq in __update_blocked_fair(), so on hosts with
+	 * many cgroups the rdtsc cost dominates the rate-limit check itself.
+	 */
+	now = rq_clock_task(rq_of(cfs_rq));
 	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
 		return;
 
@@ -4453,7 +4460,8 @@ static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
-	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	/* See update_tg_load_avg() for the rq_clock_task() rationale. */
+	now = rq_clock_task(rq_of(cfs_rq));
 	delta = 0 - cfs_rq->tg_load_avg_contrib;
 	atomic_long_add(delta, &cfs_rq->tg->load_avg);
 	cfs_rq->tg_load_avg_contrib = 0;
-- 
2.53.0-Meta
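
Illustration only, not part of the patch: the changelog's claim that
the rate-limit fires less often under heavy IRQ load can be sketched
with a small user-space model. Everything here is hypothetical: struct
model_cfs_rq, tg_publish_due() and count_publishes() are made-up names,
and the tick and IRQ durations are arbitrary. The model also assumes
that every check passing the rate-limit results in a publish, which is
a simplification of the real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for the two fields the rate-limit check reads. */
struct model_cfs_rq {
	uint64_t clock_task;              /* models rq->clock_task */
	uint64_t last_update_tg_load_avg; /* models the cfs_rq timestamp */
};

/* Same shape as the patched check: allow a publish at most once per ms
 * of task clock. Assumption: every allowed check publishes. */
static bool tg_publish_due(struct model_cfs_rq *c)
{
	uint64_t now = c->clock_task;

	if (now - c->last_update_tg_load_avg < NSEC_PER_MSEC)
		return false;
	c->last_update_tg_load_avg = now;
	return true;
}

/*
 * Run `ticks` steps of `tick_ns` wall time each. `irq_ns` of every step
 * is treated as IRQ time and is not added to clock_task. Returns how
 * many checks got past the rate-limit.
 */
static unsigned int count_publishes(uint64_t tick_ns, uint64_t irq_ns,
				    unsigned int ticks)
{
	struct model_cfs_rq c = { 0, 0 };
	unsigned int published = 0;

	assert(irq_ns <= tick_ns);
	for (unsigned int i = 0; i < ticks; i++) {
		c.clock_task += tick_ns - irq_ns;
		if (tg_publish_due(&c))
			published++;
	}
	return published;
}
```

With 100 steps of 100 us and no IRQ time, clock_task covers 10 ms and
count_publishes(100000, 0, 100) lets 10 publishes through. With 75 us
of each step counted as IRQ time, clock_task covers only 2.5 ms and
count_publishes(100000, 75000, 100) lets 2 through, so in this model
the publish cadence follows the task clock rather than wall time.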