sched/fair: Use rq_clock() in update_tg_load_avg() rate-limit

Posted by tip-bot2 for Rik van Riel
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     3b7be8e7fa698359616c3276e005f08c3b6070e4
Gitweb:        https://git.kernel.org/tip/3b7be8e7fa698359616c3276e005f08c3b6070e4
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 26 May 2026 12:43:29 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 29 May 2026 12:43:16 +02:00

sched/fair: Use rq_clock() in update_tg_load_avg() rate-limit

update_tg_load_avg() is called once per leaf cfs_rq from the
__update_blocked_fair() walk that runs inside the NOHZ idle-balance
softirq, and again from update_load_avg() with UPDATE_TG.  Its first
operation after the trivial early-outs is unconditionally:

	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;

Jakub ran into a system where nohz_idle_balance() was taking 75%
of a CPU (one that was handling network traffic and making many
irq_exit_rcu() calls), with 35% of that CPU spent in update_load_avg()
and 17% of the CPU in sched_clock_cpu(), reading the TSC.

In a quick synthetic test, this patch appears to reduce the CPU use
of sched_balance_update_blocked_averages() by about 20%.

Switch the rate-limit to read rq_clock(rq_of(cfs_rq)) instead.
This eliminates the rdtsc while still using a fairly fresh timestamp,
because all callers of update_tg_load_avg() and clear_tg_load_avg()
hold rq->lock and have called update_rq_clock(rq) within microseconds:

  caller                                   pre-state
  __update_blocked_fair                    encloser did update_rq_clock(rq)
  update_load_avg's three UPDATE_TG sites  under rq->lock after enqueue/dequeue/update_curr
  attach_/detach_entity_cfs_rq             preceded by update_load_avg(...)
  clear_tg_load_avg via offline path       rq_clock_start_loop_update(rq) upfront

so rq->clock is fresh at every call.  Since cfs_rqs are per-CPU
per-task_group, cfs_rq->last_update_tg_load_avg is always compared
against the same rq's clock; no cross-rq drift.
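
To illustrate the idea outside the kernel, here is a minimal user-space
sketch of rate-limiting against a cached per-queue timestamp rather than
a fresh hardware clock read. It is not the kernel code: every toy_*
name and TOY_NSEC_PER_MSEC are hypothetical stand-ins, the caller passes
in the "hardware" time explicitly, and the sketch omits everything else
the real update_tg_load_avg() does after the rate-limit check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NSEC_PER_MSEC 1000000ULL	/* hypothetical, mirrors NSEC_PER_MSEC */

/* Hypothetical stand-in for struct rq: it only holds a cached clock. */
struct toy_rq {
	uint64_t clock;			/* cached timestamp in ns */
};

/* Hypothetical stand-in for struct cfs_rq. */
struct toy_cfs_rq {
	struct toy_rq *rq;		/* the one rq this cfs_rq belongs to */
	uint64_t last_update;		/* plays the role of last_update_tg_load_avg */
	unsigned int updates;		/* counts updates that were not rate-limited */
};

/*
 * Plays the role of update_rq_clock(): the only place where a fresh
 * (expensive) clock reading is stored. Here the reading is an argument.
 */
static void toy_update_rq_clock(struct toy_rq *rq, uint64_t hw_now)
{
	rq->clock = hw_now;
}

/* Plays the role of rq_clock(): return the cached value, no clock read. */
static uint64_t toy_rq_clock(const struct toy_rq *rq)
{
	return rq->clock;
}

/* Returns true if the update ran, false if it was rate-limited. */
static bool toy_update_tg_load_avg(struct toy_cfs_rq *cfs_rq)
{
	uint64_t now = toy_rq_clock(cfs_rq->rq);

	/* At most one update per millisecond of the owning rq's clock. */
	if (now - cfs_rq->last_update < TOY_NSEC_PER_MSEC)
		return false;

	cfs_rq->last_update = now;
	cfs_rq->updates++;
	return true;
}
```

In this sketch the cost of the rate-limit check is a single load from
the toy_rq, and the timestamp is only as fresh as the most recent
toy_update_rq_clock() call, which is the trade-off described above.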

Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude (Anthropic)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260527110250.6a91718d@fangorn
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 62a2dcb..b5819c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4962,7 +4962,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	 * For migration heavy workloads, access to tg->load_avg can be
 	 * unbound. Limit the update rate to at most once per ms.
 	 */
-	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	now = rq_clock(rq_of(cfs_rq));
 	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
 		return;
 
@@ -4985,7 +4985,7 @@ static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
-	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	now = rq_clock(rq_of(cfs_rq));
 	delta = 0 - cfs_rq->tg_load_avg_contrib;
 	atomic_long_add(delta, &cfs_rq->tg->load_avg);
 	cfs_rq->tg_load_avg_contrib = 0;