sched/fair: use rq_clock_task() in update_tg_load_avg() rate-limit

[PATCH] sched/fair: use rq_clock_task() in update_tg_load_avg() rate-limit

Posted by Rik van Riel 1 week, 5 days ago

update_tg_load_avg() is called once per leaf cfs_rq from the
__update_blocked_fair() walk that runs inside the NOHZ idle-balance
softirq, and again from update_load_avg() with UPDATE_TG.  Its first
operation after the trivial early-outs is unconditionally:

	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;

Jakub ran into a system where nohz_idle_balance() was taking 75%
of a CPU (which is handling network traffic and doing many irq_exit_cpu
calls), with 35% of that CPU spent in update_load_avg, and 17% of the
CPU in sched_clock_cpu(), reading the TSC.

There are some optimizations upstream already to reduce that overhead,
but it also looks like those rdtsc calls may not me necessary at all,
giving another easy win.

Switch the rate-limit to read rq_clock_task(rq_of(cfs_rq)) instead.
This does two things:

1. Eliminates the rdtsc.  rq->clock_task is already updated by the
   enclosing update_rq_clock(rq), sits in a hot cacheline, and reads
   as a single load.

2. Aligns the rate-limit clock with the clock the rate-limited data
   is computed on.  cfs_rq->avg.load_avg (the value being published
   to tg->load_avg) is computed by
     update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)
   where cfs_rq_clock_pelt() is derived from rq->clock_pelt, which is
   derived from rq->clock_task.  PELT intentionally excludes IRQ-handling
   time and steal time from its decay (commit 23127296889f
   "sched/fair: Update scale invariance of PELT"), so
   cfs_rq->avg.load_avg already evolves in clock_task time.  The
   rate-limit was the only outlier in the propagation chain using a
   different clock (wall time, via sched_clock_cpu).

   Under normal load and on idle CPUs (where clock_task advances at
   wall-clock rate) behaviour is unchanged.  Under heavy IRQ load
   clock_task advances slower than wall time, so the rate-limit fires
   less often -- consistent with the fact that the underlying
   cfs_rq->avg.load_avg is also changing slower under the same
   conditions.  The publish cadence tracks the signal.

Note: update_tg_load_avg() propagates the *already-decayed*
cfs_rq->avg.load_avg to tg->load_avg; it does not drive decay.  Decay
happens in update_cfs_rq_load_avg() on the PELT clock regardless of
what clock the rate-limit uses, so this change cannot lose decay
information.  The rate-limit governs how often we publish, not how
fast load decays.

All callers of update_tg_load_avg() and clear_tg_load_avg() hold
rq->lock and have called update_rq_clock(rq) within microseconds:

  caller                                   pre-state
  __update_blocked_fair                    encloser did update_rq_clock(rq)
  update_load_avg's three UPDATE_TG sites  under rq->lock after enqueue/dequeue/update_curr
  attach_/detach_entity_cfs_rq             preceded by update_load_avg(...)
  clear_tg_load_avg via offline path       rq_clock_start_loop_update(rq) upfront

so rq->clock_task is fresh at every call.  Since cfs_rqs are per-CPU
per-task_group, cfs_rq->last_update_tg_load_avg is always compared
against the same rq's clock; no cross-rq drift.

The same hoisting pattern was recently applied to find_new_ilb() in
commit 76504bce4ee6 ("sched/fair: Get this cpu once in find_new_ilb()").

Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude (Anthropic)
---
 kernel/sched/fair.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 863de57a8a2c..096bcb00fa62 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4429,8 +4429,15 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	/*
 	 * For migration heavy workloads, access to tg->load_avg can be
 	 * unbound. Limit the update rate to at most once per ms.
-	 */
-	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	 *
+	 * The enclosing PELT update paths always hold rq->lock and have
+	 * called update_rq_clock(rq) within microseconds, so rq->clock_task
+	 * is fresh.  Use it instead of sched_clock_cpu() to avoid an rdtsc
+	 * (plus pipeline serialisation) per call -- this function is invoked
+	 * once per leaf cfs_rq in __update_blocked_fair(), so on hosts with
+	 * many cgroups the rdtsc cost dominates the rate-limit check itself.
+	 */
+	now = rq_clock_task(rq_of(cfs_rq));
 	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
 		return;
 
@@ -4453,7 +4460,8 @@ static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
-	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	/* See update_tg_load_avg() for the rq_clock_task() rationale. */
+	now = rq_clock_task(rq_of(cfs_rq));
 	delta = 0 - cfs_rq->tg_load_avg_contrib;
 	atomic_long_add(delta, &cfs_rq->tg->load_avg);
 	cfs_rq->tg_load_avg_contrib = 0;
-- 
2.53.0-Meta

Re: [PATCH] sched/fair: use rq_clock_task() in update_tg_load_avg() rate-limit

Posted by Vincent Guittot 1 week, 5 days ago

On Wed, 27 May 2026 at 03:32, Rik van Riel <riel@surriel.com> wrote:
>
> update_tg_load_avg() is called once per leaf cfs_rq from the
> __update_blocked_fair() walk that runs inside the NOHZ idle-balance
> softirq, and again from update_load_avg() with UPDATE_TG.  Its first
> operation after the trivial early-outs is unconditionally:
>
>         now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
>         if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
>                 return;
>
> Jakub ran into a system where nohz_idle_balance() was taking 75%
> of a CPU (which is handling network traffic and doing many irq_exit_cpu
> calls), with 35% of that CPU spent in update_load_avg, and 17% of the
> CPU in sched_clock_cpu(), reading the TSC.

Do you have figures showing the improvement after applying this patch
compared to the figures above?

>
> There are some optimizations upstream already to reduce that overhead,
> but it also looks like those rdtsc calls may not me necessary at all,

s/me/be/

> giving another easy win.
>
> Switch the rate-limit to read rq_clock_task(rq_of(cfs_rq)) instead.
> This does two things:
>
> 1. Eliminates the rdtsc.  rq->clock_task is already updated by the
>    enclosing update_rq_clock(rq), sits in a hot cacheline, and reads
>    as a single load.

AFAICT this is the primary target of the patch because other clock
sources don't have such overhead

>
> 2. Aligns the rate-limit clock with the clock the rate-limited data
>    is computed on.  cfs_rq->avg.load_avg (the value being published
>    to tg->load_avg) is computed by
>      update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)
>    where cfs_rq_clock_pelt() is derived from rq->clock_pelt, which is
>    derived from rq->clock_task.  PELT intentionally excludes IRQ-handling
>    time and steal time from its decay (commit 23127296889f
>    "sched/fair: Update scale invariance of PELT"), so
>    cfs_rq->avg.load_avg already evolves in clock_task time.  The
>    rate-limit was the only outlier in the propagation chain using a
>    different clock (wall time, via sched_clock_cpu).

Migrating a task is another way to update cfs_rq->avg.load_avg and
this is not related to how fast rq->clock_task moves forward

>
>    Under normal load and on idle CPUs (where clock_task advances at
>    wall-clock rate) behaviour is unchanged.  Under heavy IRQ load
>    clock_task advances slower than wall time, so the rate-limit fires
>    less often -- consistent with the fact that the underlying
>    cfs_rq->avg.load_avg is also changing slower under the same
>    conditions.  The publish cadence tracks the signal.

As mentioned above, cfs_rq->avg.load_avg can change outside the normal
PELT tracking.
It would be good to check that we do not end up in the opposite
situation where the tg->load_avg becomes stalled because
rq->clock_task doesn't advance, even though load_avg has changed
significantly.

So before moving to this new clock, we must ensure it will not trigger
other issues.

>
> Note: update_tg_load_avg() propagates the *already-decayed*
> cfs_rq->avg.load_avg to tg->load_avg; it does not drive decay.  Decay
> happens in update_cfs_rq_load_avg() on the PELT clock regardless of
> what clock the rate-limit uses, so this change cannot lose decay
> information.  The rate-limit governs how often we publish, not how
> fast load decays.
>
> All callers of update_tg_load_avg() and clear_tg_load_avg() hold
> rq->lock and have called update_rq_clock(rq) within microseconds:
>
>   caller                                   pre-state
>   __update_blocked_fair                    encloser did update_rq_clock(rq)
>   update_load_avg's three UPDATE_TG sites  under rq->lock after enqueue/dequeue/update_curr
>   attach_/detach_entity_cfs_rq             preceded by update_load_avg(...)
>   clear_tg_load_avg via offline path       rq_clock_start_loop_update(rq) upfront
>
> so rq->clock_task is fresh at every call.  Since cfs_rqs are per-CPU
> per-task_group, cfs_rq->last_update_tg_load_avg is always compared
> against the same rq's clock; no cross-rq drift.
>
> The same hoisting pattern was recently applied to find_new_ilb() in
> commit 76504bce4ee6 ("sched/fair: Get this cpu once in find_new_ilb()").

I'm not sure how I should link this patch to the one mentioned above

>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Assisted-by: Claude (Anthropic)
> ---
>  kernel/sched/fair.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 863de57a8a2c..096bcb00fa62 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4429,8 +4429,15 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>         /*
>          * For migration heavy workloads, access to tg->load_avg can be
>          * unbound. Limit the update rate to at most once per ms.
> -        */
> -       now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> +        *
> +        * The enclosing PELT update paths always hold rq->lock and have
> +        * called update_rq_clock(rq) within microseconds, so rq->clock_task
> +        * is fresh.  Use it instead of sched_clock_cpu() to avoid an rdtsc
> +        * (plus pipeline serialisation) per call -- this function is invoked
> +        * once per leaf cfs_rq in __update_blocked_fair(), so on hosts with
> +        * many cgroups the rdtsc cost dominates the rate-limit check itself.
> +        */
> +       now = rq_clock_task(rq_of(cfs_rq));
>         if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
>                 return;
>
> @@ -4453,7 +4460,8 @@ static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
>         if (cfs_rq->tg == &root_task_group)
>                 return;
>
> -       now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> +       /* See update_tg_load_avg() for the rq_clock_task() rationale. */
> +       now = rq_clock_task(rq_of(cfs_rq));
>         delta = 0 - cfs_rq->tg_load_avg_contrib;
>         atomic_long_add(delta, &cfs_rq->tg->load_avg);
>         cfs_rq->tg_load_avg_contrib = 0;
> --
> 2.53.0-Meta
>

Re: [PATCH] sched/fair: use rq_clock_task() in update_tg_load_avg() rate-limit

Posted by Vincent Guittot 1 week, 5 days ago

On Wed, 27 May 2026 at 09:37, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Wed, 27 May 2026 at 03:32, Rik van Riel <riel@surriel.com> wrote:
> >
> > update_tg_load_avg() is called once per leaf cfs_rq from the
> > __update_blocked_fair() walk that runs inside the NOHZ idle-balance
> > softirq, and again from update_load_avg() with UPDATE_TG.  Its first
> > operation after the trivial early-outs is unconditionally:
> >
> >         now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> >         if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> >                 return;
> >
> > Jakub ran into a system where nohz_idle_balance() was taking 75%
> > of a CPU (which is handling network traffic and doing many irq_exit_cpu
> > calls), with 35% of that CPU spent in update_load_avg, and 17% of the
> > CPU in sched_clock_cpu(), reading the TSC.
>
> Do you have figures showing the improvement after applying this patch
> compared to the figures above?
>
> >
> > There are some optimizations upstream already to reduce that overhead,
> > but it also looks like those rdtsc calls may not me necessary at all,
>
> s/me/be/
>
> > giving another easy win.
> >
> > Switch the rate-limit to read rq_clock_task(rq_of(cfs_rq)) instead.
> > This does two things:
> >
> > 1. Eliminates the rdtsc.  rq->clock_task is already updated by the
> >    enclosing update_rq_clock(rq), sits in a hot cacheline, and reads
> >    as a single load.
>
> AFAICT this is the primary target of the patch because other clock
> sources don't have such overhead
>
> >
> > 2. Aligns the rate-limit clock with the clock the rate-limited data
> >    is computed on.  cfs_rq->avg.load_avg (the value being published
> >    to tg->load_avg) is computed by
> >      update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)
> >    where cfs_rq_clock_pelt() is derived from rq->clock_pelt, which is
> >    derived from rq->clock_task.  PELT intentionally excludes IRQ-handling
> >    time and steal time from its decay (commit 23127296889f
> >    "sched/fair: Update scale invariance of PELT"), so
> >    cfs_rq->avg.load_avg already evolves in clock_task time.  The
> >    rate-limit was the only outlier in the propagation chain using a
> >    different clock (wall time, via sched_clock_cpu).
>
> Migrating a task is another way to update cfs_rq->avg.load_avg and
> this is not related to how fast rq->clock_task moves forward
>
> >
> >    Under normal load and on idle CPUs (where clock_task advances at
> >    wall-clock rate) behaviour is unchanged.  Under heavy IRQ load
> >    clock_task advances slower than wall time, so the rate-limit fires
> >    less often -- consistent with the fact that the underlying
> >    cfs_rq->avg.load_avg is also changing slower under the same
> >    conditions.  The publish cadence tracks the signal.
>
> As mentioned above, cfs_rq->avg.load_avg can change outside the normal
> PELT tracking.
> It would be good to check that we do not end up in the opposite
> situation where the tg->load_avg becomes stalled because
> rq->clock_task doesn't advance, even though load_avg has changed
> significantly.
>
> So before moving to this new clock, we must ensure it will not trigger
> other issues.
>
> >
> > Note: update_tg_load_avg() propagates the *already-decayed*
> > cfs_rq->avg.load_avg to tg->load_avg; it does not drive decay.  Decay
> > happens in update_cfs_rq_load_avg() on the PELT clock regardless of
> > what clock the rate-limit uses, so this change cannot lose decay
> > information.  The rate-limit governs how often we publish, not how
> > fast load decays.
> >
> > All callers of update_tg_load_avg() and clear_tg_load_avg() hold
> > rq->lock and have called update_rq_clock(rq) within microseconds:
> >
> >   caller                                   pre-state
> >   __update_blocked_fair                    encloser did update_rq_clock(rq)
> >   update_load_avg's three UPDATE_TG sites  under rq->lock after enqueue/dequeue/update_curr
> >   attach_/detach_entity_cfs_rq             preceded by update_load_avg(...)
> >   clear_tg_load_avg via offline path       rq_clock_start_loop_update(rq) upfront
> >
> > so rq->clock_task is fresh at every call.  Since cfs_rqs are per-CPU
> > per-task_group, cfs_rq->last_update_tg_load_avg is always compared
> > against the same rq's clock; no cross-rq drift.
> >
> > The same hoisting pattern was recently applied to find_new_ilb() in
> > commit 76504bce4ee6 ("sched/fair: Get this cpu once in find_new_ilb()").
>
> I'm not sure how I should link this patch to the one mentioned above
>
> >
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > Assisted-by: Claude (Anthropic)
> > ---
> >  kernel/sched/fair.c | 14 +++++++++++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 863de57a8a2c..096bcb00fa62 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4429,8 +4429,15 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >         /*
> >          * For migration heavy workloads, access to tg->load_avg can be
> >          * unbound. Limit the update rate to at most once per ms.
> > -        */
> > -       now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> > +        *
> > +        * The enclosing PELT update paths always hold rq->lock and have
> > +        * called update_rq_clock(rq) within microseconds, so rq->clock_task
> > +        * is fresh.  Use it instead of sched_clock_cpu() to avoid an rdtsc
> > +        * (plus pipeline serialisation) per call -- this function is invoked
> > +        * once per leaf cfs_rq in __update_blocked_fair(), so on hosts with
> > +        * many cgroups the rdtsc cost dominates the rate-limit check itself.
> > +        */
> > +       now = rq_clock_task(rq_of(cfs_rq));

Why not using rq_clock()  ? This removes rdtsc call and still move
forward whatever the irq pressure

> >         if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> >                 return;
> >
> > @@ -4453,7 +4460,8 @@ static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
> >         if (cfs_rq->tg == &root_task_group)
> >                 return;
> >
> > -       now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> > +       /* See update_tg_load_avg() for the rq_clock_task() rationale. */
> > +       now = rq_clock_task(rq_of(cfs_rq));
> >         delta = 0 - cfs_rq->tg_load_avg_contrib;
> >         atomic_long_add(delta, &cfs_rq->tg->load_avg);
> >         cfs_rq->tg_load_avg_contrib = 0;
> > --
> > 2.53.0-Meta
> >

Re: [PATCH] sched/fair: use rq_clock_task() in update_tg_load_avg() rate-limit

Posted by Rik van Riel 1 week, 5 days ago

On Wed, 2026-05-27 at 10:17 +0200, Vincent Guittot wrote:
> On Wed, 27 May 2026 at 09:37, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> > 
> > On Wed, 27 May 2026 at 03:32, Rik van Riel <riel@surriel.com>
> > wrote:
> > > 
> > > +++ b/kernel/sched/fair.c
> > > @@ -4429,8 +4429,15 @@ static inline void
> > > update_tg_load_avg(struct cfs_rq *cfs_rq)
> > >         /*
> > >          * For migration heavy workloads, access to tg->load_avg
> > > can be
> > >          * unbound. Limit the update rate to at most once per ms.
> > > -        */
> > > -       now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> > > +        *
> > > +        * The enclosing PELT update paths always hold rq->lock
> > > and have
> > > +        * called update_rq_clock(rq) within microseconds, so rq-
> > > >clock_task
> > > +        * is fresh.  Use it instead of sched_clock_cpu() to
> > > avoid an rdtsc
> > > +        * (plus pipeline serialisation) per call -- this
> > > function is invoked
> > > +        * once per leaf cfs_rq in __update_blocked_fair(), so on
> > > hosts with
> > > +        * many cgroups the rdtsc cost dominates the rate-limit
> > > check itself.
> > > +        */
> > > +       now = rq_clock_task(rq_of(cfs_rq));
> 
> Why not using rq_clock()  ? This removes rdtsc call and still move
> forward whatever the irq pressure

Good idea, that does seem like a better clock to use!

Does that also automatically eliminate the worry about
task migration, since update_rq_clock() gets called
whenever a task is attached, rqs get attached/detached,
and in various other points including the load balancing
code?


-- 
All Rights Reversed.

Re: [PATCH] sched/fair: use rq_clock_task() in update_tg_load_avg() rate-limit

Posted by Vincent Guittot 1 week, 5 days ago

On Wed, 27 May 2026 at 13:52, Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2026-05-27 at 10:17 +0200, Vincent Guittot wrote:
> > On Wed, 27 May 2026 at 09:37, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Wed, 27 May 2026 at 03:32, Rik van Riel <riel@surriel.com>
> > > wrote:
> > > >
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -4429,8 +4429,15 @@ static inline void
> > > > update_tg_load_avg(struct cfs_rq *cfs_rq)
> > > >         /*
> > > >          * For migration heavy workloads, access to tg->load_avg
> > > > can be
> > > >          * unbound. Limit the update rate to at most once per ms.
> > > > -        */
> > > > -       now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> > > > +        *
> > > > +        * The enclosing PELT update paths always hold rq->lock
> > > > and have
> > > > +        * called update_rq_clock(rq) within microseconds, so rq-
> > > > >clock_task
> > > > +        * is fresh.  Use it instead of sched_clock_cpu() to
> > > > avoid an rdtsc
> > > > +        * (plus pipeline serialisation) per call -- this
> > > > function is invoked
> > > > +        * once per leaf cfs_rq in __update_blocked_fair(), so on
> > > > hosts with
> > > > +        * many cgroups the rdtsc cost dominates the rate-limit
> > > > check itself.
> > > > +        */
> > > > +       now = rq_clock_task(rq_of(cfs_rq));
> >
> > Why not using rq_clock()  ? This removes rdtsc call and still move
> > forward whatever the irq pressure
>
> Good idea, that does seem like a better clock to use!
>
> Does that also automatically eliminate the worry about
> task migration, since update_rq_clock() gets called
> whenever a task is attached, rqs get attached/detached,
> and in various other points including the load balancing
> code?

Yes, we move forward and can't get stuck by irq steal time with
rq_clock like the current implementation with the benefit of not
calling sched_clock_cpu() multiple times for the same scheduling event

>
>
> --
> All Rights Reversed.

Re: [PATCH] sched/fair: use rq_clock_task() in update_tg_load_avg() rate-limit

Posted by Aaron Lu 1 week, 5 days ago

On Tue, May 26, 2026 at 09:31:52PM -0400, Rik van Riel wrote:
> update_tg_load_avg() is called once per leaf cfs_rq from the
> __update_blocked_fair() walk that runs inside the NOHZ idle-balance
> softirq, and again from update_load_avg() with UPDATE_TG.  Its first
> operation after the trivial early-outs is unconditionally:
> 
> 	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> 	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> 		return;
> 
> Jakub ran into a system where nohz_idle_balance() was taking 75%
> of a CPU (which is handling network traffic and doing many irq_exit_cpu
> calls), with 35% of that CPU spent in update_load_avg, and 17% of the
> CPU in sched_clock_cpu(), reading the TSC.
> 
> There are some optimizations upstream already to reduce that overhead,
> but it also looks like those rdtsc calls may not me necessary at all,
> giving another easy win.
> 
> Switch the rate-limit to read rq_clock_task(rq_of(cfs_rq)) instead.
> This does two things:
> 
> 1. Eliminates the rdtsc.  rq->clock_task is already updated by the
>    enclosing update_rq_clock(rq), sits in a hot cacheline, and reads
>    as a single load.
> 
> 2. Aligns the rate-limit clock with the clock the rate-limited data
>    is computed on.  cfs_rq->avg.load_avg (the value being published
>    to tg->load_avg) is computed by
>      update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)
>    where cfs_rq_clock_pelt() is derived from rq->clock_pelt, which is
>    derived from rq->clock_task.  PELT intentionally excludes IRQ-handling
>    time and steal time from its decay (commit 23127296889f
>    "sched/fair: Update scale invariance of PELT"), so
>    cfs_rq->avg.load_avg already evolves in clock_task time.  The
>    rate-limit was the only outlier in the propagation chain using a
>    different clock (wall time, via sched_clock_cpu).
> 
>    Under normal load and on idle CPUs (where clock_task advances at
>    wall-clock rate) behaviour is unchanged.  Under heavy IRQ load
>    clock_task advances slower than wall time, so the rate-limit fires
>    less often -- consistent with the fact that the underlying
>    cfs_rq->avg.load_avg is also changing slower under the same
>    conditions.  The publish cadence tracks the signal.
> 
> Note: update_tg_load_avg() propagates the *already-decayed*
> cfs_rq->avg.load_avg to tg->load_avg; it does not drive decay.  Decay
> happens in update_cfs_rq_load_avg() on the PELT clock regardless of
> what clock the rate-limit uses, so this change cannot lose decay
> information.  The rate-limit governs how often we publish, not how
> fast load decays.
> 
> All callers of update_tg_load_avg() and clear_tg_load_avg() hold
> rq->lock and have called update_rq_clock(rq) within microseconds:
> 
>   caller                                   pre-state
>   __update_blocked_fair                    encloser did update_rq_clock(rq)
>   update_load_avg's three UPDATE_TG sites  under rq->lock after enqueue/dequeue/update_curr
>   attach_/detach_entity_cfs_rq             preceded by update_load_avg(...)
>   clear_tg_load_avg via offline path       rq_clock_start_loop_update(rq) upfront
> 
> so rq->clock_task is fresh at every call.  Since cfs_rqs are per-CPU
> per-task_group, cfs_rq->last_update_tg_load_avg is always compared
> against the same rq's clock; no cross-rq drift.
> 
> The same hoisting pattern was recently applied to find_new_ilb() in
> commit 76504bce4ee6 ("sched/fair: Get this cpu once in find_new_ilb()").
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Assisted-by: Claude (Anthropic)

Looks good to me.
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>