[RFC] sched/fair: Mitigate the impact of retrieving tg->load_avg

Reduce the impact of update_cfs_group()/update_load_avg() by reducing
the frequency at which tg->load_avg is retrieved.

This is "the other side" of commit 1528c661c24b ("sched/fair:
Ratelimit update to tg->load_avg"): that commit reduced the frequency
of the "store" side, while this patch reduces the frequency of the
"load" side.

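For reference, the "store" side ratelimit that commit 1528c661c24b
added to update_tg_load_avg() works roughly like the sketch below (a
paraphrase from memory of kernel/sched/fair.c, not a verbatim copy, so
details may differ between trees):

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta;
	u64 now;

	/* root_task_group's load_avg is not used, skip it. */
	if (cfs_rq->tg == &root_task_group)
		return;

	/* Rate-limit stores into the shared tg->load_avg to once per ms. */
	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;

	/* Only publish the delta when it is large enough to matter. */
	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
		cfs_rq->last_update_tg_load_avg = now;
	}
}
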
Sending this as an RFC because I want to point out that there is still
contention when updating/loading the load_avg of a group; I do not
believe that this particular piece of code is the solution.

On an independent series[1] with a similar objective, it was pointed
out that the effort might be better spent on something like this:

https://lore.kernel.org/all/20190906191237.27006-1-riel@surriel.com/

Would that be the way to go?

Just to make it a bit more concrete, some perf numbers from running:

$ ./schbench -r 60
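
Call graphs like the ones below can be collected with something along
these lines (the perf invocation here is only illustrative, not
necessarily the exact one used for these numbers):

$ perf record -a -g -- ./schbench -r 60
$ perf report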

* current master:
-   68.38%     0.05%  schbench         schbench                     [.] worker_thread
   - 68.37% worker_thread
      - 56.70% asm_sysvec_apic_timer_interrupt
         - 56.10% sysvec_apic_timer_interrupt
            - 54.32% __sysvec_apic_timer_interrupt
               - 54.11% hrtimer_interrupt
                  - 49.99% __hrtimer_run_queues
                     - 48.08% tick_nohz_handler
                        - 47.02% update_process_times
                           - 39.41% sched_tick
                              - 27.31% task_tick_fair
                                   12.88% update_cfs_group
                                 - 9.61% update_load_avg
                                      3.52% __update_load_avg_cfs_rq
                                      0.72% __update_load_avg_se

* patched kernel:
-   66.27%     0.05%  schbench         schbench                     [.] worker_thread
   - 66.26% worker_thread
      - 52.47% asm_sysvec_apic_timer_interrupt
         - 51.87% sysvec_apic_timer_interrupt
            - 50.19% __sysvec_apic_timer_interrupt
               - 49.97% hrtimer_interrupt
                  - 45.06% __hrtimer_run_queues
                     - 42.77% tick_nohz_handler
                        - 41.64% update_process_times
                           - 33.32% sched_tick
                              - 19.33% task_tick_fair
                                 - 7.72% update_load_avg
                                      4.24% __update_load_avg_cfs_rq
                                      0.80% __update_load_avg_se
                                   6.63% update_cfs_group

I can see some improvements in the schbench results, but they seem to
be within the noise.

[1] https://lore.kernel.org/all/20250605142851.GU39944@noisy.programming.kicks-ass.net/

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 kernel/sched/fair.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..c23c6e45f49d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3980,6 +3980,7 @@ static void update_cfs_group(struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long shares;
+	u64 now;
 
 	/*
 	 * When a group becomes empty, preserve its weight. This matters for
@@ -3991,6 +3992,14 @@ static void update_cfs_group(struct sched_entity *se)
 	if (throttled_hierarchy(gcfs_rq))
 		return;
 
+	/*
+	 * For migration heavy workloads, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(gcfs_rq)));
+	if (now - gcfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
 #ifndef CONFIG_SMP
 	shares = READ_ONCE(gcfs_rq->tg->shares);
 #else
-- 
2.50.1