[RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation

hupu posted 1 patch 3 weeks, 1 day ago
kernel/sched/fair.c  | 8 +++++---
kernel/sched/sched.h | 1 +
2 files changed, 6 insertions(+), 3 deletions(-)
[RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation
Posted by hupu 3 weeks, 1 day ago
Currently, load_sum to be propagated is estimated from
(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
runnable_avg as an approximation. This approach can introduce precision
loss due to the shift operation, and the error may become more visible
when small tasks frequently enter and leave the queue.

This patch introduces removed.load_sum to directly accumulate
se->avg.load_sum when tasks dequeue, and uses it during load
propagation. By doing so:

  a) Avoid relying on runnable_avg-based approximation and obtain
     higher precision in load_sum propagation;
  b) Eliminate precision loss from the shift operation, especially
     with frequent short-lived tasks;
  c) Reduce one multiplication and shift in the hotpath, which
     theoretically lowers overhead (though the impact is minor).

Signed-off-by: hupu <hupu.gm@gmail.com>
---
 kernel/sched/fair.c  | 8 +++++---
 kernel/sched/sched.h | 1 +
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..b8cf3687804b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4561,7 +4561,8 @@ static void migrate_se_pelt_lag(struct sched_entity *se) {}
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
+	unsigned long removed_load_sum = 0, removed_load = 0;
+	unsigned long removed_util = 0, removed_runnable = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed = 0;
 
@@ -4572,6 +4573,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
 		swap(cfs_rq->removed.load_avg, removed_load);
+		swap(cfs_rq->removed.load_sum, removed_load_sum);
 		swap(cfs_rq->removed.runnable_avg, removed_runnable);
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
@@ -4609,8 +4611,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		 * removed_runnable is the unweighted version of removed_load so we
 		 * can use it to estimate removed_load_sum.
 		 */
-		add_tg_cfs_propagate(cfs_rq,
-			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
+		add_tg_cfs_propagate(cfs_rq, -(long)removed_load_sum);
 
 		decayed = 1;
 	}
@@ -4792,6 +4793,7 @@ static void remove_entity_load_avg(struct sched_entity *se)
 	++cfs_rq->removed.nr;
 	cfs_rq->removed.util_avg	+= se->avg.util_avg;
 	cfs_rq->removed.load_avg	+= se->avg.load_avg;
+	cfs_rq->removed.load_sum	+= se->avg.load_sum;
 	cfs_rq->removed.runnable_avg	+= se->avg.runnable_avg;
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..659935a5c694 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -682,6 +682,7 @@ struct cfs_rq {
 	struct {
 		raw_spinlock_t	lock ____cacheline_aligned;
 		int		nr;
+		unsigned long	load_sum;
 		unsigned long	load_avg;
 		unsigned long	util_avg;
 		unsigned long	runnable_avg;
-- 
2.43.0
Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation
Posted by Pierre Gondois 2 days, 12 hours ago
Hello Hupu,

On 9/10/25 10:43, hupu wrote:
> Currently, load_sum to be propagated is estimated from
> (removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
> runnable_avg as an approximation. This approach can introduce precision
> loss due to the shift operation, and the error may become more visible
> when small tasks frequently enter and leave the queue.
>
> This patch introduces removed.load_sum to directly accumulate
> se->avg.load_sum when tasks dequeue, and uses it during load
> propagation. By doing so:
>
>    a) Avoid relying on runnable_avg-based approximation and obtain
>       higher precision in load_sum propagation;
(runnable_sum == load_sum) is not exactly accurate anymore since:
static inline long se_runnable(struct sched_entity *se)
{
     if (se->sched_delayed)
         return false;

     return !!se->on_rq;
}

So obtaining load_[sum|avg] from the runnable_avg signal seems compromised.

It is possible to compute the load_sum value without the runnable signal, cf.
40f5aa4c5eae ("sched/pelt: Fix attach_entity_load_avg() corner case")
https://lore.kernel.org/all/20220414090229.342-1-kuyo.chang@mediatek.com/T/#u

I.e.:
+       se->avg.load_sum = se->avg.load_avg * divider;
+       if (se_weight(se) < se->avg.load_sum)
+               se->avg.load_sum = div_u64(se->avg.load_sum, se_weight(se));
+       else
+               se->avg.load_sum = 1;

As a side note, as a counterpart of the above patch, the higher the nice
value, the lower the weight (in sched_prio_to_weight[]) and the lower the
task load signal.
This means that the unweighted load_sum value loses granularity.
E.g.:
A task with weight=15 can have load_avg values in [0:15]. So all the values
for load_sum in the range [X * (47742/15) : (X + 1) * (47742/15)]
are floored to load_avg=X, but load_sum is not reset when computing 
load_avg.
attach_entity_load_avg() however resets load_sum to X * (47742/15).
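For illustration, a standalone toy sketch of this arithmetic (not kernel
code; weight=15 and divider=47742 are assumed, and the reconstruction
follows the hunk quoted above):

#include <stdio.h>

int main(void)
{
	unsigned long weight = 15, divider = 47742, load_sum;

	for (load_sum = 3180; load_sum <= 3185; load_sum++) {
		unsigned long load_avg = load_sum * weight / divider;
		/* reconstruction as in 40f5aa4c5eae */
		unsigned long rebuilt = load_avg * divider;

		if (weight < rebuilt)
			rebuilt /= weight;
		else
			rebuilt = 1;

		printf("load_sum=%lu load_avg=%lu rebuilt=%lu\n",
		       load_sum, load_avg, rebuilt);
	}
	return 0;
}

Here 3183..3185 all collapse to load_avg=1 and come back as load_sum=3182,
while 3180..3182 collapse to load_avg=0 and come back as 1.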

>    b) Eliminate precision loss from the shift operation, especially
>       with frequent short-lived tasks;

It might also help aggregating multiple tasks. Right now, if 1000 tasks
with load_sum=1 and load_avg=0 are removed, the rq's load signal will
not be impacted at all.

On the other side, there might also be questions about the PELT windows
of all these tasks and the rq being aligned, cf.
https://lore.kernel.org/all/20170512171336.148578996@infradead.org/

>    c) Reduce one multiplication and shift in the hotpath, which
>       theoretically lowers overhead (though the impact is minor).
>
> Signed-off-by: hupu <hupu.gm@gmail.com>
> ---
>   kernel/sched/fair.c  | 8 +++++---
>   kernel/sched/sched.h | 1 +
>   2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c..b8cf3687804b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4561,7 +4561,8 @@ static void migrate_se_pelt_lag(struct sched_entity *se) {}
>   static inline int
>   update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>   {
> -	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> +	unsigned long removed_load_sum = 0, removed_load = 0;
> +	unsigned long removed_util = 0, removed_runnable = 0;
>   	struct sched_avg *sa = &cfs_rq->avg;
>   	int decayed = 0;
>   
> @@ -4572,6 +4573,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>   		raw_spin_lock(&cfs_rq->removed.lock);
>   		swap(cfs_rq->removed.util_avg, removed_util);
>   		swap(cfs_rq->removed.load_avg, removed_load);
> +		swap(cfs_rq->removed.load_sum, removed_load_sum);
>   		swap(cfs_rq->removed.runnable_avg, removed_runnable);
>   		cfs_rq->removed.nr = 0;
>   		raw_spin_unlock(&cfs_rq->removed.lock);
> @@ -4609,8 +4611,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>   		 * removed_runnable is the unweighted version of removed_load so we
>   		 * can use it to estimate removed_load_sum.
>   		 */
> -		add_tg_cfs_propagate(cfs_rq,
> -			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> +		add_tg_cfs_propagate(cfs_rq, -(long)removed_load_sum);
>   
>   		decayed = 1;
>   	}
> @@ -4792,6 +4793,7 @@ static void remove_entity_load_avg(struct sched_entity *se)
>   	++cfs_rq->removed.nr;
>   	cfs_rq->removed.util_avg	+= se->avg.util_avg;
>   	cfs_rq->removed.load_avg	+= se->avg.load_avg;
> +	cfs_rq->removed.load_sum	+= se->avg.load_sum;
>   	cfs_rq->removed.runnable_avg	+= se->avg.runnable_avg;
>   	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
>   }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index be9745d104f7..659935a5c694 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -682,6 +682,7 @@ struct cfs_rq {
>   	struct {
>   		raw_spinlock_t	lock ____cacheline_aligned;
>   		int		nr;
> +		unsigned long	load_sum;
>   		unsigned long	load_avg;
>   		unsigned long	util_avg;
>   		unsigned long	runnable_avg;
Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation
Posted by Vincent Guittot 3 weeks ago
On Wed, 10 Sept 2025 at 10:43, hupu <hupu.gm@gmail.com> wrote:
>
> Currently, load_sum to be propagated is estimated from
> (removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
> runnable_avg as an approximation. This approach can introduce precision
> loss due to the shift operation, and the error may become more visible
> when small tasks frequently enter and leave the queue.

Do you have a level of error? Do you have a typical use case?

>
> This patch introduces removed.load_sum to directly accumulate
> se->avg.load_sum when tasks dequeue, and uses it during load
> propagation. By doing so:
>
>   a) Avoid relying on runnable_avg-based approximation and obtain
>      higher precision in load_sum propagation;
>   b) Eliminate precision loss from the shift operation, especially
>      with frequent short-lived tasks;
>   c) Reduce one multiplication and shift in the hotpath, which
>      theoretically lowers overhead (though the impact is minor).

This doesn't work because rq's load_sum tracks current weight whereas
se's load_sum doesn't include the weight which is only applied on se's
load_avg. So you can't directly add/sub se's load_sum and rq's
load_sum. Only load_avg of both se and rq use the same unit.
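(For illustration, a rough userspace sketch of that unit difference,
loosely modelled on ___update_load_sum()/___update_load_avg(); contrib,
weight and divider values are assumed:)

#include <stdio.h>

int main(void)
{
	unsigned long contrib = 1000, divider = 47742;
	unsigned long weight = 2048;	/* same weight for the se and the queue */

	/* se: time accrues unweighted, weight only enters load_avg */
	unsigned long se_load_sum = contrib;
	unsigned long se_load_avg = weight * se_load_sum / divider;

	/* cfs_rq: load_sum already carries the queue weight */
	unsigned long rq_load_sum = weight * contrib;
	unsigned long rq_load_avg = rq_load_sum / divider;

	printf("se : load_sum=%lu load_avg=%lu\n", se_load_sum, se_load_avg);
	printf("rq : load_sum=%lu load_avg=%lu\n", rq_load_sum, rq_load_avg);
	return 0;
}

The two load_sum values differ by the weight factor while the two
load_avg values come out identical.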

Also we don't want to track both load_sum and load_avg, only one is
enough and by the above it is load_avg

Then, why is it a problem only for load and not util or runnable?

>
> Signed-off-by: hupu <hupu.gm@gmail.com>
> ---
>  kernel/sched/fair.c  | 8 +++++---
>  kernel/sched/sched.h | 1 +
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c..b8cf3687804b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4561,7 +4561,8 @@ static void migrate_se_pelt_lag(struct sched_entity *se) {}
>  static inline int
>  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
> -       unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> +       unsigned long removed_load_sum = 0, removed_load = 0;
> +       unsigned long removed_util = 0, removed_runnable = 0;
>         struct sched_avg *sa = &cfs_rq->avg;
>         int decayed = 0;
>
> @@ -4572,6 +4573,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                 raw_spin_lock(&cfs_rq->removed.lock);
>                 swap(cfs_rq->removed.util_avg, removed_util);
>                 swap(cfs_rq->removed.load_avg, removed_load);
> +               swap(cfs_rq->removed.load_sum, removed_load_sum);
>                 swap(cfs_rq->removed.runnable_avg, removed_runnable);
>                 cfs_rq->removed.nr = 0;
>                 raw_spin_unlock(&cfs_rq->removed.lock);
> @@ -4609,8 +4611,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                  * removed_runnable is the unweighted version of removed_load so we
>                  * can use it to estimate removed_load_sum.
>                  */
> -               add_tg_cfs_propagate(cfs_rq,
> -                       -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> +               add_tg_cfs_propagate(cfs_rq, -(long)removed_load_sum);
>
>                 decayed = 1;
>         }
> @@ -4792,6 +4793,7 @@ static void remove_entity_load_avg(struct sched_entity *se)
>         ++cfs_rq->removed.nr;
>         cfs_rq->removed.util_avg        += se->avg.util_avg;
>         cfs_rq->removed.load_avg        += se->avg.load_avg;
> +       cfs_rq->removed.load_sum        += se->avg.load_sum;
>         cfs_rq->removed.runnable_avg    += se->avg.runnable_avg;
>         raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index be9745d104f7..659935a5c694 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -682,6 +682,7 @@ struct cfs_rq {
>         struct {
>                 raw_spinlock_t  lock ____cacheline_aligned;
>                 int             nr;
> +               unsigned long   load_sum;
>                 unsigned long   load_avg;
>                 unsigned long   util_avg;
>                 unsigned long   runnable_avg;
> --
> 2.43.0
>
Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation
Posted by hupu 2 weeks, 6 days ago
Hi Vincent Guittot,
Thank you very much for your reply.

On Thu, Sep 11, 2025 at 4:01 PM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Wed, 10 Sept 2025 at 10:43, hupu <hupu.gm@gmail.com> wrote:
> >
> > Currently, load_sum to be propagated is estimated from
> > (removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
> > runnable_avg as an approximation. This approach can introduce precision
> > loss due to the shift operation, and the error may become more visible
> > when small tasks frequently enter and leave the queue.
>
> Do you have a level of error? Do you have a typical use case?
>

In fact, I derived the error here from the underlying mathematical
relationship. The error mainly comes from two sources:
a) The approximation of load_sum using runnable_avg;
b) The truncation introduced by the right shift (SCHED_CAPACITY_SHIFT).

Below is the detailed derivation and explanation.

removed_runnable records the sum of se->avg.runnable_avg for tasks
that have migrated to another CPU. It represents the decayed
cumulative contribution of a task’s runtime, with the unit being
microseconds (μs). Right-shifting by SCHED_CAPACITY_SHIFT (10 bits) is
equivalent to truncating the low part below 1024 μs. In other words,
if a task has accumulated less than 1024 μs of runtime before
dequeueing, its load contribution will be completely discarded by the
shift operation. Even if the accumulated runtime exceeds 1024 μs, the
shift may still introduce up to nearly 1024 μs of truncation error
(about 1 ms).

For example, suppose a task has accumulated 4095 μs (4.095 ms) of
runtime on CPU0, then goes to sleep and is migrated to CPU1 upon
wakeup. Ideally, CPU0 should remove that task’s contribution fully.
However, after the shift the result is 4095 >> 10 = 3, i.e. 3072 μs
(about 3 ms), which means CPU0 will still retain about 1023 μs
(~= 1.023 ms) of the task’s contribution.

Experimental results also confirm this. By adding debug output before
add_tg_cfs_propagate and comparing the actual removed_load_sum (the
true value to be removed) with the approximate value obtained through
the shift, we observed that the latter is smaller in most cases. This
discrepancy is exactly due to the approximation and truncation
introduced by the shift.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
old mode 100644
new mode 100755
index b173a059315c..92396da04520
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4561,7 +4561,8 @@ static void migrate_se_pelt_lag(struct sched_entity *se) {}
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-       unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
+       unsigned long removed_load_sum = 0, removed_load = 0;
+       unsigned long removed_util = 0, removed_runnable = 0;
        struct sched_avg *sa = &cfs_rq->avg;
        int decayed = 0;

@@ -4572,6 +4573,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
                raw_spin_lock(&cfs_rq->removed.lock);
                swap(cfs_rq->removed.util_avg, removed_util);
                swap(cfs_rq->removed.load_avg, removed_load);
+               swap(cfs_rq->removed.load_sum, removed_load_sum);
                swap(cfs_rq->removed.runnable_avg, removed_runnable);
                cfs_rq->removed.nr = 0;
                raw_spin_unlock(&cfs_rq->removed.lock);
@@ -4609,8 +4611,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
                 * removed_runnable is the unweighted version of removed_load so we
                 * can use it to estimate removed_load_sum.
                 */
-               add_tg_cfs_propagate(cfs_rq,
-                       -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
+               trace_printk("DEBUG BYHP: removed_load_sum=%lu, raw_removed_runnable_sum=%lu\n",
+                               (long)removed_load_sum,
+                               (long)((removed_runnable * divider) >> SCHED_CAPACITY_SHIFT));
+               add_tg_cfs_propagate(cfs_rq, -(long)removed_load_sum);

                decayed = 1;
        }
@@ -4792,6 +4796,7 @@ static void remove_entity_load_avg(struct sched_entity *se)
        ++cfs_rq->removed.nr;
        cfs_rq->removed.util_avg        += se->avg.util_avg;
        cfs_rq->removed.load_avg        += se->avg.load_avg;
+       cfs_rq->removed.load_sum        += se->avg.load_sum;
        cfs_rq->removed.runnable_avg    += se->avg.runnable_avg;
        raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..659935a5c694 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -682,6 +682,7 @@ struct cfs_rq {
        struct {
                raw_spinlock_t  lock ____cacheline_aligned;
                int             nr;
+               unsigned long   load_sum;
                unsigned long   load_avg;
                unsigned long   util_avg;
                unsigned long   runnable_avg;

The logs are as follows: raw_removed_runnable_sum is often smaller
than removed_load_sum. This difference is exactly caused by the
approximate calculation and the truncation introduced by bit shifting.

   stress-ng-cpu-183     (-------) [001] dnh1.   144.338335:
update_load_avg: DEBUG BYHP: removed_load_sum=35463,
raw_removed_runnable_sum=35429
   stress-ng-cpu-184     (-------) [007] dNs1.   144.346203:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=20607, raw_removed_runnable_sum=20496
   stress-ng-cpu-185     (-------) [001] d.h1.   144.568803:
update_load_avg: DEBUG BYHP: removed_load_sum=0,
raw_removed_runnable_sum=0
   stress-ng-cpu-183     (-------) [000] d.h1.   145.526897:
update_load_avg: DEBUG BYHP: removed_load_sum=11103,
raw_removed_runnable_sum=11072
   stress-ng-cpu-183     (-------) [000] d.h1.   145.563980:
update_load_avg: DEBUG BYHP: removed_load_sum=0,
raw_removed_runnable_sum=0
   stress-ng-cpu-185     (-------) [002] d..2.   145.593563:
sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=0,
raw_removed_runnable_sum=0
   stress-ng-cpu-181     (-------) [005] d.s1.   145.653525:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=2537, raw_removed_runnable_sum=2508
   stress-ng-cpu-183     (-------) [003] d.s1.   145.657599:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=28510, raw_removed_runnable_sum=28473
   stress-ng-cpu-180     (-------) [007] d.h1.   146.049167:
update_load_avg: DEBUG BYHP: removed_load_sum=9548,
raw_removed_runnable_sum=9526
   stress-ng-cpu-184     (-------) [005] d.h1.   146.057200:
update_load_avg: DEBUG BYHP: removed_load_sum=5974,
raw_removed_runnable_sum=5963
   stress-ng-cpu-182     (-------) [000] d.s1.   146.062025:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=55, raw_removed_runnable_sum=45
      kcompactd0-65      (-------) [001] d..2.   146.095334:
update_load_avg: DEBUG BYHP: removed_load_sum=0,
raw_removed_runnable_sum=0
      kcompactd0-65      (-------) [001] d..2.   146.095433:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=17493, raw_removed_runnable_sum=17461
   stress-ng-cpu-186     (-------) [006] d.h1.   146.118910:
update_load_avg: DEBUG BYHP: removed_load_sum=11404,
raw_removed_runnable_sum=11389
   stress-ng-cpu-186     (-------) [000] d.h1.   147.112614:
update_load_avg: DEBUG BYHP: removed_load_sum=0,
raw_removed_runnable_sum=0
             cat-234     (-------) [005] d.s2.   147.161900:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=36778, raw_removed_runnable_sum=36768
   stress-ng-cpu-181     (-------) [004] d.h1.   147.406979:
update_load_avg: DEBUG BYHP: removed_load_sum=3029,
raw_removed_runnable_sum=3014
   stress-ng-cpu-185     (-------) [003] d.s1.   147.474502:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=1242, raw_removed_runnable_sum=1205
   stress-ng-cpu-186     (-------) [000] d.h1.   147.533368:
update_load_avg: DEBUG BYHP: removed_load_sum=11,
raw_removed_runnable_sum=0
   stress-ng-cpu-181     (-------) [001] d.s1.   148.341639:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=15837, raw_removed_runnable_sum=15804
     migration/7-51      (-------) [007] d..2.   148.384219:
sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=0,
raw_removed_runnable_sum=0
          <idle>-0       (-------) [004] d.s2.   148.431501:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=4292, raw_removed_runnable_sum=1924
             cat-234     (-------) [007] d.h1.   148.434474:
update_load_avg: DEBUG BYHP: removed_load_sum=10380,
raw_removed_runnable_sum=9945
   stress-ng-cpu-184     (-------) [001] d.h1.   148.853949:
update_load_avg: DEBUG BYHP: removed_load_sum=15896,
raw_removed_runnable_sum=15869
   stress-ng-cpu-185     (-------) [007] d.s1.   148.862267:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=31, raw_removed_runnable_sum=0
   stress-ng-cpu-183     (-------) [006] d.h1.   149.157805:
update_load_avg: DEBUG BYHP: removed_load_sum=20553,
raw_removed_runnable_sum=20527
   stress-ng-cpu-179     (-------) [007] d.h1.   149.330189:
update_load_avg: DEBUG BYHP: removed_load_sum=9204,
raw_removed_runnable_sum=9177
   stress-ng-cpu-185     (-------) [002] d.h1.   149.434768:
update_load_avg: DEBUG BYHP: removed_load_sum=11198,
raw_removed_runnable_sum=11176
          <idle>-0       (-------) [006] dNs2.   149.456004:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=2826, raw_removed_runnable_sum=465
   stress-ng-cpu-184     (-------) [005] d.s1.   149.483636:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=11607, raw_removed_runnable_sum=11595
   stress-ng-cpu-186     (-------) [001] d.h1.   149.668063:
update_load_avg: DEBUG BYHP: removed_load_sum=0,
raw_removed_runnable_sum=0
          <idle>-0       (-------) [001] d.h2.   149.672477:
update_load_avg: DEBUG BYHP: removed_load_sum=0,
raw_removed_runnable_sum=0
   stress-ng-cpu-183     (-------) [007] d.s1.   149.684045:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=5657, raw_removed_runnable_sum=5458
     ksoftirqd/1-22      (-------) [001] d.s1.   149.700089:
update_load_avg: DEBUG BYHP: removed_load_sum=0,
raw_removed_runnable_sum=0
   stress-ng-cpu-183     (-------) [004] d.h1.   149.807666:
update_load_avg: DEBUG BYHP: removed_load_sum=10481,
raw_removed_runnable_sum=10474
   stress-ng-cpu-184     (-------) [000] d.h1.   149.817148:
update_load_avg: DEBUG BYHP: removed_load_sum=3,
raw_removed_runnable_sum=0
          <idle>-0       (-------) [001] d.s2.   149.866309:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=4010, raw_removed_runnable_sum=3999
   stress-ng-cpu-184     (-------) [000] d.s1.   149.914423:
sched_balance_update_blocked_averages: DEBUG BYHP:
removed_load_sum=1920, raw_removed_runnable_sum=1876




> >
> > This patch introduces removed.load_sum to directly accumulate
> > se->avg.load_sum when tasks dequeue, and uses it during load
> > propagation. By doing so:
> >
> >   a) Avoid relying on runnable_avg-based approximation and obtain
> >      higher precision in load_sum propagation;
> >   b) Eliminate precision loss from the shift operation, especially
> >      with frequent short-lived tasks;
> >   c) Reduce one multiplication and shift in the hotpath, which
> >      theoretically lowers overhead (though the impact is minor).
>
> This doesn't work because rq's load_sum tracks current weight whereas
> se's load_sum doesn't include the weight which is only applied on se's
> load_avg. So you can't directly add/sub se's load_sum and rq's
> load_sum. Only load_avg of both se and rq use the same unit.
>

I understand and agree with your point: cfs_rq->avg.load_sum includes
the weight while se->avg.load_sum does not, so the two are indeed in
different units and cannot be directly added or subtracted.

However, in this patch we DO NOT directly add or subtract
se->avg.load_sum to/from cfs_rq->avg.load_sum. Instead, the load_sum
of dequeued tasks is accumulated into cfs_rq->prop_runnable_sum.
Later, in update_tg_cfs_load, this prop_runnable_sum is used to
recompute the load_sum and load_avg of both gse and cfs_rq.

In other words, the update here is performed via recomputation rather
than direct arithmetic, so the “unit mismatch” issue you mentioned
does not occur.

update_tg_cfs_load
  |-- long delta_avg, runnable_sum = gcfs_rq->prop_runnable_sum;
  |
  |   runnable_sum += gse->avg.load_sum;
  |
  |   load_sum = se_weight(gse) * runnable_sum;
  |   load_avg = div_u64(load_sum, divider);
  |
  |   delta_avg = load_avg - gse->avg.load_avg;
  |   delta_sum = load_sum - (s64)se_weight(gse) * gse->avg.load_sum;
  |
  |-- /* Recalculate the load_sum and load_avg of gse. */
  |   gse->avg.load_sum = runnable_sum;
  |   gse->avg.load_avg = load_avg;
  |
  |-- /* Recalculate the load_sum and load_avg of cfs_rq. */
  |   add_positive(&cfs_rq->avg.load_avg, delta_avg);
  |   add_positive(&cfs_rq->avg.load_sum, delta_sum);
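For a toy numeric check of the recomputation above (all values assumed:
prop_runnable_sum = -4095 from a removed child, se_weight(gse) = 2048,
gse->avg.load_sum = 20000, divider = 47742; not kernel code):

#include <stdio.h>

int main(void)
{
	long prop = -4095;		/* removed child's unweighted load_sum */
	long gse_load_sum = 20000, gse_weight = 2048, divider = 47742;
	long gse_load_avg = gse_weight * gse_load_sum / divider;

	long runnable_sum = prop + gse_load_sum;	/* still unweighted */
	long load_sum = gse_weight * runnable_sum;	/* weight applied here */
	long load_avg = load_sum / divider;

	printf("delta_avg = %ld\n", load_avg - gse_load_avg);
	printf("delta_sum = %ld (= se_weight(gse) * prop)\n",
	       load_sum - gse_weight * gse_load_sum);
	return 0;
}

The unweighted -4095 only reaches cfs_rq->avg.load_sum after being
multiplied by se_weight(gse), so the deltas added to the parent are in
the parent's units.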




>
> Then, why is it a problem only for load and not util or runnable?
>

I broke this question into three parts and derived the related
formulas step by step.


Q1: Why doesn’t util_avg have the same problem?
=================================================
A1: Because the formulas of util_avg / util_sum are exactly the same
for both se and cfs_rq. Neither involves weight, and the units are
consistent, so direct addition and subtraction are valid.

The formulas for a sched_entity (tse) are:

             util_sum
util_avg = ------------
              divider


    decay(history) + contrib(running) * 1024
  = ----------------------------------------
                  divider


    (decay(history) + contrib(running)) * 1024
 ~= ------------------------------------------
                  divider


    decay(history) + contrib(running)
  = --------------------------------- * 1024
                  divider


Where:
util_sum = decay(history) + contrib(running) * 1024
util_avg < 1024
decay(history): represents geometric decay of the historical contribution (time)
contrib(running): contribution added when the task is in running state


For cfs_rq, the formulas of util_avg / util_sum are:

             util_sum
util_avg = ------------
              divider


    decay(history) + contrib(running) * 1024
  = ----------------------------------------
                  divider


    (decay(history) + contrib(running)) * 1024
 ~= ------------------------------------------
                      divider


    decay(history) + contrib(running)
  = --------------------------------- * 1024
                 divider


Where:
util_sum = decay(history) + contrib(running) * 1024
util_avg < 1024
decay(history): represents geometric decay of the historical contribution (time)
contrib(running): contribution added when the task is in running state


Therefore, se and cfs_rq share the same units for util_avg / util_sum,
which makes direct addition and subtraction valid. This is also why
update_tg_cfs_util performs direct subtraction when updating:

update_tg_cfs_util
  |-- /* Calculate the delta between gse and gcfs_rq directly by subtraction. */
  |   long delta_avg = gcfs_rq->avg.util_avg - se->avg.util_avg;
  |
  |-- /* Set new sched_entity's utilization */
  |   se->avg.util_avg = gcfs_rq->avg.util_avg;
  |   new_sum = se->avg.util_avg * divider;
  |   delta_sum = (long)new_sum - (long)se->avg.util_sum;
  |   se->avg.util_sum = new_sum;
  |
  |-- /* Update parent cfs_rq utilization */
  |   add_positive(&cfs_rq->avg.util_avg, delta_avg);
  |   add_positive(&cfs_rq->avg.util_sum, delta_sum);


Q2: Why doesn’t runnable_avg have the same problem?
===================================================
A2: Similarly, the runnable_avg / runnable_sum of se and cfs_rq also
do not include weight.

The calculation formulas for tse's runnable_avg and runnable_sum are as follows:

                 runnable_sum
runnable_avg = ----------------
                   divider


    decay(history) + contrib(running + runnable) * 1024
  = ---------------------------------------------------
                        divider


    (decay(history) + contrib(running + runnable)) * 1024
 ~= -----------------------------------------------------
                         divider


    decay(history) + contrib(running + runnable)
  = -------------------------------------------- * 1024
                      divider


Where:
runnable_sum = decay(history) + contrib(running + runnable) * 1024
runnable_avg < 1024
decay(history): represents geometric decay of the historical contribution (time)
contrib(running + runnable): contribution added when the task is in
running and runnable states


The calculation formulas for cfs_rq's runnable_avg and runnable_sum
are as follows:

                 runnable_sum
runnable_avg = ----------------
                   divider


    decay(history) + contrib(running + runnable) * cfs_rq->h_nr_running * 1024
  = --------------------------------------------------------------------------
                                     divider


  (decay(history) + contrib(running + runnable)) * cfs_rq->h_nr_running * 1024
 ~= --------------------------------------------------------------------------
                                   divider


    decay(history) + contrib(running + runnable)
  = -------------------------------------------- * cfs_rq->h_nr_running * 1024
                      divider


Where:
runnable_sum = decay(history) + contrib(running + runnable) *
cfs_rq->h_nr_running * 1024
runnable_avg < cfs_rq->h_nr_running * 1024
decay(history): represents geometric decay of the historical contribution (time)
contrib(running + runnable): contribution added when the task is in
running and runnable states


The runnable statistic of a cfs_rq scales with h_nr_running, the number
of runnable tasks, which can be regarded as the accumulated runnable of
all its se. Therefore, in update_tg_cfs_runnable, the update can also be
done directly by subtraction.

update_tg_cfs_runnable
  |-- /* Calculate the delta directly by subtraction. */
  |   long delta_avg = gcfs_rq->avg.runnable_avg - gse->avg.runnable_avg;
  |
  |-- /* Set new sched_entity's runnable */
  |   gse->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
  |   new_sum = gse->avg.runnable_avg * divider;
  |   delta_sum = (long)new_sum - (long)gse->avg.runnable_sum;
  |   gse->avg.runnable_sum = new_sum;
  |
  |-- /* Update parent cfs_rq runnable */
  |   add_positive(&cfs_rq->avg.runnable_avg, delta_avg);
  |   add_positive(&cfs_rq->avg.runnable_sum, delta_sum);


Q3: Why does load_avg have this problem?
========================================
A3: The key difference is that cfs_rq’s load_avg and load_sum include
weight information, while a task’s load_sum does not. Moreover, we
cannot reconstruct load_sum from a task’s load_avg.

For a task (tse), the formulas are:

             load_sum
load_avg = ------------ * se_weight(se)
              divider


   decay(history) + contrib(running + runnable)
 = -------------------------------------------- * se_weight(se)
                     divider


Where:
load_sum = decay(history) + contrib(running + runnable)
load_avg < se_weight(se)
decay(history): represents geometric decay of the historical contribution (time)
contrib(running + runnable): contribution added when the task is in
running and runnable states


For a cfs_rq, the formulas are:

             load_sum
load_avg = ------------
              divider


    decay(history) + contrib(running + runnable) * cfs_rq->load.weight
  = ------------------------------------------------------------------
                                  divider


    (decay(history) + contrib(running + runnable)) * cfs_rq->load.weight
 ~= --------------------------------------------------------------------
                                 divider


    decay(history) + contrib(running + runnable)
  = -------------------------------------------- * cfs_rq->load.weight
                      divider


Where:
load_sum = decay(history) + contrib(running + runnable) * cfs_rq->load.weight
load_avg < cfs_rq->load.weight
decay(history): represents geometric decay of the historical contribution (time)
contrib(running + runnable): contribution added when the task is in
running and runnable states


From these formulas, we can see that tse’s load_sum does not include
weight, while cfs_rq’s does. Their units differ, so they cannot be
directly added or subtracted. In addition, weight is a "historical
variable" that changes over time, which means load_sum cannot be
derived from se’s load_avg.

To propagate load_sum changes from a child cfs_rq, the upstream
implementation uses runnable_avg to APPROXIMATE load_sum. The
derivation is as follows.

For tse:

                 runnable_sum
runnable_avg = ----------------
                   divider

    decay(history) + contrib(running + runnable) * 1024
  = ---------------------------------------------------
                     divider

     decay(history) + contrib(running + runnable)
 ~= --------------------------------------------- * 1024
                     divider


Thus:

decay(history) + contrib(running + runnable)


     runnable_avg * divider
 ~= -----------------------
             1024


 ~= (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT        (1)


Equation (1) represents the contribution when a task is in running or
runnable state. The kernel refers to this as the "unweighted version
of removed_load", which is essentially time with exponential decay
applied.

It is worth noting that equation (1) happens to be equal to tse’s
load_sum. Therefore, for tse we have:

load_sum ~= (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT     (2)

However, equation (2) itself is only an approximation, and the
right-shift operation introduces further truncation error.
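For a toy calculation following equations (1)/(2) (values assumed: a
single task with load_sum = 35463 as in the first trace entry above and
a fully decayed divider of 47742; the in-kernel divider differs
slightly, so the numbers are only indicative):

#include <stdio.h>

int main(void)
{
	unsigned long divider = 47742;
	unsigned long load_sum = 35463;		/* exact value to remove */
	/* runnable_avg of a single always-runnable task, per the relation above */
	unsigned long runnable_avg = load_sum * 1024 / divider;
	unsigned long approx = (runnable_avg * divider) >> 10;	/* equation (2) */

	printf("exact=%lu approx=%lu lost=%lu\n",
	       load_sum, approx, load_sum - approx);
	return 0;
}

Both the floor in runnable_avg and the final right shift round downwards,
which is why the approximated value tends to be smaller than the exact
removed load_sum.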

My idea is that since the task’s load_sum already exists when it
dequeues, it is more reasonable to record this value directly. By
doing so, we can avoid both the approximation and the shift-induced
error. This is the motivation behind introducing removed.load_sum in
my patch.




> Also we don't want to track both load_sum and load_avg, only one is
> enough and by the above it is load_avg
>

My view is consistent with yours, but I personally lean towards
keeping load_sum and deprecating load_avg, since load_avg does not
seem to be accurate.

That said, this is only my personal perspective and may not be
comprehensive. I would like to further discuss the feasibility of this
approach with you.

Thanks.
hupu
Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation
Posted by hupu 1 week, 2 days ago
Hi Vincent Guittot and fellow maintainers,

As I mentioned in my previous email, I think there may be a
misunderstanding on your part regarding the modification in this
patch.

This change does *NOT* directly add `se->avg.load_sum` to
`cfs_rq->avg.load_sum`. In the end, it will also *NOT* result in
tracking both `load_avg` and `load_sum` simultaneously. My current
preference is to retain `load_sum` and eventually deprecate
`load_avg`, but only after we’ve confirmed through discussion that
this approach is reasonable.

The intent of this proposal lies in two key aspects, which I believe
can improve the accuracy of `cfs_rq` load statistics:
1. Avoid using `runnable_avg` to approximate `load_sum`, as such
approximations inherently introduce errors. Instead, recording the
`load_sum` of the task at the time of dequeue is more reasonable.
2. Avoid the additional error introduced by right-shift operations.

Could you please review the feasibility of this proposal again when
you have time?
Your feedback would be greatly appreciated.

Thanks and best regards,
hupu

On Fri, Sep 12, 2025 at 2:44 PM hupu <hupu.gm@gmail.com> wrote:
>
> Hi Vincent Guittot
> Thank you very much for your reply.
>
> On Thu, Sep 11, 2025 at 4:01 PM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Wed, 10 Sept 2025 at 10:43, hupu <hupu.gm@gmail.com> wrote:
> > >
> > > Currently, load_sum to be propagated is estimated from
> > > (removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
> > > runnable_avg as an approximation. This approach can introduce precision
> > > loss due to the shift operation, and the error may become more visible
> > > when small tasks frequently enter and leave the queue.
> >
> > Do you have a level of error ? Do you have a typical use case ?
> >
>
> In fact, I derived the error here from the underlying mathematical
> relationship. The error mainly comes from two sources:
> a) The approximation of load_sum using runnable_avg;
> b) The truncation introduced by the right shift (SCHED_CAPACITY_SHIFT).
>
> Below is the detailed derivation and explanation.
>
> removed_runnable records the sum of se->avg.runnable_avg for tasks
> that have migrated to another CPU. It represents the decayed
> cumulative contribution of a task’s runtime, with the unit being
> microseconds (μs). Right-shifting by SCHED_CAPACITY_SHIFT (10 bits) is
> equivalent to truncating the low part below 1024 μs. In other words,
> if a task has accumulated less than 1024 μs of runtime before
> dequeueing, its load contribution will be completely discarded by the
> shift operation. Even if the accumulated runtime exceeds 1024 μs, the
> shift may still introduce up to nearly 1024 μs of truncation error
> (about 1 ms).
>
> For example, suppose a task has accumulated 4095 μs (4.095 ms) of
> runtime on CPU0, then goes to sleep and is migrated to CPU1 upon
> wakeup. Ideally, CPU0 should remove that task’s contribution fully.
> However, after the shift, the result is 4095 >> 10 = 3 ms, which means
> CPU0 will still retain about 1023 μs (~= 1.023 ms) of the task’s
> contribution.
>
> Experimental results also confirm this. By adding debug output before
> add_tg_cfs_propagate and comparing the actual removed_load_sum (the
> true value to be removed) with the approximate value obtained through
> the shift, we observed that the latter is smaller in most cases. This
> discrepancy is exactly due to the approximation and truncation
> introduced by the shift.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> old mode 100644
> new mode 100755
> index b173a059315c..92396da04520
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4561,7 +4561,8 @@ static void migrate_se_pelt_lag(struct
> sched_entity *se) {}
>  static inline int
>  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
> -       unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> +       unsigned long removed_load_sum = 0, removed_load = 0;
> +       unsigned long removed_util = 0, removed_runnable = 0;
>         struct sched_avg *sa = &cfs_rq->avg;
>         int decayed = 0;
>
> @@ -4572,6 +4573,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                 raw_spin_lock(&cfs_rq->removed.lock);
>                 swap(cfs_rq->removed.util_avg, removed_util);
>                 swap(cfs_rq->removed.load_avg, removed_load);
> +               swap(cfs_rq->removed.load_sum, removed_load_sum);
>                 swap(cfs_rq->removed.runnable_avg, removed_runnable);
>                 cfs_rq->removed.nr = 0;
>                 raw_spin_unlock(&cfs_rq->removed.lock);
> @@ -4609,8 +4611,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                  * removed_runnable is the unweighted version of
> removed_load so we
>                  * can use it to estimate removed_load_sum.
>                  */
> -               add_tg_cfs_propagate(cfs_rq,
> -                       -(long)(removed_runnable * divider) >>
> SCHED_CAPACITY_SHIFT);
> +               trace_printk("DEBUG BYHP: removed_load_sum=%lu,
> raw_removed_runnable_sum=%lu\n",
> +                               (long)removed_load_sum,
> +                               (long)((removed_runnable * divider) >>
> SCHED_CAPACITY_SHIFT));
> +               add_tg_cfs_propagate(cfs_rq, -(long)removed_load_sum);
>
>                 decayed = 1;
>         }
> @@ -4792,6 +4796,7 @@ static void remove_entity_load_avg(struct
> sched_entity *se)
>         ++cfs_rq->removed.nr;
>         cfs_rq->removed.util_avg        += se->avg.util_avg;
>         cfs_rq->removed.load_avg        += se->avg.load_avg;
> +       cfs_rq->removed.load_sum        += se->avg.load_sum;
>         cfs_rq->removed.runnable_avg    += se->avg.runnable_avg;
>         raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index be9745d104f7..659935a5c694 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -682,6 +682,7 @@ struct cfs_rq {
>         struct {
>                 raw_spinlock_t  lock ____cacheline_aligned;
>                 int             nr;
> +               unsigned long   load_sum;
>                 unsigned long   load_avg;
>                 unsigned long   util_avg;
>                 unsigned long   runnable_avg;
>
> The logs are as follows: raw_removed_runnable_sum is often smaller
> than removed_load_sum. This difference is exactly caused by the
> approximate calculation and the truncation introduced by bit shifting.
>
>    stress-ng-cpu-183     (-------) [001] dnh1.   144.338335:
> update_load_avg: DEBUG BYHP: removed_load_sum=35463,
> raw_removed_runnable_sum=35429
>    stress-ng-cpu-184     (-------) [007] dNs1.   144.346203:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=20607, raw_removed_runnable_sum=20496
>    stress-ng-cpu-185     (-------) [001] d.h1.   144.568803:
> update_load_avg: DEBUG BYHP: removed_load_sum=0,
> raw_removed_runnable_sum=0
>    stress-ng-cpu-183     (-------) [000] d.h1.   145.526897:
> update_load_avg: DEBUG BYHP: removed_load_sum=11103,
> raw_removed_runnable_sum=11072
>    stress-ng-cpu-183     (-------) [000] d.h1.   145.563980:
> update_load_avg: DEBUG BYHP: removed_load_sum=0,
> raw_removed_runnable_sum=0
>    stress-ng-cpu-185     (-------) [002] d..2.   145.593563:
> sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=0,
> raw_removed_runnable_sum=0
>    stress-ng-cpu-181     (-------) [005] d.s1.   145.653525:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=2537, raw_removed_runnable_sum=2508
>    stress-ng-cpu-183     (-------) [003] d.s1.   145.657599:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=28510, raw_removed_runnable_sum=28473
>    stress-ng-cpu-180     (-------) [007] d.h1.   146.049167:
> update_load_avg: DEBUG BYHP: removed_load_sum=9548,
> raw_removed_runnable_sum=9526
>    stress-ng-cpu-184     (-------) [005] d.h1.   146.057200:
> update_load_avg: DEBUG BYHP: removed_load_sum=5974,
> raw_removed_runnable_sum=5963
>    stress-ng-cpu-182     (-------) [000] d.s1.   146.062025:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=55, raw_removed_runnable_sum=45
>       kcompactd0-65      (-------) [001] d..2.   146.095334:
> update_load_avg: DEBUG BYHP: removed_load_sum=0,
> raw_removed_runnable_sum=0
>       kcompactd0-65      (-------) [001] d..2.   146.095433:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=17493, raw_removed_runnable_sum=17461
>    stress-ng-cpu-186     (-------) [006] d.h1.   146.118910:
> update_load_avg: DEBUG BYHP: removed_load_sum=11404,
> raw_removed_runnable_sum=11389
>    stress-ng-cpu-186     (-------) [000] d.h1.   147.112614:
> update_load_avg: DEBUG BYHP: removed_load_sum=0,
> raw_removed_runnable_sum=0
>              cat-234     (-------) [005] d.s2.   147.161900:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=36778, raw_removed_runnable_sum=36768
>    stress-ng-cpu-181     (-------) [004] d.h1.   147.406979:
> update_load_avg: DEBUG BYHP: removed_load_sum=3029,
> raw_removed_runnable_sum=3014
>    stress-ng-cpu-185     (-------) [003] d.s1.   147.474502:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=1242, raw_removed_runnable_sum=1205
>    stress-ng-cpu-186     (-------) [000] d.h1.   147.533368:
> update_load_avg: DEBUG BYHP: removed_load_sum=11,
> raw_removed_runnable_sum=0
>    stress-ng-cpu-181     (-------) [001] d.s1.   148.341639:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=15837, raw_removed_runnable_sum=15804
>      migration/7-51      (-------) [007] d..2.   148.384219:
> sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=0,
> raw_removed_runnable_sum=0
>           <idle>-0       (-------) [004] d.s2.   148.431501:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=4292, raw_removed_runnable_sum=1924
>              cat-234     (-------) [007] d.h1.   148.434474:
> update_load_avg: DEBUG BYHP: removed_load_sum=10380,
> raw_removed_runnable_sum=9945
>    stress-ng-cpu-184     (-------) [001] d.h1.   148.853949:
> update_load_avg: DEBUG BYHP: removed_load_sum=15896,
> raw_removed_runnable_sum=15869
>    stress-ng-cpu-185     (-------) [007] d.s1.   148.862267:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=31, raw_removed_runnable_sum=0
>    stress-ng-cpu-183     (-------) [006] d.h1.   149.157805:
> update_load_avg: DEBUG BYHP: removed_load_sum=20553,
> raw_removed_runnable_sum=20527
>    stress-ng-cpu-179     (-------) [007] d.h1.   149.330189:
> update_load_avg: DEBUG BYHP: removed_load_sum=9204,
> raw_removed_runnable_sum=9177
>    stress-ng-cpu-185     (-------) [002] d.h1.   149.434768:
> update_load_avg: DEBUG BYHP: removed_load_sum=11198,
> raw_removed_runnable_sum=11176
>           <idle>-0       (-------) [006] dNs2.   149.456004:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=2826, raw_removed_runnable_sum=465
>    stress-ng-cpu-184     (-------) [005] d.s1.   149.483636:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=11607, raw_removed_runnable_sum=11595
>    stress-ng-cpu-186     (-------) [001] d.h1.   149.668063:
> update_load_avg: DEBUG BYHP: removed_load_sum=0,
> raw_removed_runnable_sum=0
>           <idle>-0       (-------) [001] d.h2.   149.672477:
> update_load_avg: DEBUG BYHP: removed_load_sum=0,
> raw_removed_runnable_sum=0
>    stress-ng-cpu-183     (-------) [007] d.s1.   149.684045:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=5657, raw_removed_runnable_sum=5458
>      ksoftirqd/1-22      (-------) [001] d.s1.   149.700089:
> update_load_avg: DEBUG BYHP: removed_load_sum=0,
> raw_removed_runnable_sum=0
>    stress-ng-cpu-183     (-------) [004] d.h1.   149.807666:
> update_load_avg: DEBUG BYHP: removed_load_sum=10481,
> raw_removed_runnable_sum=10474
>    stress-ng-cpu-184     (-------) [000] d.h1.   149.817148:
> update_load_avg: DEBUG BYHP: removed_load_sum=3,
> raw_removed_runnable_sum=0
>           <idle>-0       (-------) [001] d.s2.   149.866309:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=4010, raw_removed_runnable_sum=3999
>    stress-ng-cpu-184     (-------) [000] d.s1.   149.914423:
> sched_balance_update_blocked_averages: DEBUG BYHP:
> removed_load_sum=1920, raw_removed_runnable_sum=1876
>
>
>
>
> > >
> > > This patch introduces removed.load_sum to directly accumulate
> > > se->avg.load_sum when tasks dequeue, and uses it during load
> > > propagation. By doing so:
> > >
> > >   a) Avoid relying on runnable_avg-based approximation and obtain
> > >      higher precision in load_sum propagation;
> > >   b) Eliminate precision loss from the shift operation, especially
> > >      with frequent short-lived tasks;
> > >   c) Reduce one multiplication and shift in the hotpath, which
> > >      theoretically lowers overhead (though the impact is minor).
> >
> > This doesn't work because rq's load_sum tracks current weight whereas
> > se's load_sum doesn't include the weight which is only applied on se's
> > load_avg. So you can't directly add/sub se's load_sum and rq's
> > load_sum. Only load_avg of both se and rq use the same unit.
> >
>
> I understand and agree with your point: cfs_rq->avg.load_sum includes
> the weight while se->avg.load_sum does not, so the two are indeed in
> different units and cannot be directly added or subtracted.
>
> However, in this patch we DO NOT directly add or subtract
> se->avg.load_sum to/from cfs_rq->avg.load_sum. Instead, the load_sum
> of dequeued tasks is accumulated into cfs_rq->prop_runnable_sum.
> Later, in update_tg_cfs_load, this prop_runnable_sum is used to
> recompute the load_sum and load_avg of both gse and cfs_rq.
>
> In other words, the update here is performed via recomputation rather
> than direct arithmetic, so the “unit mismatch” issue you mentioned
> does not occur.
>
> update_tg_cfs_load
>   |-- long delta_avg, runnable_sum = gcfs_rq->prop_runnable_sum;
>   |
>   |   runnable_sum += gse->avg.load_sum;
>   |
>   |   load_sum = se_weight(gse) * runnable_sum;
>   |   load_avg = div_u64(load_sum, divider);
>   |
>   |   delta_avg = load_avg - gse->avg.load_avg;
>   |   delta_sum = load_sum - (s64)se_weight(gse) * gse->avg.load_sum;
>   |
>   |-- /* Recalculate the load_sum and load_avg of gse. */
>   |   gse->avg.load_sum = runnable_sum;
>   |   gse->avg.load_avg = load_avg;
>   |
>   |-- /* Recalculate the load_sum and load_avg of cfs_rq. */
>   |   add_positive(&cfs_rq->avg.load_avg, delta_avg);
>   |   add_positive(&cfs_rq->avg.load_sum, delta_sum);
>
>
>
>
> >
> > Then, why is it a problem only for load and not util or runnable ?
> >
>
> I broke this question into three parts and derived the related
> formulas step by step.
>
>
> Q1: Why doesn’t util_avg have the same problem?
> =================================================
> A1: Because the formulas of util_avg / util_sum are exactly the same
> for both se and cfs_rq. Neither involves weight, and the units are
> consistent, so direct addition and subtraction are valid.
>
> The formulas for a sched_entity (tse) are:
>
>              util_sum
> util_avg = ------------
>               divider
>
>
>     decay(history) + contrib(running) * 1024
>   = ----------------------------------------
>                   divider
>
>
>     (decay(history) + contrib(running)) * 1024
>  ~= ------------------------------------------
>                   divider
>
>
>     decay(history) + contrib(running)
>   = --------------------------------- * 1024
>                   divider
>
>
> Where:
> util_sum = decay(history) + contrib(running) * 1024
> util_avg < 1024
> decay(history): represents geometric decay of the historical contribution (time)
> contrib(running): contribution added when the task is in running state
>
>
> For cfs_rq, the formulas of util_avg / util_sum are:
>
>              util_sum
> util_avg = ------------
>               divider
>
>
>     decay(history) + contrib(running) * 1024
>   = ----------------------------------------
>                   divider
>
>
>     (decay(history) + contrib(running)) * 1024
>  ~= ------------------------------------------
>                       divider
>
>
>     decay(history) + contrib(running)
>   = --------------------------------- * 1024
>                  divider
>
>
> Where:
> util_sum = decay(history) + contrib(running) * 1024
> util_avg < 1024
> decay(history): represents geometric decay of the historical contribution (time)
> contrib(running): contribution added when the task is in running state
>
>
> Therefore, se and cfs_rq share the same units for util_avg / util_sum,
> which makes direct addition and subtraction valid. This is also why
> update_tg_cfs_util performs direct subtraction when updating:
>
> update_tg_cfs_util
>   |-- /* Calculate the delta between gse and gcfs_rq directly by subtraction. */
>   |   long delta_avg = gcfs_rq->avg.util_avg - se->avg.util_avg;
>   |
>   |-- /* Set new sched_entity's utilization */
>   |   se->avg.util_avg = gcfs_rq->avg.util_avg;
>   |   new_sum = se->avg.util_avg * divider;
>   |   delta_sum = (long)new_sum - (long)se->avg.util_sum;
>   |   se->avg.util_sum = new_sum;
>   |
>   |-- /* Update parent cfs_rq utilization */
>   |   add_positive(&cfs_rq->avg.util_avg, delta_avg);
>   |   add_positive(&cfs_rq->avg.util_sum, delta_sum);
>
>
> Q2: Why doesn’t runnable_avg have the same problem?
> ===================================================
> A2: Similarly, the runnable_avg / runnable_sum of se and cfs_rq also
> do not include weight.
>
> The calculation formulas for tse's runnable_avg and runnable_sum are as follows:
>
>                  runnable_sum
> runnable_avg = ----------------
>                    divider
>
>
>     decay(history) + contrib(running + runnable) * 1024
>   = ---------------------------------------------------
>                         divider
>
>
>     (decay(history) + contrib(running + runnable)) * 1024
>  ~= -----------------------------------------------------
>                          divider
>
>
>     decay(history) + contrib(running + runnable)
>   = -------------------------------------------- * 1024
>                       divider
>
>
> Where:
> runnable_sum = decay(history) + contrib(running + runnable) * 1024
> runnable_avg < 1024
> decay(history): represents geometric decay of the historical contribution (time)
> contrib(running + runnable): contribution added when the task is in
> running and runnable states
>
>
> The calculation formulas for cfs_rq's runnable_avg and runnable_sum
> are as follows:
>
>                  runnable_sum
> runnable_avg = ----------------
>                    divider
>
>
>     decay(history) + contrib(running + runnable) * cfs_rq->h_nr_running * 1024
>   = --------------------------------------------------------------------------
>                                      divider
>
>
>   (decay(history) + contrib(running + runnable)) * cfs_rq->h_nr_running * 1024
>  ~= --------------------------------------------------------------------------
>                                    divider
>
>
>     decay(history) + contrib(running + runnable)
>   = -------------------------------------------- * cfs_rq->h_nr_running * 1024
>                       divider
>
>
> Where:
> runnable_sum = decay(history) + contrib(running + runnable) *
> cfs_rq->h_nr_running * 1024
> runnable_avg < cfs_rq->h_nr_running * 1024
> decay(history): represents geometric decay of the historical contribution (time)
> contrib(running + runnable): contribution added when the task is in
> running and runnable states
>
>
> The runnable statistic of cfs_rq is represented by h_nr_running, which
> indicates the number of tasks and can be regarded as the accumulated
> runnable of all se. Therefore, in update_tg_cfs_runnable, the update
> can also be done directly using subtraction.
>
> update_tg_cfs_runnable
>   |-- /* Calculate the delta directly by subtraction. */
>   |   long delta_avg = gcfs_rq->avg.runnable_avg - gse->avg.runnable_avg;
>   |
>   |-- /* Set new sched_entity's runnable */
>   |   gse->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
>   |   new_sum = gse->avg.runnable_avg * divider;
>   |   delta_sum = (long)new_sum - (long)gse->avg.runnable_sum;
>   |   gse->avg.runnable_sum = new_sum;
>   |
>   |-- /* Update parent cfs_rq runnable */
>   |   add_positive(&cfs_rq->avg.runnable_avg, delta_avg);
>   |   add_positive(&cfs_rq->avg.runnable_sum, delta_sum);
>
>
> Q3: Why does load_avg have this problem?
> ========================================
> A3: The key difference is that cfs_rq’s load_avg and load_sum include
> weight information, while a task’s load_sum does not. Moreover, we
> cannot reconstruct load_sum from a task’s load_avg.
>
> For a task (tse), the formulas are:
>
>              load_sum
> load_avg = ------------ * se_weight(se)
>               divider
>
>
>    decay(history) + contrib(running + runnable)
>  = -------------------------------------------- * se_weight(se)
>                      divider
>
>
> Where:
> load_sum = decay(history) + contrib(running + runnable)
> load_avg < se_weight(se)
> decay(history): represents geometric decay of the historical contribution (time)
> contrib(running + runnable): contribution added when the task is in
> running and runnable states
>
>
> For a cfs_rq, the formulas are:
>
>              load_sum
> load_avg = ------------
>               divider
>
>
>     decay(history) + contrib(running + runnable) * cfs_rq->load.weight
>   = ------------------------------------------------------------------
>                                   divider
>
>
>     (decay(history) + contrib(running + runnable)) * cfs_rq->load.weight
>  ~= --------------------------------------------------------------------
>                                  divider
>
>
>     decay(history) + contrib(running + runnable)
>   = -------------------------------------------- * cfs_rq->load.weight
>                       divider
>
>
> Where:
> load_sum = decay(history) + contrib(running + runnable) * cfs_rq->load.weight
> load_avg < cfs_rq->load.weight
> decay(history): represents geometric decay of the historical contribution (time)
> contrib(running + runnable): contribution added when the task is in
> running and runnable states
>
>
> From these formulas, we can see that tse’s load_sum does not include
> weight, while cfs_rq’s does. Their units differ, so they cannot be
> directly added or subtracted. In addition, weight is a "historical
> variable" that changes over time, which means load_sum cannot be
> derived from se’s load_avg.
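>
> To make the unit mismatch concrete, here is a small userspace sketch with
> assumed numbers (one nice-0 entity with weight 1024 and 10000 units of
> decayed running+runnable time); it only illustrates the formulas above and
> is not kernel code:
>
> #include <stdio.h>
>
> int main(void)
> {
>         unsigned long divider = 47742;   /* assumed PELT divider           */
>         unsigned long weight  = 1024;    /* nice-0 weight (assumed)        */
>         unsigned long contrib = 10000;   /* decayed running+runnable time  */
>
>         /* tse: load_sum is unweighted time; weight only enters load_avg. */
>         unsigned long se_load_sum = contrib;                    /* 10000     */
>         unsigned long se_load_avg = contrib * weight / divider; /* ~214      */
>
>         /* cfs_rq: load_sum already carries the weight. */
>         unsigned long rq_load_sum = contrib * weight;           /* 10240000  */
>
>         /*
>          * se_load_sum and rq_load_sum differ by the weight factor, so
>          * adding or subtracting them directly would mix units.
>          */
>         printf("se_load_sum=%lu se_load_avg=%lu rq_load_sum=%lu\n",
>                se_load_sum, se_load_avg, rq_load_sum);
>         return 0;
> }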
>
> To propagate load_sum changes from a child cfs_rq, the upstream
> implementation uses runnable_avg to APPROXIMATE load_sum. The
> derivation is as follows.
>
> For tse:
>
>                  runnable_sum
> runnable_avg = ----------------
>                    divider
>
>     decay(history) + contrib(running + runnable) * 1024
>   = ---------------------------------------------------
>                      divider
>
>      decay(history) + contrib(running + runnable)
>  ~= --------------------------------------------- * 1024
>                      divider
>
>
> Thus:
>
> decay(history) + contrib(running + runnable)
>
>
>      runnable_avg * divider
>  ~= -----------------------
>              1024
>
>
>  ~= (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT        (1)
>
>
> Equation (1) represents the contribution when a task is in running or
> runnable state. The kernel refers to this as the "unweighted version
> of removed_load", which is essentially time with exponential decay
> applied.
>
> It is worth noting that expression (1) matches the definition of tse's
> load_sum given above. Therefore, for tse we have:
>
> load_sum ~= (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT     (2)
>
> However, equation (2) itself is only an approximation, and the
> right-shift operation introduces further truncation error.
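>
> The effect can be seen with a small userspace sketch (the numbers below
> are assumed example values, not measurements from the kernel path):
>
> #include <stdio.h>
>
> #define SCHED_CAPACITY_SHIFT 10
>
> int main(void)
> {
>         unsigned long divider  = 47742;  /* assumed PELT divider            */
>         unsigned long load_sum = 4095;   /* exact decayed time at dequeue   */
>
>         /* runnable_avg per the derivation above: ~load_sum * 1024 / divider */
>         unsigned long runnable_avg = load_sum * 1024 / divider;          /* 87 */
>
>         /* The estimate currently used for propagation, i.e. equation (2). */
>         unsigned long estimate = (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT;
>
>         printf("exact=%lu estimate=%lu lost=%lu\n",
>                load_sum, estimate, load_sum - estimate);   /* 4095 4056 39 */
>         return 0;
> }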
>
> My idea is that since the task’s load_sum already exists when it
> dequeues, it is more reasonable to record this value directly. By
> doing so, we can avoid both the approximation and the shift-induced
> error. This is the motivation behind introducing removed.load_sum in
> my patch.
>
>
>
>
> > Also we don't want to track both load_sum and load_avg, only one is
> > enough and by the above it is load_avg
> >
>
> I agree that only one of them should be tracked, but I personally lean
> towards keeping load_sum and deprecating load_avg, since load_avg does
> not seem to be as accurate.
>
> That said, this is only my personal perspective and may not be
> comprehensive. I would like to further discuss the feasibility of this
> approach with you.
>
> Thanks.
> hupu
Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation
Posted by Vincent Guittot 1 week, 2 days ago
Hi hupu,

I have been busy with other things so far and haven't been able to look
at your reply, but it's on my todo list.

Vincent

On Tue, 23 Sept 2025 at 13:17, hupu <hupu.gm@gmail.com> wrote:
>
> Hi Vincent Guittot and fellow maintainers,
>
> As I mentioned in my previous email, I think there may be a
> misunderstanding on your part regarding the modification in this
> patch.
>
> This change does *NOT* directly add `se->avg.load_sum` to
> `cfs_rq->avg.load_sum`. In the end, it will also *NOT* result in
> tracking both `load_avg` and `load_sum` simultaneously. My current
> preference is to retain `load_sum` and eventually deprecate
> `load_avg`, but only after we’ve confirmed through discussion that
> this approach is reasonable.
>
> The intent of this proposal lies in two key aspects, which I believe
> can improve the accuracy of `cfs_rq` load statistics:
> 1. Avoid using `runnable_avg` to approximate `load_sum`, as such
> approximations inherently introduce errors. Instead, recording the
> `load_sum` of the task at the time of dequeue is more reasonable.
> 2. Avoid the additional error introduced by right-shift operations.
>
> Could you please review the feasibility of this proposal again when
> you have time?
> Your feedback would be greatly appreciated.
>
> Thanks and best regards,
> hupu
Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation
Posted by hupu 1 week, 3 days ago
RESEND AGAIN

Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation
Posted by hupu 2 weeks, 2 days ago
Hi all,

Just resending this patch for review.
Any feedback would be appreciated.

Thanks,
hupu
Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation
Posted by hupu 1 week, 6 days ago
Resending for your review and discussion.

Thanks.