The problem with rq->cfs.avg.util_avg_uclamp is that it only tracks the
sum of the contributions of CFS tasks currently on the rq. However, CFS
tasks that belong to a CPU but have just been dequeued from rq->cfs
still contribute decaying utilization to the rq due to PELT. Introduce
root_cfs_util_uclamp to capture the total utilization of CFS tasks both
on and off this rq.
Theoretically, keeping track of the sum of all tasks on a CPU (either on
or off the rq) requires us to periodically sample the decaying PELT
utilization of all off-rq tasks and then sum them up, which introduces
substantial extra code and overhead. However, we can avoid this
overhead, as shown in the following example.
Let's assume 3 tasks, A, B and C. A is still on rq->cfs but B and C have
just been dequeued. The cfs.avg.util_avg_uclamp has dropped from A + B +
C to just A, but the instantaneous utilization has only just started to
decay and is still A + B + C. Let's denote root_cfs_util_uclamp_old as
the instantaneous total utilization right before B and C are dequeued.
After p periods, with y being the decay factor, the new
root_cfs_util_uclamp becomes:

  root_cfs_util_uclamp
    = A + B * y^p + C * y^p
    = A + (A + B + C - A) * y^p
    = cfs.avg.util_avg_uclamp +
      (root_cfs_util_uclamp_old - cfs.avg.util_avg_uclamp) * y^p
    = cfs.avg.util_avg_uclamp + diff * y^p
So, whenever we want to calculate the new root_cfs_util_uclamp
(covering both on- and off-rq CFS tasks of a CPU), we can simply decay
the diff between root_cfs_util_uclamp and cfs.avg.util_avg_uclamp, and
add the decayed diff to cfs.avg.util_avg_uclamp to obtain the new
root_cfs_util_uclamp, without periodically sampling off-rq CFS tasks
and summing them up. This significantly reduces the overhead needed to
maintain this signal and makes sure we also include the decaying
contributions of dequeued CFS tasks.
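
In code form, the update rule above boils down to roughly the following
sketch (simplified; the function name is illustrative and the real
implementation is ___update_util_uclamp_towards() in the patch below):

  /*
   * 'periods' is the number of whole PELT periods since the last update
   * and decay_load(x, p) computes x * y^p.
   */
  static unsigned int new_root_cfs_util_uclamp(unsigned int root_old,
                                               unsigned int cfs_sum,
                                               u64 periods)
  {
          u64 diff;

          if (root_old <= cfs_sum)
                  return cfs_sum;            /* never below the on-rq sum */

          diff = root_old - cfs_sum;         /* contribution of off-rq tasks */
          return cfs_sum + decay_load(diff, periods);
  }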
NOTE: In no way do we change how PELT and util_avg work. The original
PELT signal is kept as-is and is used when needed. The new signals,
util_avg_uclamp and root_cfs_util_uclamp, are additional hints to the
scheduler and are not meant to replace the original PELT signals.
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
kernel/sched/fair.c | 7 +++
kernel/sched/pelt.c | 106 +++++++++++++++++++++++++++++++++++++++----
kernel/sched/pelt.h | 3 +-
kernel/sched/sched.h | 16 +++++++
4 files changed, 123 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f535c96463b..36357cfaf48d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6710,6 +6710,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
+ bool __maybe_unused migrated = p->se.avg.last_update_time == 0;
/*
* The code below (indirectly) updates schedutil which looks at
@@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
#ifdef CONFIG_UCLAMP_TASK
util_uclamp_enqueue(&rq->cfs.avg, p);
update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
+ if (migrated)
+ rq->root_cfs_util_uclamp += p->se.avg.util_avg_uclamp;
+ rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp,
+ rq->cfs.avg.util_avg_uclamp);
/* TODO: Better skip the frequency update in the for loop above. */
cpufreq_update_util(rq, 0);
#endif
@@ -8252,6 +8257,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
migrate_se_pelt_lag(se);
}
+ remove_root_cfs_util_uclamp(p);
/* Tell new CPU we are migrated */
se->avg.last_update_time = 0;
@@ -8261,6 +8267,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
static void task_dead_fair(struct task_struct *p)
{
remove_entity_load_avg(&p->se);
+ remove_root_cfs_util_uclamp(p);
}
static int
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index eca45a863f9f..9ba208ac26db 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -267,14 +267,78 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
}
#ifdef CONFIG_UCLAMP_TASK
+static int ___update_util_uclamp_towards(u64 now,
+ u64 last_update_time,
+ u32 period_contrib,
+ unsigned int *old,
+ unsigned int new_val)
+{
+ unsigned int old_val = READ_ONCE(*old);
+ u64 delta, periods;
+
+ if (old_val <= new_val) {
+ WRITE_ONCE(*old, new_val);
+ return old_val < new_val;
+ }
+
+ if (!last_update_time)
+ return 0;
+ delta = now - last_update_time;
+ if ((s64)delta < 0)
+ return 0;
+ delta >>= 10;
+ if (!delta)
+ return 0;
+
+ delta += period_contrib;
+ periods = delta / 1024;
+ if (periods) {
+ u64 diff = old_val - new_val;
+
+ /*
+ * Let's assume 3 tasks, A, B and C. A is still on rq but B and
+ * C have just been dequeued. The cfs.avg.util_avg_uclamp has
+ * become A but root_cfs_util_uclamp just starts to decay and is
+ * now still A + B + C.
+ *
+ * After p periods with y being the decay factor, the new
+ * root_cfs_util_uclamp should become
+ *
+ * A + B * y^p + C * y^p == A + (A + B + C - A) * y^p
+ * == cfs.avg.util_avg_uclamp +
+ * (root_cfs_util_uclamp_at_the_start - cfs.avg.util_avg_uclamp) * y^p
+ * == cfs.avg.util_avg_uclamp + diff * y^p
+ *
+ * So, instead of summing up each individual decayed values, we
+ * could just decay the diff and not bother with the summation
+ * at all. This is why we decay the diff here.
+ */
+ diff = decay_load(diff, periods);
+ WRITE_ONCE(*old, new_val + diff);
+ return old_val != *old;
+ }
+
+ return 0;
+}
+
/* avg must belong to the queue this se is on. */
-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+void update_util_uclamp(u64 now,
+ u64 last_update_time,
+ u32 period_contrib,
+ struct sched_avg *avg,
+ struct task_struct *p)
{
unsigned int util, uclamp_min, uclamp_max;
int delta;
- if (!p->se.on_rq)
+ if (!p->se.on_rq) {
+ ___update_util_uclamp_towards(now,
+ last_update_time,
+ period_contrib,
+ &p->se.avg.util_avg_uclamp,
+ 0);
return;
+ }
if (!avg)
return;
@@ -294,7 +358,11 @@ void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
WRITE_ONCE(avg->util_avg_uclamp, util);
}
#else /* !CONFIG_UCLAMP_TASK */
-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+void update_util_uclamp(u64 now,
+ u64 last_update_time,
+ u32 period_contrib,
+ struct sched_avg *avg,
+ struct task_struct *p)
{
}
#endif
@@ -327,23 +395,32 @@ void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
void __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
{
+ u64 last_update_time = se->avg.last_update_time;
+ u32 period_contrib = se->avg.period_contrib;
+
if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
___update_load_avg(&se->avg, se_weight(se));
if (entity_is_task(se))
- update_util_uclamp(NULL, task_of(se));
+ update_util_uclamp(now, last_update_time,
+ period_contrib, NULL, task_of(se));
trace_pelt_se_tp(se);
}
}
void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ u64 last_update_time = se->avg.last_update_time;
+ u32 period_contrib = se->avg.period_contrib;
+
if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
cfs_rq->curr == se)) {
___update_load_avg(&se->avg, se_weight(se));
cfs_se_util_change(&se->avg);
if (entity_is_task(se))
- update_util_uclamp(&rq_of(cfs_rq)->cfs.avg,
+ update_util_uclamp(now, last_update_time,
+ period_contrib,
+ &rq_of(cfs_rq)->cfs.avg,
task_of(se));
trace_pelt_se_tp(se);
}
@@ -351,17 +428,30 @@ void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *s
int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
{
+ u64 __maybe_unused last_update_time = cfs_rq->avg.last_update_time;
+ u32 __maybe_unused period_contrib = cfs_rq->avg.period_contrib;
+ int ret = 0;
+
if (___update_load_sum(now, &cfs_rq->avg,
scale_load_down(cfs_rq->load.weight),
cfs_rq->h_nr_running,
cfs_rq->curr != NULL)) {
___update_load_avg(&cfs_rq->avg, 1);
- trace_pelt_cfs_tp(cfs_rq);
- return 1;
+ ret = 1;
}
- return 0;
+#ifdef CONFIG_UCLAMP_TASK
+ if (&rq_of(cfs_rq)->cfs == cfs_rq)
+ ret = ___update_util_uclamp_towards(now,
+ last_update_time, period_contrib,
+ &rq_of(cfs_rq)->root_cfs_util_uclamp,
+ READ_ONCE(cfs_rq->avg.util_avg_uclamp));
+#endif
+ if (ret)
+ trace_pelt_cfs_tp(cfs_rq);
+
+ return ret;
}
/*
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 6862f79e0fcd..a2852d5e862d 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -1,7 +1,8 @@
#ifdef CONFIG_SMP
#include "sched-pelt.h"
-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p);
+void update_util_uclamp(u64 now, u64 last_update_time, u32 period_contrib,
+ struct sched_avg *avg, struct task_struct *p);
void __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 35036246824b..ce80b87b549b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -998,6 +998,7 @@ struct rq {
/* Utilization clamp values based on CPU's RUNNABLE tasks */
struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
unsigned int uclamp_flags;
+ unsigned int root_cfs_util_uclamp;
#define UCLAMP_FLAG_IDLE 0x01
#endif
@@ -3112,6 +3113,17 @@ static inline void util_uclamp_dequeue(struct sched_avg *avg,
WRITE_ONCE(avg->util_avg_uclamp, new_val);
}
+
+static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+ unsigned int root_util = READ_ONCE(rq->root_cfs_util_uclamp);
+ unsigned int p_util = READ_ONCE(p->se.avg.util_avg_uclamp), new_util;
+
+ new_util = (root_util > p_util) ? root_util - p_util : 0;
+ new_util = max(new_util, READ_ONCE(rq->cfs.avg.util_avg_uclamp));
+ WRITE_ONCE(rq->root_cfs_util_uclamp, new_util);
+}
#else /* CONFIG_UCLAMP_TASK */
static inline unsigned long uclamp_eff_value(struct task_struct *p,
enum uclamp_id clamp_id)
@@ -3147,6 +3159,10 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
{
return false;
}
+
+static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
+{
+}
#endif /* CONFIG_UCLAMP_TASK */
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
--
2.34.1
On 01/02/2024 14:11, Hongyan Xia wrote:
[...]
> /*
> * The code below (indirectly) updates schedutil which looks at
> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> #ifdef CONFIG_UCLAMP_TASK
> util_uclamp_enqueue(&rq->cfs.avg, p);
> update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
> + if (migrated)
IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
if (flags & ENQUEUE_MIGRATED)
> + rq->root_cfs_util_uclamp += p->se.avg.util_avg_uclamp;
> + rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp,
> + rq->cfs.avg.util_avg_uclamp);
> /* TODO: Better skip the frequency update in the for loop above. */
> cpufreq_update_util(rq, 0);
> #endif
> @@ -8252,6 +8257,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> migrate_se_pelt_lag(se);
> }
>
> + remove_root_cfs_util_uclamp(p);
You can't always do this here. In the '!task_on_rq_migrating()' case we
don't hold the 'old' rq->lock.
Have a look into remove_entity_load_avg() for what we do for PELT in
this case.
And:
144d8487bc6e ("sched/fair: Implement synchonous PELT detach on load-balance migrate")
e1f078f50478 ("sched/fair: Combine detach into dequeue when migrating task")
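For illustration only, a deferred removal along those lines could look
roughly like this (the 'removed.util_avg_uclamp' field is hypothetical
and does not exist in this series):

  static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
  {
          struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
          unsigned long flags;

          /*
           * Record the contribution under removed.lock and fold it into
           * rq->root_cfs_util_uclamp later, e.g. from
           * update_cfs_rq_load_avg(), where the rq lock is held.
           */
          raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags);
          cfs_rq->removed.util_avg_uclamp += READ_ONCE(p->se.avg.util_avg_uclamp);
          raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
  }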
@@ -3081,6 +3081,8 @@ static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
unsigned int root_util = READ_ONCE(rq->root_cfs_util_uclamp);
unsigned int p_util = READ_ONCE(p->se.avg.util_avg_uclamp), new_util;
+ lockdep_assert_rq_held(task_rq(p));
+
new_util = (root_util > p_util) ? root_util - p_util : 0;
new_util = max(new_util, READ_ONCE(rq->cfs.avg.util_avg_uclamp));
WRITE_ONCE(rq->root_cfs_util_uclamp, new_util);
[...]
> /* avg must belong to the queue this se is on. */
> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
> +void update_util_uclamp(u64 now,
> + u64 last_update_time,
> + u32 period_contrib,
> + struct sched_avg *avg,
> + struct task_struct *p)
> {
I was wondering why you use such a long parameter list for this
function.
IMHO
update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
would work as well. You could check whether se represents a task inside
update_util_uclamp() as well as get last_update_time and period_contrib.
The only reason I see is that you want to use this function for the RT
class as well later, where you have to deal with 'struct rt_rq' and
'struct sched_rt_entity'.
IMHO, it's always better to keep the implementation to the minimum and
only introduce changes which are related to the functionality you
present. This would make reviewing so much easier.
> unsigned int util, uclamp_min, uclamp_max;
> int delta;
>
> - if (!p->se.on_rq)
> + if (!p->se.on_rq) {
> + ___update_util_uclamp_towards(now,
> + last_update_time,
> + period_contrib,
> + &p->se.avg.util_avg_uclamp,
> + 0);
> return;
> + }
You decay 'p->se.avg.util_avg_uclamp' which is not really related to
root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.
This is the util_avg_uclamp handling for a se (task):
enqueue_task_fair()
  util_uclamp_enqueue()
  update_util_uclamp()                  (1)
    if (!p->se.on_rq)                   (x)
      ___update_util_uclamp_towards()   (2)

dequeue_task_fair()
  util_uclamp_dequeue()

__update_load_avg_blocked_se()
  update_util_uclamp()
    (x)

__update_load_avg_se()
  update_util_uclamp()
    (x)
Why is it so unbalanced? Why do you need (1) and (2)?
Isn't this just an indication that the se util_avg_uclamp
is done at the wrong places?
Is there another way to provide a task/rq signal as the base
for uclamp sum aggregation?
[...]
On 18/03/2024 18:21, Dietmar Eggemann wrote:
> On 01/02/2024 14:11, Hongyan Xia wrote:
>
> [...]
>
>> /*
>> * The code below (indirectly) updates schedutil which looks at
>> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> #ifdef CONFIG_UCLAMP_TASK
>> util_uclamp_enqueue(&rq->cfs.avg, p);
>> update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
>> + if (migrated)
>
> IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
>
> if (flags & ENQUEUE_MIGRATED)
I'm not sure if they are entirely equivalent. Both
task_on_rq_migrating() and !task_on_rq_migrating() can have
last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
Maybe I'm missing something?
>> + rq->root_cfs_util_uclamp += p->se.avg.util_avg_uclamp;
>> + rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp,
>> + rq->cfs.avg.util_avg_uclamp);
>> /* TODO: Better skip the frequency update in the for loop above. */
>> cpufreq_update_util(rq, 0);
>> #endif
>> @@ -8252,6 +8257,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>> migrate_se_pelt_lag(se);
>> }
>>
>> + remove_root_cfs_util_uclamp(p);
>
> You can't always do this here. In the '!task_on_rq_migrating()' case we
> don't hold the 'old' rq->lock.
>
> Have a look into remove_entity_load_avg() for what we do for PELT in
> this case.
>
> And:
>
> 144d8487bc6e ("sched/fair: Implement synchonous PELT detach on load-balance migrate")
> e1f078f50478 ("sched/fair: Combine detach into dequeue when migrating task")
>
> @@ -3081,6 +3081,8 @@ static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
> unsigned int root_util = READ_ONCE(rq->root_cfs_util_uclamp);
> unsigned int p_util = READ_ONCE(p->se.avg.util_avg_uclamp), new_util;
>
> + lockdep_assert_rq_held(task_rq(p));
> +
> new_util = (root_util > p_util) ? root_util - p_util : 0;
> new_util = max(new_util, READ_ONCE(rq->cfs.avg.util_avg_uclamp));
> WRITE_ONCE(rq->root_cfs_util_uclamp, new_util);
Ack. I saw the removed_* functions. I will change to that style.
> [...]
>
>> /* avg must belong to the queue this se is on. */
>> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
>> +void update_util_uclamp(u64 now,
>> + u64 last_update_time,
>> + u32 period_contrib,
>> + struct sched_avg *avg,
>> + struct task_struct *p)
>> {
>
> I was wondering why you use such a long parameter list for this
> function.
>
> IMHO
>
> update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
>
> would work as well. You could check whether se represents a task inside
> update_util_uclamp() as well as get last_update_time and period_contrib.
>
> The only reason I see is that you want to use this function for the RT
> class as well later, where you have to deal with 'struct rt_rq' and
> 'struct sched_rt_entity'.
>
> IMHO, it's always better to keep the implementation to the minimum and
> only introduce changes which are related to the functionality you
> present. This would make reviewing so much easier.
Those parameters are necessary because of
if (___update_load_sum()) {
        ___update_load_avg();
        update_util_uclamp();
}
We have to cache last_update_time and period_contrib, because after
___update_load_sum() is done, both parameters in sched_avg have already
been updated and overwritten and we lose the timestamp when the
sched_avg was previously updated. update_util_uclamp() needs to know
when sched_avg was previously updated.
>
>> unsigned int util, uclamp_min, uclamp_max;
>> int delta;
>>
>> - if (!p->se.on_rq)
>> + if (!p->se.on_rq) {
>> + ___update_util_uclamp_towards(now,
>> + last_update_time,
>> + period_contrib,
>> + &p->se.avg.util_avg_uclamp,
>> + 0);
>> return;
>> + }
>
> You decay 'p->se.avg.util_avg_uclamp' which is not really related to
> root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.
I would say this still belongs to 3/7, because 2/7 only implements
utilization for on_rq tasks. This patch implements utilization for both
on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).
For tasks, we also have two utilization numbers, one is on_rq and the
other is on_rq plus !on_rq. However, we know they do not have to be
stored in separate variables and util_avg_uclamp can capture both.
Moving this to 2/7 might be fine, although then this would be the only
!on_rq utilization in 2/7. I can add comments to clarify the situation.
> This is the util_avg_uclamp handling for a se (task):
>
> enqueue_task_fair()
>   util_uclamp_enqueue()
>   update_util_uclamp()                  (1)
>     if (!p->se.on_rq)                   (x)
>       ___update_util_uclamp_towards()   (2)
>
> dequeue_task_fair()
>   util_uclamp_dequeue()
>
> __update_load_avg_blocked_se()
>   update_util_uclamp()
>     (x)
>
> __update_load_avg_se()
>   update_util_uclamp()
>     (x)
>
> Why is it so unbalanced? Why do you need (1) and (2)?
>
> Isn't this just an indication that the se util_avg_uclamp
> is done at the wrong places?
>
> Is there another way to provide a task/rq signal as the base
> for uclamp sum aggregation?
(2) won't happen, because at that point p->se.on_rq must be 1.
The sequence during enqueue_task_fair() is:
enqueue_task_fair(p)
  enqueue_entity(se)
    update_load_avg(se)
      update_util_uclamp(p)             (decay path)
    p->se.on_rq = 1;
  util_uclamp_enqueue(p)
  update_util_uclamp(p)                 (update path)
The only reason why we want to issue update_util_uclamp() after seeing
on_rq == 1 is that now it goes down the normal uclamp path and not the
decay path. Otherwise, uclamp won't immediately engage during enqueue
and has to wait a timer tick.
Ideally, we should:
enqueue_task_fair(p)
  enqueue_entity(se)
    update_load_avg(se)
    util_uclamp_enqueue(p)
    update_util_uclamp(p)               (force update path)
    p->se.on_rq = 1;
This requires us to invent a new flag to update_util_uclamp() to force
the update path even when p->se.on_rq is 0.
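For example (sketch only; the UTIL_UCLAMP_FORCE flag name is made up and
not part of this series):

  void update_util_uclamp(u64 now, u64 last_update_time, u32 period_contrib,
                          struct sched_avg *avg, struct task_struct *p,
                          int flags)
  {
          if (!p->se.on_rq && !(flags & UTIL_UCLAMP_FORCE)) {
                  /* decay path, exactly as today */
                  ___update_util_uclamp_towards(now, last_update_time,
                                                period_contrib,
                                                &p->se.avg.util_avg_uclamp, 0);
                  return;
          }

          /* normal update path follows, unchanged */
  }

enqueue_task_fair() would then pass UTIL_UCLAMP_FORCE just before setting
on_rq.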
At the moment I'm treating util_avg_uclamp in the same way as util_est
after the comments in v1, which is independent and has its own code
path. We can go back to the old style, where util_avg_uclamp is closer
to how we treat se rather than a separate thing like util_est.
> [...]
On 19/03/2024 12:50, Hongyan Xia wrote:
> On 18/03/2024 18:21, Dietmar Eggemann wrote:
>> On 01/02/2024 14:11, Hongyan Xia wrote:
>>
>> [...]
>>
>>> /*
>>> * The code below (indirectly) updates schedutil which looks at
>>> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct
>>> task_struct *p, int flags)
>>> #ifdef CONFIG_UCLAMP_TASK
>>> util_uclamp_enqueue(&rq->cfs.avg, p);
>>> update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
>>> + if (migrated)
>>
>> IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
>>
>> if (flags & ENQUEUE_MIGRATED)
>
> I'm not sure if they are entirely equivalent. Both
> task_on_rq_migrating() and !task_on_rq_migrating() can have
> last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
> Maybe I'm missing something?
That's true. There are 2:
(!task_on_rq_migrating() && !p->se.avg.last_update_time)
cases:
(1) wakeup migration: ENQUEUE_MIGRATED (0x40) set in ttwu_do_wakeup()
(2) new task: wake_up_new_task() -> activate_task(), ENQUEUE_MIGRATED is
not set.
I assume you want to add the task's util_avg_uclamp to
rq->root_cfs_util_uclamp in (2) too? So:
if (flags & ENQUEUE_MIGRATED || task_new)
[...]
>>> /* avg must belong to the queue this se is on. */
>>> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
>>> +void update_util_uclamp(u64 now,
>>> + u64 last_update_time,
>>> + u32 period_contrib,
>>> + struct sched_avg *avg,
>>> + struct task_struct *p)
>>> {
>>
>> I was wondering why you use such a long parameter list for this
>> function.
>>
>> IMHO
>>
>> update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct
>> sched_entity *se)
>>
>> would work as well. You could check whether se represents a task inside
>> update_util_uclamp() as well as get last_update_time and period_contrib.
>>
>> The only reason I see is that you want to use this function for the RT
>> class as well later, where you have to deal with 'struct rt_rq' and
>> 'struct sched_rt_entity'.
>>
>> IMHO, it's always better to keep the implementation to the minimum and
>> only introduce changes which are related to the functionality you
>> present. This would make reviewing so much easier.
>
> Those parameters are necessary because of
>
> if (___update_load_sum()) {
>         ___update_load_avg();
>         update_util_uclamp();
> }
So you need ___update_load_avg() happening for the `normal uclamp path`
and `last_update_time` and `period_contrib` for the `decay path` (1) of
update_util_uclamp()?
This is pretty hard to grasp. Isn't there a cleaner solution for this?
Why do you need the:
if (!avg)
return;
thing in update_util_uclamp()? __update_load_avg_blocked_se() calls
update_util_uclamp(..., avg = NULL, ...) but this should follow (1)?
> We have to cache last_update_time and period_contrib, because after
> ___update_load_sum() is done, both parameters in sched_avg have already
> been updated and overwritten and we lose the timestamp when the
> sched_avg was previously updated. update_util_uclamp() needs to know
> when sched_avg was previously updated.
>
>>
>>> unsigned int util, uclamp_min, uclamp_max;
>>> int delta;
>>> - if (!p->se.on_rq)
>>> + if (!p->se.on_rq) {
>>> + ___update_util_uclamp_towards(now,
>>> + last_update_time,
>>> + period_contrib,
>>> + &p->se.avg.util_avg_uclamp,
>>> + 0);
>>> return;
>>> + }
>>
>> You decay 'p->se.avg.util_avg_uclamp' which is not really related to
>> root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.
>
> I would say this still belongs to 3/7, because 2/7 only implements
> utilization for on_rq tasks. This patch implements utilization for both
> on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
> on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).
Looks like you maintain `rq->cfs.avg.util_avg_uclamp` (2) (consider all
runnable tasks) to be able to:
(a) set rq->root_cfs_util_uclamp (3) to max((3), (2))
(b) check that if 'rq->cfs.h_nr_running == 0' that (2) = 0 too.
uclamp is based on runnable tasks (so enqueue/dequeue) but you uclamp
around util_avg which has contributions from blocked tasks. And that's
why you have to maintain (3). And (3) only decays within
__update_load_avg_cfs_rq().
Can this be the reason why the implementation is so convoluted? It's
definitely more complicated than util_est with its easy layout:

enqueue_task_fair()
  util_est_enqueue()

dequeue_task_fair()
  util_est_dequeue()
  util_est_update()
> For tasks, we also have two utilization numbers, one is on_rq and the
> other is on_rq plus !on_rq. However, we know they do not have to be
> stored in separate variables and util_avg_uclamp can capture both.
Here you lost me. Which value does 'p->se.avg.util_avg_uclamp' store?
'runnable' or 'runnable + blocking'? I would say it's the latter one but
like in PELT we don't update the task signal when it's sleeping.
> Moving this to 2/7 might be fine, although then this would be the only
> !on_rq utilization in 2/7. I can add comments to clarify the situation.
>
>> This is the util_avg_uclamp handling for a se (task):
>>
>> enqueue_task_fair()
>>
>> util_uclamp_enqueue()
>>
>> update_util_uclamp() (1)
>>
>> if (!p->se.on_rq) (x)
>> ___update_util_uclamp_towards() (2)
>>
>> dequeue_task_fair()
>>
>> util_uclamp_dequeue()
>>
>> __update_load_avg_blocked_se()
>>
>> update_util_uclamp()
>>
>> (x)
>>
>> __update_load_avg_se()
>>
>> update_util_uclamp()
>>
>> (x)
>>
>> Why is it so unbalanced? Why do you need (1) and (2)?
>>
>> Isn't this just an indication that the se util_avg_uclamp
>> is done at the wrong places?
>>
>> Is there an other way to provide a task/rq signal as the base
>> for uclamp sum aggregation?
>
> (2) won't happen, because at that point p->se.on_rq must be 1.
I see.
> The sequence during enqueue_task_fair() is:
>
> enqueue_task_fair(p)
> enqueue_entity(se)
> update_load_avg(se)
> update_util_uclamp(p) (decay path)
> p->se.on_rq = 1;
> util_uclamp_enqueue(p)
> update_util_uclamp(p) (update path)
>
> The only reason why we want to issue update_util_uclamp() after seeing
> on_rq == 1 is that now it goes down the normal uclamp path and not the
> decay path. Otherwise, uclamp won't immediately engage during enqueue
> and has to wait a timer tick.
OK, I see, the task contribution should be visible immediately after the
enqueue.
> Ideally, we should:
>
> enqueue_task_fair(p)
> enqueue_entity(se)
> update_load_avg(se)
> util_uclamp_enqueue(p)
> update_util_uclamp(p) (force update path)
> p->se.on_rq = 1;
>
> This requires us to invent a new flag to update_util_uclamp() to force
> the update path even when p->se.on_rq is 0.
I guess you have to go this way to achieve a cleaner/easier integration
of 'util_avg_uclamp'.
> At the moment I'm treating util_avg_uclamp in the same way as util_est
> after the comments in v1, which is independent and has its own code
> path. We can go back to the old style, where util_avg_uclamp is closer
> to how we treat se rather than a separate thing like util_est.
Except that 'util_est' integration is much easier to understand. And
this is because 'util_est' is clearly runnable-based only and is not
linked to any blocked part.
[...]
On 20/03/2024 15:27, Dietmar Eggemann wrote:
> On 19/03/2024 12:50, Hongyan Xia wrote:
>> On 18/03/2024 18:21, Dietmar Eggemann wrote:
>>> On 01/02/2024 14:11, Hongyan Xia wrote:
>>>
>>> [...]
>>>
>>>> /*
>>>> * The code below (indirectly) updates schedutil which looks at
>>>> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct
>>>> task_struct *p, int flags)
>>>> #ifdef CONFIG_UCLAMP_TASK
>>>> util_uclamp_enqueue(&rq->cfs.avg, p);
>>>> update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
>>>> + if (migrated)
>>>
>>> IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
>>>
>>> if (flags & ENQUEUE_MIGRATED)
>>
>> I'm not sure if they are entirely equivalent. Both
>> task_on_rq_migrating() and !task_on_rq_migrating() can have
>> last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
>> Maybe I'm missing something?
>
> That's true. There are 2:
>
> (!task_on_rq_migrating() && !p->se.avg.last_update_time)
>
> cases:
>
> (1) wakeup migration: ENQUEUE_MIGRATED (0x40) set in ttwu_do_wakeup()
>
> (2) new task: wake_up_new_task() -> activate_task(), ENQUEUE_MIGRATED is
> not set.
>
> I assume you want to add the task's util_avg_uclamp to
> rq->root_cfs_util_uclamp in (2) too? So:
>
> if (flags & ENQUEUE_MIGRATED || task_new)
I see. Maybe we don't need to check last_update_time. I'll take a look.
Although if we want to integrate sum aggregation with se rather than
making it fully independent (as in your comments below) like util_est,
we'll probably just do that in update_load_avg():

  if (!se->avg.last_update_time && (flags & DO_ATTACH)) { ... }

and not bother doing it outside the loop in enqueue_task_fair() anyway.
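Roughly like this (sketch only, on top of the existing DO_ATTACH handling
in update_load_avg()):

  if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
          attach_entity_load_avg(cfs_rq, se);
          update_tg_load_avg(cfs_rq);

          /* newly attached task: add its contribution to the root sum */
          if (entity_is_task(se))
                  rq_of(cfs_rq)->root_cfs_util_uclamp +=
                          READ_ONCE(se->avg.util_avg_uclamp);
  }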
> [...]
>
>>>> /* avg must belong to the queue this se is on. */
>>>> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
>>>> +void update_util_uclamp(u64 now,
>>>> + u64 last_update_time,
>>>> + u32 period_contrib,
>>>> + struct sched_avg *avg,
>>>> + struct task_struct *p)
>>>> {
>>>
>>> I was wondering why you use such a long parameter list for this
>>> function.
>>>
>>> IMHO
>>>
>>> update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct
>>> sched_entity *se)
>>>
>>> would work as well. You could check whether se represents a task inside
>>> update_util_uclamp() as well as get last_update_time and period_contrib.
>>>
>>> The only reason I see is that you want to use this function for the RT
>>> class as well later, where you have to deal with 'struct rt_rq' and
>>> 'struct sched_rt_entity'.
>>>
>>> IMHO, it's always better to keep the implementation to the minimum and
>>> only introduce changes which are related to the functionality you
>>> present. This would make reviewing so much easier.
>>
>> Those parameters are necessary because of
>>
>> if (___update_load_sum()) {
>>         ___update_load_avg();
>>         update_util_uclamp();
>> }
>
> So you need ___update_load_avg() happening for the `normal uclamp path`
> and `last_update_time` and `period_contrib` for the `decay path` (1) of
> update_util_uclamp()?
>
> This is pretty hard to grasp. Isn't there a cleaner solution for this?
You are correct. Not sure if there's a better way.
> Why do you need the:
>
> if (!avg)
> return;
>
> thing in update_util_uclamp()? __update_load_avg_blocked_se() calls
> update_util_uclamp(..., avg = NULL, ...) but this should follow (1)?
I added it as a guard to rule out edge cases, but I think when you do
__update_load_avg_blocked_se(), it has to be !on_rq so it will never
enter this path. I think it doesn't hurt but I can remove it.
>> We have to cache last_update_time and period_contrib, because after
>> ___update_load_sum() is done, both parameters in sched_avg have already
>> been updated and overwritten and we lose the timestamp when the
>> sched_avg was previously updated. update_util_uclamp() needs to know
>> when sched_avg was previously updated.
>>
>>>
>>>> unsigned int util, uclamp_min, uclamp_max;
>>>> int delta;
>>>> - if (!p->se.on_rq)
>>>> + if (!p->se.on_rq) {
>>>> + ___update_util_uclamp_towards(now,
>>>> + last_update_time,
>>>> + period_contrib,
>>>> + &p->se.avg.util_avg_uclamp,
>>>> + 0);
>>>> return;
>>>> + }
>>>
>>> You decay 'p->se.avg.util_avg_uclamp' which is not really related to
>>> root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.
>>
>> I would say this still belongs to 3/7, because 2/7 only implements
>> utilization for on_rq tasks. This patch implements utilization for both
>> on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
>> on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).
>
> Looks like you maintain `rq->cfs.avg.util_avg_uclamp` (2) (consider all
> runnable tasks) to be able to:
>
> (a) set rq->root_cfs_util_uclamp (3) to max((3), (2))
>
> (b) check that if 'rq->cfs.h_nr_running == 0' that (2) = 0 too.
>
> uclamp is based on runnable tasks (so enqueue/dequeue) but you uclamp
> around util_avg which has contributions from blocked tasks. And that's
> why you have to maintain (3). And (3) only decays within
> __update_load_avg_cfs_rq().
> Can this be the reason why the implementation is so convoluted? It's
> definitely more complicated than util_est with its easy layout:
>
> enqueue_task_fair()
>   util_est_enqueue()
>
> dequeue_task_fair()
>   util_est_dequeue()
>   util_est_update()
There's only one rule here, which is that the root value must be the sum
of all task util_avg_uclamp values. If PELT slowly decays each task's
util_avg, then the sum also decays in a similar fashion. The code here
just does that.
This 'convoluted' property is from PELT and not from sum aggregation.
Actually this is what PELT used to do a while back, when the root CFS
util was the sum of all tasks instead of being tracked separately, and
we do the same decay here as we did back then.
You are right that this feels convoluted, but sum aggregation doesn't
attempt to change how PELT works so the decaying property is there due
to PELT. I can keep exploring new ways to make the logic easier to follow.
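
As a toy example with made-up numbers: A = 100 is still on the rq, B =
200 and C = 300 have just been dequeued, and y^p = 0.5 has elapsed:

  root_old = A + B + C = 600, cfs_sum = A = 100

  per-task decay:  100 + 200 * 0.5 + 300 * 0.5  = 350
  decayed diff:    100 + (600 - 100) * 0.5      = 350

Both give the same result, which is why decaying the diff is enough.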
>> For tasks, we also have two utilization numbers, one is on_rq and the
>> other is on_rq plus !on_rq. However, we know they do not have to be
>> stored in separate variables and util_avg_uclamp can capture both.
>
> Here you lost me. Which value does 'p->se.avg.util_avg_uclamp' store?
> 'runnable' or 'runnable + blocking'? I would say it's the latter one but
> like in PELT we don't update the task signal when it's sleeping.
The latter. We don't update the task signal when it's sleeping but we do
when we need to use it, and that's enough. This is also the case for all
blocked util_avg.
>> Moving this to 2/7 might be fine, although then this would be the only
>> !on_rq utilization in 2/7. I can add comments to clarify the situation.
>>
>>> This is the util_avg_uclamp handling for a se (task):
>>>
>>> enqueue_task_fair()
>>>   util_uclamp_enqueue()
>>>   update_util_uclamp()                  (1)
>>>     if (!p->se.on_rq)                   (x)
>>>       ___update_util_uclamp_towards()   (2)
>>>
>>> dequeue_task_fair()
>>>   util_uclamp_dequeue()
>>>
>>> __update_load_avg_blocked_se()
>>>   update_util_uclamp()
>>>     (x)
>>>
>>> __update_load_avg_se()
>>>   update_util_uclamp()
>>>     (x)
>>>
>>> Why is it so unbalanced? Why do you need (1) and (2)?
>>>
>>> Isn't this just an indication that the se util_avg_uclamp
>>> is done at the wrong places?
>>>
>>> Is there another way to provide a task/rq signal as the base
>>> for uclamp sum aggregation?
>>
>> (2) won't happen, because at that point p->se.on_rq must be 1.
>
> I see.
>
>> The sequence during enqueue_task_fair() is:
>>
>> enqueue_task_fair(p)
>>   enqueue_entity(se)
>>     update_load_avg(se)
>>       update_util_uclamp(p)             (decay path)
>>     p->se.on_rq = 1;
>>   util_uclamp_enqueue(p)
>>   update_util_uclamp(p)                 (update path)
>>
>> The only reason why we want to issue update_util_uclamp() after seeing
>> on_rq == 1 is that now it goes down the normal uclamp path and not the
>> decay path. Otherwise, uclamp won't immediately engage during enqueue
>> and has to wait a timer tick.
>
> OK, I see, the task contribution should be visible immediately after the
> enqueue.
>
>> Ideally, we should:
>>
>> enqueue_task_fair(p)
>>   enqueue_entity(se)
>>     update_load_avg(se)
>>     util_uclamp_enqueue(p)
>>     update_util_uclamp(p)               (force update path)
>>     p->se.on_rq = 1;
>>
>> This requires us to invent a new flag to update_util_uclamp() to force
>> the update path even when p->se.on_rq is 0.
>
> I guess you have to go this way to achieve a cleaner/easier integration
> of 'util_avg_uclamp'.
If we don't want to keep util_avg_uclamp separately like util_est but
move it closer to se, then we can explore this option.
>> At the moment I'm treating util_avg_uclamp in the same way as util_est
>> after the comments in v1, which is independent and has its own code
>> path. We can go back to the old style, where util_avg_uclamp is closer
>> to how we treat se rather than a separate thing like util_est.
>
> Except that 'util_est' integration is much easier to understand. And
> this is because 'util_est' is clearly runnable-based only and is not
> linked to any blocked part.
True.
Well, I can go back to the style in RFC v1. One big advantage of v2 is
that we can compile out the support of uclamp very easily because it's
treated more or less like an independent thing.
> [...]
>