[v2] Fix relationship between uclamp and fits_capacity()

[PATCH v2 8/9] sched/fair: Detect capacity inversion

Posted by Qais Yousef 3 years, 8 months ago

Check each performance domain to see if thermal pressure is causing its
capacity to be lower than another performance domain.

We assume that each performance domain has CPUs with the same
capacities, which is similar to an assumption made in energy_model.c

We also assume that thermal pressure impacts all CPUs in a performance
domain equally.

If there're multiple performance domains with the same capacity_orig, we
will trigger a capacity inversion if the domain is under thermal
pressure.

The new cpu_in_capacity_inversion() should help users to know when
information about capacity_orig are not reliable and can opt in to use
the inverted capacity as the 'actual' capacity_orig.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
---
 kernel/sched/fair.c  | 63 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h | 19 +++++++++++++
 2 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59ba7106ddc6..cb32dc9a057f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8659,16 +8659,73 @@ static unsigned long scale_rt_capacity(int cpu)
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
+	unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
 	unsigned long capacity = scale_rt_capacity(cpu);
 	struct sched_group *sdg = sd->groups;
+	struct rq *rq = cpu_rq(cpu);
 
-	cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
+	rq->cpu_capacity_orig = capacity_orig;
 
 	if (!capacity)
 		capacity = 1;
 
-	cpu_rq(cpu)->cpu_capacity = capacity;
-	trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+	rq->cpu_capacity = capacity;
+
+	/*
+	 * Detect if the performance domain is in capacity inversion state.
+	 *
+	 * Capacity inversion happens when another perf domain with equal or
+	 * lower capacity_orig_of() ends up having higher capacity than this
+	 * domain after subtracting thermal pressure.
+	 *
+	 * We only take into account thermal pressure in this detection as it's
+	 * the only metric that actually results in *real* reduction of
+	 * capacity due to performance points (OPPs) being dropped/become
+	 * unreachable due to thermal throttling.
+	 *
+	 * We assume:
+	 *   * That all cpus in a perf domain have the same capacity_orig
+	 *     (same uArch).
+	 *   * Thermal pressure will impact all cpus in this perf domain
+	 *     equally.
+	 */
+	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+		unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
+		struct perf_domain *pd = rcu_dereference(rq->rd->pd);
+
+		rq->cpu_capacity_inverted = 0;
+
+		for (; pd; pd = pd->next) {
+			struct cpumask *pd_span = perf_domain_span(pd);
+			unsigned long pd_cap_orig, pd_cap;
+
+			cpu = cpumask_any(pd_span);
+			pd_cap_orig = arch_scale_cpu_capacity(cpu);
+
+			if (capacity_orig < pd_cap_orig)
+				continue;
+
+			/*
+			 * handle the case of multiple perf domains have the
+			 * same capacity_orig but one of them is under higher
+			 * thermal pressure. We record it as capacity
+			 * inversion.
+			 */
+			if (capacity_orig == pd_cap_orig) {
+				pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
+
+				if (pd_cap > inv_cap) {
+					rq->cpu_capacity_inverted = inv_cap;
+					break;
+				}
+			} else if (pd_cap_orig > inv_cap) {
+				rq->cpu_capacity_inverted = inv_cap;
+				break;
+			}
+		}
+	}
+
+	trace_sched_cpu_capacity_tp(rq);
 
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index caf017f7def6..541a70fa55b3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1033,6 +1033,7 @@ struct rq {
 
 	unsigned long		cpu_capacity;
 	unsigned long		cpu_capacity_orig;
+	unsigned long		cpu_capacity_inverted;
 
 	struct callback_head	*balance_callback;
 
@@ -2865,6 +2866,24 @@ static inline unsigned long capacity_orig_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity_orig;
 }
 
+/*
+ * Returns inverted capacity if the CPU is in capacity inversion state.
+ * 0 otherwise.
+ *
+ * Capacity inversion detection only considers thermal impact where actual
+ * performance points (OPPs) gets dropped.
+ *
+ * Capacity inversion state happens when another performance domain that has
+ * equal or lower capacity_orig_of() becomes effectively larger than the perf
+ * domain this CPU belongs to due to thermal pressure throttling it hard.
+ *
+ * See comment in update_cpu_capacity().
+ */
+static inline unsigned long cpu_in_capacity_inversion(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_inverted;
+}
+
 /**
  * enum cpu_util_type - CPU utilization type
  * @FREQUENCY_UTIL:	Utilization used to select frequency
-- 
2.25.1

Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion

Posted by Dietmar Eggemann 3 years, 5 months ago

- Qais Yousef <qais.yousef@arm.com>
+ Qais Yousef <qyousef@layalina.io>

On 04/08/2022 16:36, Qais Yousef wrote:

I was surprised to see these capacity inversion patches in v2. They were
not part of v1 so I never review them (even internally).

> Check each performance domain to see if thermal pressure is causing its

I guess that's `PELT thermal pressure` instead of `instantaneous thermal
pressure`. IMHO an important detail here to understand the patch.

> capacity to be lower than another performance domain.
                            ^^^^^^^
s/another/next lower (CPU capacity) level/ ?

I assume that is the definition of `capacity inversion`? IMHO it
appeared the first time in your discussion with Xuewen and Lukasz:

https://lkml.kernel.org/r/20220503144352.lxduzhl6jq6xdhw2@airbuntu

> We assume that each performance domain has CPUs with the same
> capacities, which is similar to an assumption made in energy_model.c
> 
> We also assume that thermal pressure impacts all CPUs in a performance
> domain equally.
> 
> If there're multiple performance domains with the same capacity_orig, we

Not aware of such a system. At least it wouldn't make much sense. Not
sure if EAS would correctly work on such a system.

> will trigger a capacity inversion if the domain is under thermal
> pressure.
> 
> The new cpu_in_capacity_inversion() should help users to know when
> information about capacity_orig are not reliable and can opt in to use
> the inverted capacity as the 'actual' capacity_orig.
> 
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> ---
>  kernel/sched/fair.c  | 63 +++++++++++++++++++++++++++++++++++++++++---
>  kernel/sched/sched.h | 19 +++++++++++++
>  2 files changed, 79 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 59ba7106ddc6..cb32dc9a057f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8659,16 +8659,73 @@ static unsigned long scale_rt_capacity(int cpu)
>  
>  static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
> +	unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
>  	unsigned long capacity = scale_rt_capacity(cpu);
>  	struct sched_group *sdg = sd->groups;
> +	struct rq *rq = cpu_rq(cpu);
>  
> -	cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
> +	rq->cpu_capacity_orig = capacity_orig;
>  
>  	if (!capacity)
>  		capacity = 1;
>  
> -	cpu_rq(cpu)->cpu_capacity = capacity;
> -	trace_sched_cpu_capacity_tp(cpu_rq(cpu));
> +	rq->cpu_capacity = capacity;
> +
> +	/*
> +	 * Detect if the performance domain is in capacity inversion state.
> +	 *
> +	 * Capacity inversion happens when another perf domain with equal or
> +	 * lower capacity_orig_of() ends up having higher capacity than this
> +	 * domain after subtracting thermal pressure.
> +	 *
> +	 * We only take into account thermal pressure in this detection as it's
> +	 * the only metric that actually results in *real* reduction of
> +	 * capacity due to performance points (OPPs) being dropped/become
> +	 * unreachable due to thermal throttling.
> +	 *
> +	 * We assume:
> +	 *   * That all cpus in a perf domain have the same capacity_orig
> +	 *     (same uArch).
> +	 *   * Thermal pressure will impact all cpus in this perf domain
> +	 *     equally.
> +	 */
> +	if (static_branch_unlikely(&sched_asym_cpucapacity)) {

This should be sched_energy_enabled(). Performance Domains (PDs) are an
EAS thing.

> +		unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);

rcu_read_lock()

> +		struct perf_domain *pd = rcu_dereference(rq->rd->pd);

rcu_read_unlock()

It's called from build_sched_domains() too. I assume
static_branch_unlikely(&sched_asym_cpucapacity) hides this issue so far.

> +
> +		rq->cpu_capacity_inverted = 0;
> +
> +		for (; pd; pd = pd->next) {
> +			struct cpumask *pd_span = perf_domain_span(pd);
> +			unsigned long pd_cap_orig, pd_cap;
> +
> +			cpu = cpumask_any(pd_span);
> +			pd_cap_orig = arch_scale_cpu_capacity(cpu);
> +
> +			if (capacity_orig < pd_cap_orig)
> +				continue;
> +
> +			/*
> +			 * handle the case of multiple perf domains have the
> +			 * same capacity_orig but one of them is under higher

Like I said above, I'm not aware of such an EAS system.

> +			 * thermal pressure. We record it as capacity
> +			 * inversion.
> +			 */
> +			if (capacity_orig == pd_cap_orig) {
> +				pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> +
> +				if (pd_cap > inv_cap) {
> +					rq->cpu_capacity_inverted = inv_cap;
> +					break;
> +				}

In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
erroneously since thermal_load_avg(rq) can return different values for
inv_cap and pd_cap.

So even on a classical big little system, this condition can set
rq->cpu_capacity_inverted for a CPU in the little or big cluster.

thermal_load_avg(rq) would have to stay constant for all CPUs within the
PD to avoid this.

This is one example of the `thermal pressure` is per PD (or Frequency
Domain) in Thermal but per-CPU in the task scheduler.



> +			} else if (pd_cap_orig > inv_cap) {
> +				rq->cpu_capacity_inverted = inv_cap;
> +				break;
> +			}
> +		}
> +	}
> +
> +	trace_sched_cpu_capacity_tp(rq);
>  
>  	sdg->sgc->capacity = capacity;
>  	sdg->sgc->min_capacity = capacity;

[...]

Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion

Posted by Qais Yousef 3 years, 5 months ago

On 11/09/22 11:42, Dietmar Eggemann wrote:

[...]

> > +	/*
> > +	 * Detect if the performance domain is in capacity inversion state.
> > +	 *
> > +	 * Capacity inversion happens when another perf domain with equal or
> > +	 * lower capacity_orig_of() ends up having higher capacity than this
> > +	 * domain after subtracting thermal pressure.
> > +	 *
> > +	 * We only take into account thermal pressure in this detection as it's
> > +	 * the only metric that actually results in *real* reduction of
> > +	 * capacity due to performance points (OPPs) being dropped/become
> > +	 * unreachable due to thermal throttling.
> > +	 *
> > +	 * We assume:
> > +	 *   * That all cpus in a perf domain have the same capacity_orig
> > +	 *     (same uArch).
> > +	 *   * Thermal pressure will impact all cpus in this perf domain
> > +	 *     equally.
> > +	 */
> > +	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> 
> This should be sched_energy_enabled(). Performance Domains (PDs) are an
> EAS thing.

Bummer. I had a version that used cpumasks only, but I thought using pds is
cleaner and will save unnecessarily extra traversing. But I missed that it's
conditional on sched_energy_enabled().

This is not good news for CAS.

> 
> > +		unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
> 
> rcu_read_lock()
> 
> > +		struct perf_domain *pd = rcu_dereference(rq->rd->pd);
> 
> rcu_read_unlock()

Shouldn't we continue to hold it while traversing the pd too?

> 
> It's called from build_sched_domains() too. I assume
> static_branch_unlikely(&sched_asym_cpucapacity) hides this issue so far.
> 
> > +
> > +		rq->cpu_capacity_inverted = 0;
> > +
> > +		for (; pd; pd = pd->next) {
> > +			struct cpumask *pd_span = perf_domain_span(pd);
> > +			unsigned long pd_cap_orig, pd_cap;
> > +
> > +			cpu = cpumask_any(pd_span);
> > +			pd_cap_orig = arch_scale_cpu_capacity(cpu);
> > +
> > +			if (capacity_orig < pd_cap_orig)
> > +				continue;
> > +
> > +			/*
> > +			 * handle the case of multiple perf domains have the
> > +			 * same capacity_orig but one of them is under higher
> 
> Like I said above, I'm not aware of such an EAS system.

I did argue against that. But Vincent's PoV was that we shouldn't make
assumptions and handle the case where we have big cores each on its own domain.

> 
> > +			 * thermal pressure. We record it as capacity
> > +			 * inversion.
> > +			 */
> > +			if (capacity_orig == pd_cap_orig) {
> > +				pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> > +
> > +				if (pd_cap > inv_cap) {
> > +					rq->cpu_capacity_inverted = inv_cap;
> > +					break;
> > +				}
> 
> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
> erroneously since thermal_load_avg(rq) can return different values for
> inv_cap and pd_cap.

Good catch!

> 
> So even on a classical big little system, this condition can set
> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
> 
> thermal_load_avg(rq) would have to stay constant for all CPUs within the
> PD to avoid this.
> 
> This is one example of the `thermal pressure` is per PD (or Frequency
> Domain) in Thermal but per-CPU in the task scheduler.

Only compile tested so far, does this patch address all your points? I should
get hardware soon to run some tests and send the patch. I might re-write it to
avoid using pds; though it seems cleaner this way but we miss CAS support.

Thoughts?


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89dadaafc1ec..b01854984994 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8856,16 +8856,24 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
         *   * Thermal pressure will impact all cpus in this perf domain
         *     equally.
         */
-       if (static_branch_unlikely(&sched_asym_cpucapacity)) {
-               unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
-               struct perf_domain *pd = rcu_dereference(rq->rd->pd);
+       if (sched_energy_enabled()) {
+               struct perf_domain *pd;
+               unsigned long inv_cap;
+
+               rcu_read_lock();

+               inv_cap = capacity_orig - thermal_load_avg(rq);
+               pd = rcu_dereference(rq->rd->pd);
                rq->cpu_capacity_inverted = 0;

                for (; pd; pd = pd->next) {
                        struct cpumask *pd_span = perf_domain_span(pd);
                        unsigned long pd_cap_orig, pd_cap;

+                       /* We can't be inverted against our own pd */
+                       if (cpumask_test_cpu(cpu_of(rq), pd_span))
+                               continue;
+
                        cpu = cpumask_any(pd_span);
                        pd_cap_orig = arch_scale_cpu_capacity(cpu);

@@ -8890,6 +8898,8 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
                                break;
                        }
                }
+
+               rcu_read_unlock();
        }


Thanks!

--
Qais Yousef

Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion

Posted by Dietmar Eggemann 3 years, 4 months ago

On 12/11/2022 20:35, Qais Yousef wrote:
> On 11/09/22 11:42, Dietmar Eggemann wrote:
> 

[...]

>>> +			 * thermal pressure. We record it as capacity
>>> +			 * inversion.
>>> +			 */
>>> +			if (capacity_orig == pd_cap_orig) {
>>> +				pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
>>> +
>>> +				if (pd_cap > inv_cap) {
>>> +					rq->cpu_capacity_inverted = inv_cap;
>>> +					break;
>>> +				}
>>
>> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
>> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
>> erroneously since thermal_load_avg(rq) can return different values for
>> inv_cap and pd_cap.
> 
> Good catch!
> 
>>
>> So even on a classical big little system, this condition can set
>> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
>>
>> thermal_load_avg(rq) would have to stay constant for all CPUs within the
>> PD to avoid this.
>>
>> This is one example of the `thermal pressure` is per PD (or Frequency
>> Domain) in Thermal but per-CPU in the task scheduler.
> 
> Only compile tested so far, does this patch address all your points? I should
> get hardware soon to run some tests and send the patch. I might re-write it to
> avoid using pds; though it seems cleaner this way but we miss CAS support.
> 
> Thoughts?

I still don't think that the `CPU capacity inversion` implementation
which uses `cap_orig' = cap_orig - thermal load avg (2)` instead of
`cap_orig'' = cap_orig - thermal pressure (1)` for inverted CPUs (i.e.
other PD exists w/ cap_orig > cap_orig') is the right answer, besides
the EAS vs. CAS coverage.

The basic question for me is why do we have to switch between (1) and
(2)? IIRC we introduced (1) in feec() to cater for the CPUfreq policy
min/max capping between schedutil and the CPUfreq driver
__resolve_freq() [drivers/cpufreq/cpufreq.c] (3).

The policy caps are set together with thermal pressure in
cpufreq_set_cur_state() [drivers/thermal/cpufreq_cooling.c] via
freq_qos_update_request().

I think we should only use (2) in the task scheduler even though the
EAS-schedutil machinery would be not 100% in sync in some cases due to (3).
Thermal load avg has similar properties like all the other EWMA-based
signals we use and we have to live with a certain degree of inaccuracy
anyway (e.g. also because of lock-less CPU statistic fetching when
selecting CPU).

And in this case we wouldn't have to have infrastructure to switch
between (1) and (2) at all.

To illustrate the problem I ran 2 workloads (hackbench/sleep) on a H960
board with step-wise thermal governor tracing thermal load_avg
(sched_pelt_thermal), thermal pressure (thermal_pressure_update) and CPU
capacity (sched_cpu_capacity). Would we really gain something
substantial reliably when we would know the diff between (1) and (2)?

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/thermal_pressure.ipynb

Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion

Posted by Qais Yousef 3 years, 4 months ago

On 11/16/22 18:45, Dietmar Eggemann wrote:
> On 12/11/2022 20:35, Qais Yousef wrote:
> > On 11/09/22 11:42, Dietmar Eggemann wrote:
> > 
> 
> [...]
> 
> >>> +			 * thermal pressure. We record it as capacity
> >>> +			 * inversion.
> >>> +			 */
> >>> +			if (capacity_orig == pd_cap_orig) {
> >>> +				pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> >>> +
> >>> +				if (pd_cap > inv_cap) {
> >>> +					rq->cpu_capacity_inverted = inv_cap;
> >>> +					break;
> >>> +				}
> >>
> >> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
> >> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
> >> erroneously since thermal_load_avg(rq) can return different values for
> >> inv_cap and pd_cap.
> > 
> > Good catch!
> > 
> >>
> >> So even on a classical big little system, this condition can set
> >> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
> >>
> >> thermal_load_avg(rq) would have to stay constant for all CPUs within the
> >> PD to avoid this.
> >>
> >> This is one example of the `thermal pressure` is per PD (or Frequency
> >> Domain) in Thermal but per-CPU in the task scheduler.
> > 
> > Only compile tested so far, does this patch address all your points? I should
> > get hardware soon to run some tests and send the patch. I might re-write it to
> > avoid using pds; though it seems cleaner this way but we miss CAS support.
> > 
> > Thoughts?
> 
> I still don't think that the `CPU capacity inversion` implementation
> which uses `cap_orig' = cap_orig - thermal load avg (2)` instead of
> `cap_orig'' = cap_orig - thermal pressure (1)` for inverted CPUs (i.e.
> other PD exists w/ cap_orig > cap_orig') is the right answer, besides
> the EAS vs. CAS coverage.
> 
> The basic question for me is why do we have to switch between (1) and
> (2)? IIRC we introduced (1) in feec() to cater for the CPUfreq policy
> min/max capping between schedutil and the CPUfreq driver
> __resolve_freq() [drivers/cpufreq/cpufreq.c] (3).
> 
> The policy caps are set together with thermal pressure in
> cpufreq_set_cur_state() [drivers/thermal/cpufreq_cooling.c] via
> freq_qos_update_request().
> 
> I think we should only use (2) in the task scheduler even though the
> EAS-schedutil machinery would be not 100% in sync in some cases due to (3).
> Thermal load avg has similar properties like all the other EWMA-based
> signals we use and we have to live with a certain degree of inaccuracy
> anyway (e.g. also because of lock-less CPU statistic fetching when
> selecting CPU).
> 
> And in this case we wouldn't have to have infrastructure to switch
> between (1) and (2) at all.
> 
> To illustrate the problem I ran 2 workloads (hackbench/sleep) on a H960
> board with step-wise thermal governor tracing thermal load_avg
> (sched_pelt_thermal), thermal pressure (thermal_pressure_update) and CPU
> capacity (sched_cpu_capacity). Would we really gain something
> substantial reliably when we would know the diff between (1) and (2)?
> 
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/thermal_pressure.ipynb
> 

So what you're asking for is to switch to this?

	diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
	index b01854984994..989f1947bd34 100644
	--- a/kernel/sched/fair.c
	+++ b/kernel/sched/fair.c
	@@ -8862,7 +8862,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)

			rcu_read_lock();

	-               inv_cap = capacity_orig - thermal_load_avg(rq);
	+               inv_cap = capacity_orig - arch_scale_thermal_pressure(rq);
			pd = rcu_dereference(rq->rd->pd);
			rq->cpu_capacity_inverted = 0;

	@@ -8887,7 +8887,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
				 * inversion.
				 */
				if (capacity_orig == pd_cap_orig) {
	-                               pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
	+                               pd_cap = pd_cap_orig - arch_scale_thermal_pressure(cpu_rq(cpu));

					if (pd_cap > inv_cap) {
						rq->cpu_capacity_inverted = inv_cap;

My main worry is that rq->cpu_capacity which is updated in the same location
uses thermal_load_avg(). The consistency was important IMO. Besides I think we
need good certainty the inversion is there - we don't want to be oscillating.
Say the big core thermal pressure is increasing and it is entering capacity
inversion. If we don't use the average we'd be avoiding the cpu one tick, but
place something that drives freq high on it the next tick. This ping-pong could
end up not giving the big cores some breathing room to cool down and settle on
one state, no?

I think Lukasz patch [1] is very important in controlling this aspect. And it
might help make the code more consistent by enabling switching all users to
thermal_load_avg() if we can speed up its reaction time sufficiently.

That said; I don't have a bullet proof answer or a very strong opinion about
it. Either direction we take I think we'll have room for improvements. It
seemed the safer more sensible option to me at this stage.

[1] https://lore.kernel.org/lkml/20220429091245.12423-1-lukasz.luba@arm.com/


Thanks!

--
Qais Yousef

[tip: sched/core] sched/fair: Detect capacity inversion

Posted by tip-bot2 for Qais Yousef 3 years, 5 months ago

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     44c7b80bffc3a657a36857098d5d9c49d94e652b
Gitweb:        https://git.kernel.org/tip/44c7b80bffc3a657a36857098d5d9c49d94e652b
Author:        Qais Yousef <qais.yousef@arm.com>
AuthorDate:    Thu, 04 Aug 2022 15:36:08 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 27 Oct 2022 11:01:20 +02:00

sched/fair: Detect capacity inversion

Check each performance domain to see if thermal pressure is causing its
capacity to be lower than another performance domain.

We assume that each performance domain has CPUs with the same
capacities, which is similar to an assumption made in energy_model.c

We also assume that thermal pressure impacts all CPUs in a performance
domain equally.

If there're multiple performance domains with the same capacity_orig, we
will trigger a capacity inversion if the domain is under thermal
pressure.

The new cpu_in_capacity_inversion() should help users to know when
information about capacity_orig are not reliable and can opt in to use
the inverted capacity as the 'actual' capacity_orig.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-9-qais.yousef@arm.com
---
 kernel/sched/fair.c  | 63 ++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h | 19 +++++++++++++-
 2 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f32acb..4c4ea47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8824,16 +8824,73 @@ static unsigned long scale_rt_capacity(int cpu)
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
+	unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
 	unsigned long capacity = scale_rt_capacity(cpu);
 	struct sched_group *sdg = sd->groups;
+	struct rq *rq = cpu_rq(cpu);
 
-	cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
+	rq->cpu_capacity_orig = capacity_orig;
 
 	if (!capacity)
 		capacity = 1;
 
-	cpu_rq(cpu)->cpu_capacity = capacity;
-	trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+	rq->cpu_capacity = capacity;
+
+	/*
+	 * Detect if the performance domain is in capacity inversion state.
+	 *
+	 * Capacity inversion happens when another perf domain with equal or
+	 * lower capacity_orig_of() ends up having higher capacity than this
+	 * domain after subtracting thermal pressure.
+	 *
+	 * We only take into account thermal pressure in this detection as it's
+	 * the only metric that actually results in *real* reduction of
+	 * capacity due to performance points (OPPs) being dropped/become
+	 * unreachable due to thermal throttling.
+	 *
+	 * We assume:
+	 *   * That all cpus in a perf domain have the same capacity_orig
+	 *     (same uArch).
+	 *   * Thermal pressure will impact all cpus in this perf domain
+	 *     equally.
+	 */
+	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+		unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
+		struct perf_domain *pd = rcu_dereference(rq->rd->pd);
+
+		rq->cpu_capacity_inverted = 0;
+
+		for (; pd; pd = pd->next) {
+			struct cpumask *pd_span = perf_domain_span(pd);
+			unsigned long pd_cap_orig, pd_cap;
+
+			cpu = cpumask_any(pd_span);
+			pd_cap_orig = arch_scale_cpu_capacity(cpu);
+
+			if (capacity_orig < pd_cap_orig)
+				continue;
+
+			/*
+			 * handle the case of multiple perf domains have the
+			 * same capacity_orig but one of them is under higher
+			 * thermal pressure. We record it as capacity
+			 * inversion.
+			 */
+			if (capacity_orig == pd_cap_orig) {
+				pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
+
+				if (pd_cap > inv_cap) {
+					rq->cpu_capacity_inverted = inv_cap;
+					break;
+				}
+			} else if (pd_cap_orig > inv_cap) {
+				rq->cpu_capacity_inverted = inv_cap;
+				break;
+			}
+		}
+	}
+
+	trace_sched_cpu_capacity_tp(rq);
 
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d6d488e..5f18460 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1041,6 +1041,7 @@ struct rq {
 
 	unsigned long		cpu_capacity;
 	unsigned long		cpu_capacity_orig;
+	unsigned long		cpu_capacity_inverted;
 
 	struct balance_callback *balance_callback;
 
@@ -2878,6 +2879,24 @@ static inline unsigned long capacity_orig_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity_orig;
 }
 
+/*
+ * Returns inverted capacity if the CPU is in capacity inversion state.
+ * 0 otherwise.
+ *
+ * Capacity inversion detection only considers thermal impact where actual
+ * performance points (OPPs) gets dropped.
+ *
+ * Capacity inversion state happens when another performance domain that has
+ * equal or lower capacity_orig_of() becomes effectively larger than the perf
+ * domain this CPU belongs to due to thermal pressure throttling it hard.
+ *
+ * See comment in update_cpu_capacity().
+ */
+static inline unsigned long cpu_in_capacity_inversion(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_inverted;
+}
+
 /**
  * enum cpu_util_type - CPU utilization type
  * @FREQUENCY_UTIL:	Utilization used to select frequency