cpuidle: governors: teo: Special-case nohz_full CPUs

[PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs

Posted by Rafael J. Wysocki 5 months, 1 week ago

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

This change follows an analogous modification of the menu governor [1].

Namely, when the governor runs on a nohz_full CPU and there are no user
space timers in the workload on that CPU, it ends up selecting idle
states with target residency values above TICK_NSEC, or the deepest
enabled idle state in the absence of any of those, all the time due to
a tick_nohz_tick_stopped() check designed for running on CPUs where the
tick is not permanently disabled.  In that case, the fact that the tick
has been stopped means that the CPU was expected to be idle sufficiently
long previously, so it is not unreasonable to expect it to be idle
sufficiently long again, but this inference does not apply to nohz_full
CPUs.

In some cases, latency in the workload grows undesirably as a result of
selecting overly deep idle states, and the workload may also consume
more energy than necessary if the CPU does not spend enough time in the
selected deep idle state.

Address this by amending the tick_nohz_tick_stopped() check in question
with a tick_nohz_full_cpu() one to avoid effectively ignoring all
shallow idle states on nohz_full CPUs.  While doing so introduces a risk
of getting stuck in a shallow idle state for a long time, that only
affects energy efficiently, but the current behavior potentially hurts
both energy efficiency and performance that is arguably the priority for
nohz_full CPUs.

While at it, add a comment explaining the logic in teo_state_ok().

Link: https://lore.kernel.org/linux-pm/2244365.irdbgypaU6@rafael.j.wysocki/ [1]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/teo.c |   18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -227,9 +227,17 @@
 	cpu_data->total += PULSE;
 }
 
-static bool teo_state_ok(int i, struct cpuidle_driver *drv)
+static bool teo_state_ok(int i, struct cpuidle_driver *drv, struct cpuidle_device *dev)
 {
-	return !tick_nohz_tick_stopped() ||
+	/*
+	 * If the scheduler tick has been stopped already, avoid selecting idle
+	 * states with target residency below the tick period length under the
+	 * assumption that the CPU is likely to be idle sufficiently long for
+	 * the tick to be stopped again (or the tick would not have been
+	 * stopped previously in the first place).  However, do not do that on
+	 * nohz_full CPUs where the above assumption does not hold.
+	 */
+	return !tick_nohz_tick_stopped() || tick_nohz_full_cpu(dev->cpu) ||
 		drv->states[i].target_residency_ns >= TICK_NSEC;
 }
 
@@ -379,7 +387,7 @@
 				 * shallow or disabled, in which case take the
 				 * first enabled state that is deep enough.
 				 */
-				if (teo_state_ok(i, drv) &&
+				if (teo_state_ok(i, drv, dev) &&
 				    !dev->states_usage[i].disable) {
 					idx = i;
 					break;
@@ -391,7 +399,7 @@
 			if (dev->states_usage[i].disable)
 				continue;
 
-			if (teo_state_ok(i, drv)) {
+			if (teo_state_ok(i, drv, dev)) {
 				/*
 				 * The current state is deep enough, but still
 				 * there may be a better one.
@@ -460,7 +468,7 @@
 	 */
 	if (drv->states[idx].target_residency_ns > duration_ns) {
 		i = teo_find_shallower_state(drv, dev, idx, duration_ns, false);
-		if (teo_state_ok(i, drv))
+		if (teo_state_ok(i, drv, dev))
 			idx = i;
 	}

Re: [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs

Posted by Rafael J. Wysocki 5 months, 1 week ago

On Thu, Aug 28, 2025 at 10:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> This change follows an analogous modification of the menu governor [1].
>
> Namely, when the governor runs on a nohz_full CPU and there are no user
> space timers in the workload on that CPU, it ends up selecting idle
> states with target residency values above TICK_NSEC, or the deepest
> enabled idle state in the absence of any of those, all the time due to
> a tick_nohz_tick_stopped() check designed for running on CPUs where the
> tick is not permanently disabled.  In that case, the fact that the tick
> has been stopped means that the CPU was expected to be idle sufficiently
> long previously, so it is not unreasonable to expect it to be idle
> sufficiently long again, but this inference does not apply to nohz_full
> CPUs.
>
> In some cases, latency in the workload grows undesirably as a result of
> selecting overly deep idle states, and the workload may also consume
> more energy than necessary if the CPU does not spend enough time in the
> selected deep idle state.
>
> Address this by amending the tick_nohz_tick_stopped() check in question
> with a tick_nohz_full_cpu() one to avoid effectively ignoring all
> shallow idle states on nohz_full CPUs.  While doing so introduces a risk
> of getting stuck in a shallow idle state for a long time, that only
> affects energy efficiently, but the current behavior potentially hurts
> both energy efficiency and performance that is arguably the priority for
> nohz_full CPUs.

This change is likely to break the use case in which CPU isolation is
used for power management reasons, to prevent CPUs from running any
code and so to save energy.

In that case, going into the deepest state every time on nohz_full
CPUs is a feature, so it can't be changed unconditionally.

For this reason, I'm not going to apply this patch and I'm going to
drop the menu governor one below.

The only way to allow everyone to do what they want/need I can see
would be to add a control knob for adjusting the behavior of cpuidle
governors regarding the handling of nohz_full CPUs.

> While at it, add a comment explaining the logic in teo_state_ok().
>
> Link: https://lore.kernel.org/linux-pm/2244365.irdbgypaU6@rafael.j.wysocki/ [1]
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/governors/teo.c |   18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -227,9 +227,17 @@
>         cpu_data->total += PULSE;
>  }
>
> -static bool teo_state_ok(int i, struct cpuidle_driver *drv)
> +static bool teo_state_ok(int i, struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  {
> -       return !tick_nohz_tick_stopped() ||
> +       /*
> +        * If the scheduler tick has been stopped already, avoid selecting idle
> +        * states with target residency below the tick period length under the
> +        * assumption that the CPU is likely to be idle sufficiently long for
> +        * the tick to be stopped again (or the tick would not have been
> +        * stopped previously in the first place).  However, do not do that on
> +        * nohz_full CPUs where the above assumption does not hold.
> +        */
> +       return !tick_nohz_tick_stopped() || tick_nohz_full_cpu(dev->cpu) ||
>                 drv->states[i].target_residency_ns >= TICK_NSEC;
>  }
>
> @@ -379,7 +387,7 @@
>                                  * shallow or disabled, in which case take the
>                                  * first enabled state that is deep enough.
>                                  */
> -                               if (teo_state_ok(i, drv) &&
> +                               if (teo_state_ok(i, drv, dev) &&
>                                     !dev->states_usage[i].disable) {
>                                         idx = i;
>                                         break;
> @@ -391,7 +399,7 @@
>                         if (dev->states_usage[i].disable)
>                                 continue;
>
> -                       if (teo_state_ok(i, drv)) {
> +                       if (teo_state_ok(i, drv, dev)) {
>                                 /*
>                                  * The current state is deep enough, but still
>                                  * there may be a better one.
> @@ -460,7 +468,7 @@
>          */
>         if (drv->states[idx].target_residency_ns > duration_ns) {
>                 i = teo_find_shallower_state(drv, dev, idx, duration_ns, false);
> -               if (teo_state_ok(i, drv))
> +               if (teo_state_ok(i, drv, dev))
>                         idx = i;
>         }
>
>
>
>
>

Re: [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs

Posted by Christian Loehle 5 months, 1 week ago

On 8/29/25 20:37, Rafael J. Wysocki wrote:
> On Thu, Aug 28, 2025 at 10:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> This change follows an analogous modification of the menu governor [1].
>>
>> Namely, when the governor runs on a nohz_full CPU and there are no user
>> space timers in the workload on that CPU, it ends up selecting idle
>> states with target residency values above TICK_NSEC, or the deepest
>> enabled idle state in the absence of any of those, all the time due to
>> a tick_nohz_tick_stopped() check designed for running on CPUs where the
>> tick is not permanently disabled.  In that case, the fact that the tick
>> has been stopped means that the CPU was expected to be idle sufficiently
>> long previously, so it is not unreasonable to expect it to be idle
>> sufficiently long again, but this inference does not apply to nohz_full
>> CPUs.
>>
>> In some cases, latency in the workload grows undesirably as a result of
>> selecting overly deep idle states, and the workload may also consume
>> more energy than necessary if the CPU does not spend enough time in the
>> selected deep idle state.
>>
>> Address this by amending the tick_nohz_tick_stopped() check in question
>> with a tick_nohz_full_cpu() one to avoid effectively ignoring all
>> shallow idle states on nohz_full CPUs.  While doing so introduces a risk
>> of getting stuck in a shallow idle state for a long time, that only
>> affects energy efficiently, but the current behavior potentially hurts
>> both energy efficiency and performance that is arguably the priority for
>> nohz_full CPUs.
> 
> This change is likely to break the use case in which CPU isolation is
> used for power management reasons, to prevent CPUs from running any
> code and so to save energy.
> 
> In that case, going into the deepest state every time on nohz_full
> CPUs is a feature, so it can't be changed unconditionally.
> 
> For this reason, I'm not going to apply this patch and I'm going to
> drop the menu governor one below.
> 
> The only way to allow everyone to do what they want/need I can see
> would be to add a control knob for adjusting the behavior of cpuidle
> governors regarding the handling of nohz_full CPUs.

But then what's the advantage instead of just using
/sys/devices/system/cpu/cpuX/power/latency
for the nohz_full CPUs (if you don't want the current 'over-eagerly
selecting deepest state on nohz_full')?

Re: [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs

Posted by Rafael J. Wysocki 5 months, 1 week ago

On Sun, Aug 31, 2025 at 11:30 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/29/25 20:37, Rafael J. Wysocki wrote:
> > On Thu, Aug 28, 2025 at 10:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> This change follows an analogous modification of the menu governor [1].
> >>
> >> Namely, when the governor runs on a nohz_full CPU and there are no user
> >> space timers in the workload on that CPU, it ends up selecting idle
> >> states with target residency values above TICK_NSEC, or the deepest
> >> enabled idle state in the absence of any of those, all the time due to
> >> a tick_nohz_tick_stopped() check designed for running on CPUs where the
> >> tick is not permanently disabled.  In that case, the fact that the tick
> >> has been stopped means that the CPU was expected to be idle sufficiently
> >> long previously, so it is not unreasonable to expect it to be idle
> >> sufficiently long again, but this inference does not apply to nohz_full
> >> CPUs.
> >>
> >> In some cases, latency in the workload grows undesirably as a result of
> >> selecting overly deep idle states, and the workload may also consume
> >> more energy than necessary if the CPU does not spend enough time in the
> >> selected deep idle state.
> >>
> >> Address this by amending the tick_nohz_tick_stopped() check in question
> >> with a tick_nohz_full_cpu() one to avoid effectively ignoring all
> >> shallow idle states on nohz_full CPUs.  While doing so introduces a risk
> >> of getting stuck in a shallow idle state for a long time, that only
> >> affects energy efficiently, but the current behavior potentially hurts
> >> both energy efficiency and performance that is arguably the priority for
> >> nohz_full CPUs.
> >
> > This change is likely to break the use case in which CPU isolation is
> > used for power management reasons, to prevent CPUs from running any
> > code and so to save energy.
> >
> > In that case, going into the deepest state every time on nohz_full
> > CPUs is a feature, so it can't be changed unconditionally.
> >
> > For this reason, I'm not going to apply this patch and I'm going to
> > drop the menu governor one below.
> >
> > The only way to allow everyone to do what they want/need I can see
> > would be to add a control knob for adjusting the behavior of cpuidle
> > governors regarding the handling of nohz_full CPUs.
>
> But then what's the advantage instead of just using
> /sys/devices/system/cpu/cpuX/power/latency
> for the nohz_full CPUs (if you don't want the current 'over-eagerly
> selecting deepest state on nohz_full')?

The difference is that PM QoS prevents the CPU from entering
high-latency idle states at all (depending on the limit value),
whereas the governor will sometimes ask for a deep idle state
(depending on the actual wakeup pattern).  If wakeup patterns in the
workload change over time, it may be quite cumbersome to have to
update PM QoS every time to follow those changes (and the operator may
not even be aware of them).