From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Occasionally, the exit latency of the idle state selected by the menu
governor may exceed the PM QoS CPU wakeup latency limit. Namely, if the
scheduler tick has been stopped already and predicted_ns is greater than
the tick period length, the governor may return an idle state whose exit
latency exceeds latency_req because that decision is made before
checking the current idle state's exit latency.
For instance, say that there are 3 idle states, 0, 1, and 2. For idle
states 0 and 1, the exit latency is equal to the target residency and
the values are 0 and 5 us, respectively. State 2 is deeper and has the
exit latency and target residency of 200 us and 2 ms (which is greater
than the tick period length), respectively.
Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
limit is 20 us. After the first two iterations of the main loop in
menu_select(), idx becomes 1 and in the third iteration of it the target
residency of the current state (state 2) is greater than predicted_ns.
State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
so the check on whether or not the tick has been stopped is done. Say
that the tick has been stopped already and there are no imminent timers
(that is, delta_tick is greater than the target residency of state 2).
In that case, idx becomes 2 and it is returned immediately, but the exit
latency of state 2 exceeds the latency limit.
Address this issue by modifying the code to compare the exit latency of
the current idle state (idle state i) with the latency limit before
comparing its target residency with predicted_ns, which allows another
exit_latency_ns check, now redundant, to be dropped.
However, after the above change, latency_req can no longer be set to the
predicted_ns value, as has been done since commit 38f83090f515 ("cpuidle:
menu: Remove iowait influence"), because that may cause a polling state
to be returned prematurely.
In the context of the previous example say that predicted_ns is 3000 and
the PM QoS latency limit is still 20 us. Additionally, say that idle
state 0 is a polling one. Moving the exit_latency_ns check before the
target_residency_ns one causes the loop to terminate in the second
iteration, before the target_residency_ns check, so idle state 0 will be
returned even though previously state 1 would be returned if there were
no imminent timers.
For this reason, remove the assignment of the predicted_ns value to
latency_req from the code.
Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively")
Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/cpuidle/governors/menu.c | 29 ++++++++++++-----------------
1 file changed, 12 insertions(+), 17 deletions(-)
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -287,20 +287,15 @@
return 0;
}
- if (tick_nohz_tick_stopped()) {
- /*
- * If the tick is already stopped, the cost of possible short
- * idle duration misprediction is much higher, because the CPU
- * may be stuck in a shallow idle state for a long time as a
- * result of it. In that case say we might mispredict and use
- * the known time till the closest timer event for the idle
- * state selection.
- */
- if (predicted_ns < TICK_NSEC)
- predicted_ns = data->next_timer_ns;
- } else if (latency_req > predicted_ns) {
- latency_req = predicted_ns;
- }
+ /*
+ * If the tick is already stopped, the cost of possible short idle
+ * duration misprediction is much higher, because the CPU may be stuck
+ * in a shallow idle state for a long time as a result of it. In that
+ * case, say we might mispredict and use the known time till the closest
+ * timer event for the idle state selection.
+ */
+ if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
+ predicted_ns = data->next_timer_ns;
/*
* Find the idle state with the lowest power while satisfying
@@ -316,13 +311,15 @@
if (idx == -1)
idx = i; /* first enabled state */
+ if (s->exit_latency_ns > latency_req)
+ break;
+
if (s->target_residency_ns > predicted_ns) {
/*
* Use a physical idle state, not busy polling, unless
* a timer is going to trigger soon enough.
*/
if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
- s->exit_latency_ns <= latency_req &&
s->target_residency_ns <= data->next_timer_ns) {
predicted_ns = s->target_residency_ns;
idx = i;
@@ -354,8 +351,6 @@
return idx;
}
- if (s->exit_latency_ns > latency_req)
- break;
idx = i;
}
Hi Rafael,
Recent email communications about other patches had me
looking at this one again.
On 2025.08.13 03:26 Rafael wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
... snip...
> However, after the above change, latency_req cannot take the predicted_ns
> value any more, which takes place after commit 38f83090f515 ("cpuidle:
> menu: Remove iowait influence"), because it may cause a polling state
> to be returned prematurely.
>
> In the context of the previous example say that predicted_ns is 3000 and
> the PM QoS latency limit is still 20 us. Additionally, say that idle
> state 0 is a polling one. Moving the exit_latency_ns check before the
> target_residency_ns one causes the loop to terminate in the second
> iteration, before the target_residency_ns check, so idle state 0 will be
> returned even though previously state 1 would be returned if there were
> no imminent timers.
>
> For this reason, remove the assignment of the predicted_ns value to
> latency_req from the code.
Which is okay for timer-based workflows,
but what about non-timer-based, or interrupt-driven, workflows?
Under conditions where idle state 0 (polling) would be used a lot,
I am observing about an 11% throughput regression with this patch,
and idle state 0 (polling) usage going from 20% to 0%.
This is from my testing of kernels 6.17-rc1, rc2, and rc3 in August and
September, and again now; I missed this in August/September:
779b1a1cb13a cpuidle: governors: menu: Avoid selecting states with too much latency - v6.17-rc3
fa3fa55de0d6 cpuidle: governors: menu: Avoid using invalid recent intervals data - v6.17-rc2
baseline reference: v6.17-rc1
The teo governor was included also; as far as I can recall, its response has recently always been similar to rc3.
Three graphs are attached.
Data is sampled once every 20 seconds; the test is started after the first
idle sample, and at least one sample is taken after the system returns to
idle after the test. The faster the test runs, the better.
Test computer:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
Distro: Ubuntu 24.04.3, server, no desktop GUI.
CPU frequency scaling driver: intel_pstate
HWP: disabled.
CPU frequency scaling governor: performance
Idle driver: intel_idle
Idle governor: menu (except teo for one compare test run)
Idle states: 4: name : description:
state0/name:POLL desc:CPUIDLE CORE POLL IDLE
state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
Hi Doug,
On Thursday, October 23, 2025 5:05:44 AM CEST Doug Smythies wrote:
> Hi Rafael,
>
> Recent email communications about other patches had me
> looking at this one again.
>
> ... snip ...
> Which is okay for timer-based workflow,
> but what about non-timer based, or interrupt driven, workflow?
>
> Under conditions where idle state 0, or Polling, would be used a lot,
> I am observing about a 11 % throughput regression with this patch
> And idle state 0, polling, usage going from 20% to 0%.
>
> ... snip ...
OK, so since the exit latency of an idle state cannot exceed its target
residency, the appended change (on top of 6.18-rc2) should make the throughput
regression go away.
---
drivers/cpuidle/governors/menu.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -321,10 +321,13 @@ static int menu_select(struct cpuidle_dr
/*
* Use a physical idle state, not busy polling, unless a timer
- * is going to trigger soon enough.
+ * is going to trigger soon enough or the exit latency of the
+ * idle state in question is greater than the predicted idle
+ * duration.
*/
if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
- s->target_residency_ns <= data->next_timer_ns) {
+ s->target_residency_ns <= data->next_timer_ns &&
+ s->exit_latency_ns <= predicted_ns) {
predicted_ns = s->target_residency_ns;
idx = i;
break;
On 2025.10.23 07:51 Rafael wrote:
> Hi Doug,
>
> On Thursday, October 23, 2025 5:05:44 AM CEST Doug Smythies wrote:
>> ... snip ...
>
> OK, so since the exit latency of an idle state cannot exceed its target
> residency, the appended change (on top of 6.18-rc2) should make the throughput
> regression go away.
Indeed, the patch you appended below did make the
throughput regression go away.
Thank you.
>
> ---
> drivers/cpuidle/governors/menu.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -321,10 +321,13 @@ static int menu_select(struct cpuidle_dr
>
> /*
> * Use a physical idle state, not busy polling, unless a timer
> - * is going to trigger soon enough.
> + * is going to trigger soon enough or the exit latency of the
> + * idle state in question is greater than the predicted idle
> + * duration.
> */
> if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> - s->target_residency_ns <= data->next_timer_ns) {
> + s->target_residency_ns <= data->next_timer_ns &&
> + s->exit_latency_ns <= predicted_ns) {
> predicted_ns = s->target_residency_ns;
> idx = i;
> break;
On Thu, Oct 23, 2025 at 6:03 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> On 2025.10.23 07:51 Rafael wrote:
>
> > ... snip ...
>
> Indeed, the patch you appended below did make the
> throughput regression go away.
>
> Thank you.
OK, this is not an unreasonable change, so let me add a changelog to
it and submit it.
Le Wed, Aug 13, 2025 at 12:25:58PM +0200, Rafael J. Wysocki a écrit :
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> ... snip ...
Too late I guess but meanwhile:
Acked-by: Frederic Weisbecker <frederic@kernel.org>
--
Frederic Weisbecker
SUSE Labs
On 8/13/25 11:25, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> ... snip ...
> Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
> limit is 20 us. After the first two iterations of the main loop in
> menu_select(), idx becomes 1 and in the third iteration of it the target
Can drop "of it" here?
> residency of the current state (state 2) is greater than predicted_ns.
> State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
> so the check on whether or not the tick has been stopped is done. Say
> that the tick has been stopped already and there are no imminent timers
> (that is, delta_tick is greater than the target residency of state 2).
> In that case, idx becomes 2 and it is returned immediately, but the exit
> latency of state 2 exceeds the latency limit.
>
> Address this issue by modifying the code to compare the exit latency of
> the current idle state (idle state i) with the latency limit before
> comparing its target residecy with predicted_ns, which allows one
residency
> more exit_latency_ns check that becomes redundant to be dropped.
>
> ... snip ...
Good catch!
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
On Wed, Aug 13, 2025 at 9:13 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/13/25 11:25, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > ... snip ...
> > Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
> > limit is 20 us. After the first two iterations of the main loop in
> > menu_select(), idx becomes 1 and in the third iteration of it the target
> Can drop "of it" here?
But it also doesn't really hurt I think.
> > residency of the current state (state 2) is greater than predicted_ns.
> > ... snip ...
> > Address this issue by modifying the code to compare the exit latency of
> > the current idle state (idle state i) with the latency limit before
> > comparing its target residecy with predicted_ns, which allows one
> residency
Fixed while applying.
> > more exit_latency_ns check that becomes redundant to be dropped.
> >
> > ... snip ...
> Good catch!
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Thank you!