From: Ionut Nechita <ionut_n2001@yahoo.com>

Hi,

This v2 patch addresses high wakeup latency in the menu cpuidle governor
on modern platforms with high C-state exit latencies.

Changes in v2:
==============

Based on Christian Loehle's feedback, I've simplified the approach to use
min(predicted_ns, data->next_timer_ns) instead of the 25% safety margin
from v1.

The simpler approach is cleaner and achieves the same goal: preventing the
governor from selecting excessively deep C-states when the prediction
suggests a short idle period but next_timer_ns is large (e.g., 10ms).

I will test both approaches (simple min vs 25% margin) and provide
detailed comparison data including:
- C-state residency tables
- Usage statistics
- Idle miss counts (above/below)
- Actual latency measurements

Thank you Christian for the valuable feedback and for pointing out that
the simpler approach may be sufficient.

Background:
===========

On Intel server platforms from 2022 onwards (Sapphire Rapids, Granite
Rapids), we observe excessive wakeup latencies (~150us) in
network-sensitive workloads when using the menu governor with NOHZ_FULL
enabled.

The issue stems from the governor using next_timer_ns directly when the
tick is already stopped and predicted_ns < TICK_NSEC. This causes
selection of very deep package C-states (PC6) even when the prediction
suggests a much shorter idle duration.

On platforms with high C-state exit latencies (Intel SPR: 190us for C6,
or systems with large C-state gaps like C2 36us → C3 700us with 350us
exit latency), this results in significant wakeup penalties.

Testing:
========

Initial testing on Sapphire Rapids shows a 5x latency reduction
(151us → ~30us). I will provide comprehensive test results comparing
baseline, simple min(), and the 25% margin approach.

Ionut Nechita (1):
  cpuidle: menu: Use min() to prevent deep C-states when tick is stopped

 drivers/cpuidle/governors/menu.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

-- 
2.52.0
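In rough terms, the v2 idea described above amounts to something like the
sketch below. This is an illustration only, not the actual hunk from the
patch; the placement inside menu_select() and the surrounding tick-stopped
condition are assumptions:

	/*
	 * Sketch only: when the tick is already stopped and the prediction
	 * is shorter than a tick, clamp the value used for idle-state
	 * selection to the shorter of the prediction and the time until the
	 * next timer event, instead of switching to next_timer_ns outright.
	 */
	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
		predicted_ns = min(predicted_ns, data->next_timer_ns);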
On 1/22/26 08:09, Ionut Nechita (Sunlight Linux) wrote:
> From: Ionut Nechita <ionut_n2001@yahoo.com>
>
> Hi,
>
> This v2 patch addresses high wakeup latency in the menu cpuidle governor
> on modern platforms with high C-state exit latencies.
>
> Changes in v2:
> ==============
>
> Based on Christian Loehle's feedback, I've simplified the approach to use
> min(predicted_ns, data->next_timer_ns) instead of the 25% safety margin
> from v1.
>
> The simpler approach is cleaner and achieves the same goal: preventing the
> governor from selecting excessively deep C-states when the prediction
> suggests a short idle period but next_timer_ns is large (e.g., 10ms).
>
> I will test both approaches (simple min vs 25% margin) and provide
> detailed comparison data including:
> - C-state residency tables
> - Usage statistics
> - Idle miss counts (above/below)
> - Actual latency measurements
>
> Thank you Christian for the valuable feedback and for pointing out that
> the simpler approach may be sufficient.
>

It was more of a question than a suggestion outright... And I still have
more of them, quoting v1:

+	 * Add a 25% safety margin to the prediction to reduce the risk of
+	 * selecting too shallow state, but clamp to next_timer to avoid
+	 * selecting unnecessarily deep states.

but the safety margin was on top of the prediction, i.e. it skewed towards
deeper states (not shallower ones).

You also measured 150us wakeup latency, does this match the reported exit
latency for your platform (roughly)?
What do the platform states look like for you?
A trace or cpuidle sysfs dump pre and post workload would really help to
understand the situation.

Also regarding NOHZ_FULL, does that make a difference for your workload?
That would sort of imply very few idle wakeups (otherwise that bit of tick
overhead probably wouldn't matter). Is the NOHZ_FULL gain only in latency?

Frankly, if there are relatively strict latency requirements on the system
you need to let cpuidle know via pm qos or dma_latency....
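For reference, the /dev/cpu_dma_latency interface mentioned here takes a
32-bit latency bound in microseconds, and the constraint stays active only
for as long as the file descriptor is held open. A minimal user-space
sketch; the 20us bound is an arbitrary example, not a value from this
thread:

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	int main(void)
	{
		int32_t max_latency_us = 20;	/* example bound only */
		int fd = open("/dev/cpu_dma_latency", O_WRONLY);

		if (fd < 0)
			return 1;
		if (write(fd, &max_latency_us, sizeof(max_latency_us)) < 0) {
			close(fd);
			return 1;
		}
		pause();	/* constraint is dropped once the fd is closed */
		close(fd);
		return 0;
	}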
From: Ionut Nechita <sunlightlinux@gmail.com>

On Thu, Jan 22 2026 at 08:49, Christian Loehle wrote:
> It was more of a question than a suggestion outright... And I still have
> more of them, quoting v1:

Thank you for the detailed feedback. Let me provide more context about the
workload and the platforms where I observed this issue.

> You also measured 150us wakeup latency, does this match the reported exit
> latency for your platform (roughly)?
> What do the platform states look like for you?

Yes, the measured latency matches the reported exit latencies. Here are
the platforms I've tested:

1. Intel Xeon Gold 6443N (Sapphire Rapids):
   - C6 state: 190us latency, 600us residency target
   - C1E state: 2us latency, 4us residency target
   - Driver: intel_idle

2. AMD Ryzen 9 5900HS (laptop):
   - C3 state: 350us latency, 700us residency target
   - C2 state: 18us latency, 36us residency target
   - Driver: acpi_idle

The problem manifests primarily on the Sapphire Rapids platform, where C6
has a 190us exit latency.

> Also regarding NOHZ_FULL, does that make a difference for your workload?

Yes, absolutely. The workload context is:
- PREEMPT_RT kernel (realtime)
- Isolated cores (isolcpus=)
- NOHZ_FULL enabled on isolated cores
- Inter-core communication latency testing with qperf
- kthreads and IRQ affinity set to non-isolated cores

The scenario: Core A (isolated, NOHZ_FULL) sends a message to Core B (also
isolated, NOHZ_FULL, currently idle). Core B enters C6 during idle, then
when the message arrives, the 190us exit latency dominates the response
time. This is unacceptable for realtime workloads.

> Frankly, if there are relatively strict latency requirements on the system
> you need to let cpuidle know via pm qos or dma_latency....

I considered PM QoS and /dev/cpu_dma_latency, but they have limitations
for this use case:

1. Global PM QoS affects all cores, not just the isolated ones
2. Per-task PM QoS requires application modifications
3. /dev/cpu_dma_latency is system-wide, not per-core

For isolated cores with NOHZ_FULL in a realtime environment, we want the
governor to make smarter decisions based on the actual predicted idle time
rather than relying on next_timer_ns, which can be arbitrarily large on
tickless cores.

> A trace or cpuidle sysfs dump pre and post workload would really help to
> understand the situation.

I will collect and provide:
- ftrace cpuidle event traces
- Complete sysfs cpuidle dumps pre/post workload
- C-state residency and usage statistics
- Detailed qperf latency measurements

Regarding the safety margin question from v1: you're right that I need to
clarify the logic. The goal is to clamp the upper bound to avoid
unnecessarily deep states when the prediction suggests a short idle
period, while still respecting the prediction for target residency
selection.

I'll send a follow-up with the detailed trace data and measurements.

Thanks for your patience and valuable feedback,
Ionut
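For the promised statistics, the per-state counters live under
/sys/devices/system/cpu/cpuN/cpuidle/stateM/ (name, usage, time, above,
below). Below is a minimal sketch that dumps them for one CPU; the CPU
number and output format are arbitrary choices for illustration, not part
of the patch or the measurements referred to above:

	#include <stdio.h>
	#include <string.h>

	/* Read one cpuidle sysfs attribute into buf; leave buf alone on error. */
	static int read_attr(int cpu, int state, const char *attr,
			     char *buf, size_t len)
	{
		char path[128];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/%s",
			 cpu, state, attr);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (!fgets(buf, len, f)) {
			fclose(f);
			return -1;
		}
		fclose(f);
		buf[strcspn(buf, "\n")] = '\0';
		return 0;
	}

	int main(void)
	{
		int cpu = 0;	/* pick the isolated CPU of interest */

		for (int state = 0; ; state++) {
			char name[64] = "-", usage[32] = "-", time[32] = "-";
			char above[32] = "-", below[32] = "-";

			if (read_attr(cpu, state, "name", name, sizeof(name)))
				break;	/* no more states on this CPU */
			read_attr(cpu, state, "usage", usage, sizeof(usage));
			read_attr(cpu, state, "time", time, sizeof(time));
			read_attr(cpu, state, "above", above, sizeof(above));
			read_attr(cpu, state, "below", below, sizeof(below));
			printf("cpu%d %-8s usage=%s time_us=%s above=%s below=%s\n",
			       cpu, name, usage, time, above, below);
		}
		return 0;
	}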