From: Ionut Nechita <ionut_n2001@yahoo.com>

Hi,

This v2 patch addresses high wakeup latency in the menu cpuidle governor
on modern platforms with high C-state exit latencies.

Changes in v2:
==============

Based on Christian Loehle's feedback, I've simplified the approach to use
min(predicted_ns, data->next_timer_ns) instead of the 25% safety margin
from v1.

The simpler approach is cleaner and achieves the same goal: preventing the
governor from selecting excessively deep C-states when the prediction
suggests a short idle period but next_timer_ns is large (e.g., 10ms).

I will test both approaches (simple min vs 25% margin) and provide
detailed comparison data including:
- C-state residency tables
- Usage statistics
- Idle miss counts (above/below)
- Actual latency measurements

Thank you Christian for the valuable feedback and for pointing out that
the simpler approach may be sufficient.

Background:
===========

On Intel server platforms from 2022 onwards (Sapphire Rapids, Granite
Rapids), we observe excessive wakeup latencies (~150us) in
network-sensitive workloads when using the menu governor with NOHZ_FULL
enabled.

The issue stems from the governor using next_timer_ns directly when the
tick is already stopped and predicted_ns < TICK_NSEC. This causes
selection of very deep package C-states (PC6) even when the prediction
suggests a much shorter idle duration.

On platforms with high C-state exit latencies (Intel SPR: 190us for C6,
or systems with large C-state gaps like C2 36us → C3 700us with 350us
exit latency), this results in significant wakeup penalties.

Testing:
========

Initial testing on Sapphire Rapids shows a 5x latency reduction
(151us → ~30us). I will provide comprehensive test results comparing
baseline, simple min(), and the 25% margin approach.

Ionut Nechita (1):
  cpuidle: menu: Use min() to prevent deep C-states when tick is stopped

 drivers/cpuidle/governors/menu.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

-- 
2.52.0
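In rough terms, the v2 idea described above amounts to something like the
sketch below. This is an illustration only, not the actual hunk from the
patch; the placement inside menu_select() and the surrounding tick-stopped
condition are assumptions:

	/*
	 * Sketch only: when the tick is already stopped and the prediction
	 * is shorter than a tick, clamp the value used for idle-state
	 * selection to the shorter of the prediction and the time until the
	 * next timer event, instead of switching to next_timer_ns outright.
	 */
	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
		predicted_ns = min(predicted_ns, data->next_timer_ns);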
On 1/22/26 08:09, Ionut Nechita (Sunlight Linux) wrote:
> From: Ionut Nechita <ionut_n2001@yahoo.com>
>
> Hi,
>
> This v2 patch addresses high wakeup latency in the menu cpuidle governor
> on modern platforms with high C-state exit latencies.
>
> Changes in v2:
> ==============
>
> Based on Christian Loehle's feedback, I've simplified the approach to use
> min(predicted_ns, data->next_timer_ns) instead of the 25% safety margin
> from v1.
>
> The simpler approach is cleaner and achieves the same goal: preventing the
> governor from selecting excessively deep C-states when the prediction
> suggests a short idle period but next_timer_ns is large (e.g., 10ms).
>
> I will test both approaches (simple min vs 25% margin) and provide
> detailed comparison data including:
> - C-state residency tables
> - Usage statistics
> - Idle miss counts (above/below)
> - Actual latency measurements
>
> Thank you Christian for the valuable feedback and for pointing out that
> the simpler approach may be sufficient.
>

It was more of a question than a suggestion outright... And I still have
more of them, quoting v1:

+	 * Add a 25% safety margin to the prediction to reduce the risk of
+	 * selecting too shallow state, but clamp to next_timer to avoid
+	 * selecting unnecessarily deep states.

but the safety margin was on top of the prediction, i.e. it skewed towards
deeper states (not shallower ones).

You also measured 150us wakeup latency, does this match the reported exit
latency for your platform (roughly)?
What do the platform states look like for you?
A trace or cpuidle sysfs dump pre and post workload would really help to
understand the situation.

Also regarding NOHZ_FULL, does that make a difference for your workload?
That would sort of imply very few idle wakeups (otherwise that bit of tick
overhead probably wouldn't matter). Is the NOHZ_FULL gain only in latency?

Frankly, if there are relatively strict latency requirements on the system
you need to let cpuidle know via pm qos or dma_latency....
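For reference, the /dev/cpu_dma_latency interface mentioned here takes a
32-bit latency bound in microseconds, and the constraint stays active only
for as long as the file descriptor is held open. A minimal user-space
sketch; the 20us bound is an arbitrary example, not a value from this
thread:

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	int main(void)
	{
		int32_t max_latency_us = 20;	/* example bound only */
		int fd = open("/dev/cpu_dma_latency", O_WRONLY);

		if (fd < 0)
			return 1;
		if (write(fd, &max_latency_us, sizeof(max_latency_us)) < 0) {
			close(fd);
			return 1;
		}
		pause();	/* constraint is dropped once the fd is closed */
		close(fd);
		return 0;
	}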
From: Ionut Nechita <sunlightlinux@gmail.com>

On Thu, Jan 22 2026 at 08:49, Christian Loehle wrote:
> It was more of a question than a suggestion outright... And I still have
> more of them, quoting v1:

Thank you for the detailed feedback. Let me provide more context about the
workload and the platforms where I observed this issue.

> You also measured 150us wakeup latency, does this match the reported exit
> latency for your platform (roughly)?
> What do the platform states look like for you?

Yes, the measured latency matches the reported exit latencies. Here are
the platforms I've tested:

1. Intel Xeon Gold 6443N (Sapphire Rapids):
   - C6 state: 190us latency, 600us residency target
   - C1E state: 2us latency, 4us residency target
   - Driver: intel_idle

2. AMD Ryzen 9 5900HS (laptop):
   - C3 state: 350us latency, 700us residency target
   - C2 state: 18us latency, 36us residency target
   - Driver: acpi_idle

The problem manifests primarily on the Sapphire Rapids platform, where C6
has a 190us exit latency.

> Also regarding NOHZ_FULL, does that make a difference for your workload?

Yes, absolutely. The workload context is:
- PREEMPT_RT kernel (realtime)
- Isolated cores (isolcpus=)
- NOHZ_FULL enabled on isolated cores
- Inter-core communication latency testing with qperf
- kthreads and IRQ affinity set to non-isolated cores

The scenario: Core A (isolated, NOHZ_FULL) sends a message to Core B (also
isolated, NOHZ_FULL, currently idle). Core B enters C6 during idle, then
when the message arrives, the 190us exit latency dominates the response
time. This is unacceptable for realtime workloads.

> Frankly, if there are relatively strict latency requirements on the system
> you need to let cpuidle know via pm qos or dma_latency....

I considered PM QoS and /dev/cpu_dma_latency, but they have limitations
for this use case:

1. Global PM QoS affects all cores, not just the isolated ones
2. Per-task PM QoS requires application modifications
3. /dev/cpu_dma_latency is system-wide, not per-core

For isolated cores with NOHZ_FULL in a realtime environment, we want the
governor to make smarter decisions based on the actual predicted idle time
rather than relying on next_timer_ns, which can be arbitrarily large on
tickless cores.

> A trace or cpuidle sysfs dump pre and post workload would really help to
> understand the situation.

I will collect and provide:
- ftrace cpuidle event traces
- Complete sysfs cpuidle dumps pre/post workload
- C-state residency and usage statistics
- Detailed qperf latency measurements

Regarding the safety margin question from v1: you're right that I need to
clarify the logic. The goal is to clamp the upper bound to avoid
unnecessarily deep states when the prediction suggests a short idle
period, while still respecting the prediction for target residency
selection.

I'll send a follow-up with the detailed trace data and measurements.

Thanks for your patience and valuable feedback,
Ionut
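For the promised statistics, the per-state counters live under
/sys/devices/system/cpu/cpuN/cpuidle/stateM/ (name, usage, time, above,
below). Below is a minimal sketch that dumps them for one CPU; the CPU
number and output format are arbitrary choices for illustration, not part
of the patch or the measurements referred to above:

	#include <stdio.h>
	#include <string.h>

	/* Read one cpuidle sysfs attribute into buf; leave buf alone on error. */
	static int read_attr(int cpu, int state, const char *attr,
			     char *buf, size_t len)
	{
		char path[128];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/%s",
			 cpu, state, attr);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (!fgets(buf, len, f)) {
			fclose(f);
			return -1;
		}
		fclose(f);
		buf[strcspn(buf, "\n")] = '\0';
		return 0;
	}

	int main(void)
	{
		int cpu = 0;	/* pick the isolated CPU of interest */

		for (int state = 0; ; state++) {
			char name[64] = "-", usage[32] = "-", time[32] = "-";
			char above[32] = "-", below[32] = "-";

			if (read_attr(cpu, state, "name", name, sizeof(name)))
				break;	/* no more states on this CPU */
			read_attr(cpu, state, "usage", usage, sizeof(usage));
			read_attr(cpu, state, "time", time, sizeof(time));
			read_attr(cpu, state, "above", above, sizeof(above));
			read_attr(cpu, state, "below", below, sizeof(below));
			printf("cpu%d %-8s usage=%s time_us=%s above=%s below=%s\n",
			       cpu, name, usage, time, above, below);
		}
		return 0;
	}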