The throttling logic in perf_sample_event_took() assumes the NMI is
running at the maximum allowed sample rate. While this makes sense most
of the time, it wildly overestimates the runtime of the NMI for the perf
hardware watchdog:
# bpftrace -e 'kprobe:perf_sample_event_took { \
printf("%s: cpu=%02d time_taken=%dns\n", \
strftime("%H:%M:%S.%f", nsecs), cpu(), arg0); }'
03:12:13.087003: cpu=00 time_taken=3190ns
03:12:13.486789: cpu=01 time_taken=2918ns
03:12:18.075288: cpu=03 time_taken=3308ns
03:12:19.797207: cpu=02 time_taken=2581ns
03:12:23.110317: cpu=00 time_taken=2823ns
03:12:23.510308: cpu=01 time_taken=2943ns
03:12:29.229348: cpu=03 time_taken=3669ns
03:12:31.656306: cpu=02 time_taken=3262ns
The NMI for the watchdog runs for 2-4us every ten seconds, but the
math done in perf_sample_event_took() concludes it is running for
200-400ms every second!
When the watchdog is the only PMU event running, it can take minutes to
hours of samples for the moving average to converge on the real mean,
which causes the same little "litany" of sample rate throttles to recur
every time Linux boots with the perf hardware watchdog enabled:
perf: interrupt took too long (2526 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
perf: interrupt took too long (3177 > 3157), lowering kernel.perf_event_max_sample_rate to 62000
perf: interrupt took too long (3979 > 3971), lowering kernel.perf_event_max_sample_rate to 50000
perf: interrupt took too long (4983 > 4973), lowering kernel.perf_event_max_sample_rate to 40000
This serves no purpose: it doesn't actually affect the runtime of the
watchdog NMI at all. It confuses users, because it suggests their
machine is spinning its wheels in interrupts when it isn't.
Because the watchdog NMI is so infrequent, we can avoid throttling it by
making the throttling a two-step process: load and update a timestamp
whenever we think we need to throttle, and only actually proceed to
throttle if the last time that happened was less than one second ago.
This is inelegant, but it avoids touching the hot path and preserves
current throttling behavior for real PMU use, at the cost of delaying
the throttling by a single NMI.
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
---
kernel/events/core.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 89b40e439717..0f7a7e912f55 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -623,6 +623,7 @@ core_initcall(init_events_core_sysctls);
  */
 #define NR_ACCUMULATED_SAMPLES 128
 static DEFINE_PER_CPU(u64, running_sample_length);
+static DEFINE_PER_CPU(u64, last_throttle_clock);
 
 static u64 __report_avg;
 static u64 __report_allowed;
@@ -643,6 +644,8 @@ void perf_sample_event_took(u64 sample_len_ns)
 	u64 max_len = READ_ONCE(perf_sample_allowed_ns);
 	u64 running_len;
 	u64 avg_len;
+	u64 delta;
+	u64 now;
 	u32 max;
 
 	if (max_len == 0)
@@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
 	if (avg_len <= max_len)
 		return;
 
+	/*
+	 * Very infrequent events like the perf counter hard watchdog
+	 * can trigger spurious throttling: skip throttling if the prior
+	 * NMI got here more than one second before this NMI began.
+	 */
+	now = local_clock();
+	delta = now - __this_cpu_read(last_throttle_clock);
+	__this_cpu_write(last_throttle_clock, now);
+	if (delta - sample_len_ns > NSEC_PER_SEC)
+		return;
+
 	__report_avg = avg_len;
 	__report_allowed = max_len;
 
--
2.47.3
On Tuesday 03/31 at 08:25 -0700, Calvin Owens wrote:
> @@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
> 	if (avg_len <= max_len)
> 		return;
> 
> +	/*
> +	 * Very infrequent events like the perf counter hard watchdog
> +	 * can trigger spurious throttling: skip throttling if the prior
> +	 * NMI got here more than one second before this NMI began.
> +	 */
> +	now = local_clock();
> +	delta = now - __this_cpu_read(last_throttle_clock);
> +	__this_cpu_write(last_throttle_clock, now);
> +	if (delta - sample_len_ns > NSEC_PER_SEC)
> +		return;

Bah, Sashiko caught something obvious I missed:

https://sashiko.dev/#/patchset/cover.1774969692.git.calvin%40wbinvd.org

>> When the outer handler completes, its sample_len_ns (total execution
>> time) will be strictly greater than delta (time since the inner
>> handler finished). This guarantees delta < sample_len_ns, causing the
>> subtraction to underflow to a massive positive value.
>>
>> The condition > NSEC_PER_SEC will then evaluate to true, and the outer
>> handler will erroneously skip the perf throttling logic. Should this
>> check be rewritten to avoid subtraction, perhaps by using
>> if (delta > sample_len_ns + NSEC_PER_SEC)?

The solution it proposed makes sense to me.

> 	__report_avg = avg_len;
> 	__report_allowed = max_len;
> 
> --
> 2.47.3
> 
On Tuesday 03/31 at 10:22 -0700, Calvin Owens wrote:
> On Tuesday 03/31 at 08:25 -0700, Calvin Owens wrote:
> > @@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
> > 	if (avg_len <= max_len)
> > 		return;
> > 
> > +	/*
> > +	 * Very infrequent events like the perf counter hard watchdog
> > +	 * can trigger spurious throttling: skip throttling if the prior
> > +	 * NMI got here more than one second before this NMI began.
> > +	 */
> > +	now = local_clock();
> > +	delta = now - __this_cpu_read(last_throttle_clock);
> > +	__this_cpu_write(last_throttle_clock, now);
> > +	if (delta - sample_len_ns > NSEC_PER_SEC)
> > +		return;
> 
> Bah, Sashiko caught something obvious I missed:
> 
> https://sashiko.dev/#/patchset/cover.1774969692.git.calvin%40wbinvd.org
> 
> >> When the outer handler completes, its sample_len_ns (total execution
> >> time) will be strictly greater than delta (time since the inner
> >> handler finished). This guarantees delta < sample_len_ns, causing the
> >> subtraction to underflow to a massive positive value.
> >>
> >> The condition > NSEC_PER_SEC will then evaluate to true, and the outer
> >> handler will erroneously skip the perf throttling logic. Should this
> >> check be rewritten to avoid subtraction, perhaps by using
> >> if (delta > sample_len_ns + NSEC_PER_SEC)?
> 
> The solution it proposed makes sense to me.

I replied too quickly: I think Sashiko is actually wrong.

It is assuming that sample_len_ns includes the latency of
perf_sample_event_took(), but it does not. Nesting in the middle of the
RMW of the percpu value strictly makes last_throttle_clock appear to
have happened *sooner* to the outer NMI, so I think that case works.

Thanks, apologies again for all the noise here,
Calvin

> > 	__report_avg = avg_len;
> > 	__report_allowed = max_len;
> > 
> > --
> > 2.47.3
> > 
On Tuesday 03/31 at 11:10 -0700, Calvin Owens wrote:
> On Tuesday 03/31 at 10:22 -0700, Calvin Owens wrote:
> > On Tuesday 03/31 at 08:25 -0700, Calvin Owens wrote:
> > > @@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
> > > 	if (avg_len <= max_len)
> > > 		return;
> > > 
> > > +	/*
> > > +	 * Very infrequent events like the perf counter hard watchdog
> > > +	 * can trigger spurious throttling: skip throttling if the prior
> > > +	 * NMI got here more than one second before this NMI began.
> > > +	 */
> > > +	now = local_clock();
> > > +	delta = now - __this_cpu_read(last_throttle_clock);
> > > +	__this_cpu_write(last_throttle_clock, now);
> > > +	if (delta - sample_len_ns > NSEC_PER_SEC)
> > > +		return;
> > 
> > Bah, Sashiko caught something obvious I missed:
> > 
> > https://sashiko.dev/#/patchset/cover.1774969692.git.calvin%40wbinvd.org
> > 
> > >> When the outer handler completes, its sample_len_ns (total execution
> > >> time) will be strictly greater than delta (time since the inner
> > >> handler finished). This guarantees delta < sample_len_ns, causing the
> > >> subtraction to underflow to a massive positive value.
> > >>
> > >> The condition > NSEC_PER_SEC will then evaluate to true, and the outer
> > >> handler will erroneously skip the perf throttling logic. Should this
> > >> check be rewritten to avoid subtraction, perhaps by using
> > >> if (delta > sample_len_ns + NSEC_PER_SEC)?
> > 
> > The solution it proposed makes sense to me.
> 
> I replied too quickly: I think Sashiko is actually wrong.

Last time, I swear to god. I worked this out: nesting is indeed a real
problem. The relevant RMW is:

	now = local_clock();
	delta = now - last_throttle_clock;
	last_throttle_clock = now;

Assume last_throttle_clock starts at zero.

Normal case:

	NMI >>> sample_len_ns=1000ns
		now = 1010
		delta = 1010
		last_throttle_clock = 1010
		(1010 - 1000 > NSEC_PER_SEC) == false	// does not skip throttle

Nesting case 1 (the inner NMI lands before the outer reads delta):

	NMI >>> sample_len_ns=1000ns
		now = 1010
		NMI >>> sample_len_ns=1000ns
			now = 2020
			delta = 2020
			last_throttle_clock = 2020
			(2020 - 1000 > NSEC_PER_SEC) == false	// does not skip throttle
		delta = *underflow*			// 1010 - 2020
		last_throttle_clock = 1010
		(*underflow* - 1000 > NSEC_PER_SEC) == true	// skips throttle

Nesting case 2 (the inner NMI lands after the outer reads delta):

	NMI >>> sample_len_ns=1000ns
		now = 1010
		delta = 1010
		NMI >>> sample_len_ns=1000ns
			now = 2020
			delta = 2020
			last_throttle_clock = 2020
			(2020 - 1000 > NSEC_PER_SEC) == false	// does not skip throttle
		last_throttle_clock = 1010
		(1010 - 1000 > NSEC_PER_SEC) == false	// does not skip throttle,
							// but the percpu clock has
							// moved backwards past 2020

I think the below deals with it. But I will wait to hear back before
sending a V2.

Thanks,
Calvin

---
 kernel/events/core.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 89b40e439717..c51d61fbb03b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -623,6 +623,7 @@ core_initcall(init_events_core_sysctls);
  */
 #define NR_ACCUMULATED_SAMPLES 128
 static DEFINE_PER_CPU(u64, running_sample_length);
+static DEFINE_PER_CPU(u64, last_throttle_clock);
 
 static u64 __report_avg;
 static u64 __report_allowed;
@@ -643,6 +644,8 @@ void perf_sample_event_took(u64 sample_len_ns)
 	u64 max_len = READ_ONCE(perf_sample_allowed_ns);
 	u64 running_len;
 	u64 avg_len;
+	u64 last;
+	u64 now;
 	u32 max;
 
 	if (max_len == 0)
@@ -663,6 +666,18 @@ void perf_sample_event_took(u64 sample_len_ns)
 	if (avg_len <= max_len)
 		return;
 
+	/*
+	 * Very infrequent events like the perf counter hard watchdog
+	 * can trigger spurious throttling: skip throttling if the prior
+	 * NMI got here more than one second before this NMI began. But
+	 * if NMIs are nesting, never skip throttling.
+	 */
+	now = local_clock();
+	last = __this_cpu_read(last_throttle_clock);
+	if (this_cpu_try_cmpxchg(last_throttle_clock, last, now) &&
+	    now - last > NSEC_PER_SEC)
+		return;
+
 	__report_avg = avg_len;
 	__report_allowed = max_len;
 
--
2.47.3
On Tuesday 03/31 at 10:22 -0700, Calvin Owens wrote:
> On Tuesday 03/31 at 08:25 -0700, Calvin Owens wrote:
> > @@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
> > 	if (avg_len <= max_len)
> > 		return;
> > 
> > +	/*
> > +	 * Very infrequent events like the perf counter hard watchdog
> > +	 * can trigger spurious throttling: skip throttling if the prior
> > +	 * NMI got here more than one second before this NMI began.
> > +	 */
> > +	now = local_clock();
> > +	delta = now - __this_cpu_read(last_throttle_clock);
> > +	__this_cpu_write(last_throttle_clock, now);
> > +	if (delta - sample_len_ns > NSEC_PER_SEC)
> > +		return;

Apologies for replying twice in a row... Sashiko made a second useful
observation:

>> There appears to be no upper bound on sample_len_ns itself. If an
>> event takes 5 seconds to run but is configured to fire only once every
>> 7 seconds, the idle time will be 2 seconds.
>>
>> Because 2 seconds is > NSEC_PER_SEC, the throttling logic is skipped
>> entirely. This defeats the sysctl_perf_cpu_time_max_percent safeguard
>> and allows an event to monopolize the CPU in NMI/IRQ context for
>> seconds at a time without ever being throttled.

I'm skeptical that would ever actually happen, but I think I can
address it by adding:

	&& sample_len_ns < NSEC_PER_SEC

...to the skip throttle condition?

In fairness to the LLM skeptics, the feedback Sashiko gave on patch 1/2
is absolute nonsense.

Thanks,
Calvin