The throttling logic in perf_sample_event_took() assumes the NMI is
running at the maximum allowed sample rate. While this makes sense most
of the time, it wildly overestimates the runtime of the NMI for the perf
hardware watchdog:
# bpftrace -e 'kprobe:perf_sample_event_took { \
printf("%s: cpu=%02d time_taken=%dns\n", \
strftime("%H:%M:%S.%f", nsecs), cpu(), arg0); }'
03:12:13.087003: cpu=00 time_taken=3190ns
03:12:13.486789: cpu=01 time_taken=2918ns
03:12:18.075288: cpu=03 time_taken=3308ns
03:12:19.797207: cpu=02 time_taken=2581ns
03:12:23.110317: cpu=00 time_taken=2823ns
03:12:23.510308: cpu=01 time_taken=2943ns
03:12:29.229348: cpu=03 time_taken=3669ns
03:12:31.656306: cpu=02 time_taken=3262ns
The NMI for the watchdog runs for 2-4us every ten seconds, but the
math done in perf_sample_event_took() concludes it is running for
200-400ms every second!
When the watchdog is the only PMU event running, it can take minutes to
hours of samples for the moving average to converge on the real mean,
which causes the same little "litany" of sample rate throttles to recur
every time Linux boots with the perf hardware watchdog enabled:
perf: interrupt took too long (2526 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
perf: interrupt took too long (3177 > 3157), lowering kernel.perf_event_max_sample_rate to 62000
perf: interrupt took too long (3979 > 3971), lowering kernel.perf_event_max_sample_rate to 50000
perf: interrupt took too long (4983 > 4973), lowering kernel.perf_event_max_sample_rate to 40000
This serves no purpose: it doesn't actually affect the runtime of the
watchdog NMI at all. It confuses users, because it suggests their
machine is spinning its wheels in interrupts when it isn't.
Because the watchdog NMI is so infrequent, we can avoid throttling it by
making the throttling a two-step process: load and update a timestamp
whenever we think we need to throttle, and only actually proceed to
throttle if the last time that happened was less than one second ago.
This is inelegant, but it avoids touching the hot path and preserves
current throttling behavior for real PMU use, at the cost of delaying
the throttling by a single NMI.
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
---
kernel/events/core.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 89b40e439717..0f7a7e912f55 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -623,6 +623,7 @@ core_initcall(init_events_core_sysctls);
  */
 #define NR_ACCUMULATED_SAMPLES 128
 static DEFINE_PER_CPU(u64, running_sample_length);
+static DEFINE_PER_CPU(u64, last_throttle_clock);
 
 static u64 __report_avg;
 static u64 __report_allowed;
@@ -643,6 +644,8 @@ void perf_sample_event_took(u64 sample_len_ns)
 	u64 max_len = READ_ONCE(perf_sample_allowed_ns);
 	u64 running_len;
 	u64 avg_len;
+	u64 delta;
+	u64 now;
 	u32 max;
 
 	if (max_len == 0)
@@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
 	if (avg_len <= max_len)
 		return;
 
+	/*
+	 * Very infrequent events like the perf counter hard watchdog
+	 * can trigger spurious throttling: skip throttling if the prior
+	 * NMI got here more than one second before this NMI began.
+	 */
+	now = local_clock();
+	delta = now - __this_cpu_read(last_throttle_clock);
+	__this_cpu_write(last_throttle_clock, now);
+	if (delta - sample_len_ns > NSEC_PER_SEC)
+		return;
+
 	__report_avg = avg_len;
 	__report_allowed = max_len;
 
--
2.47.3
On Tuesday 03/31 at 08:25 -0700, Calvin Owens wrote:
> @@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
> 	if (avg_len <= max_len)
> 		return;
> 
> +	/*
> +	 * Very infrequent events like the perf counter hard watchdog
> +	 * can trigger spurious throttling: skip throttling if the prior
> +	 * NMI got here more than one second before this NMI began.
> +	 */
> +	now = local_clock();
> +	delta = now - __this_cpu_read(last_throttle_clock);
> +	__this_cpu_write(last_throttle_clock, now);
> +	if (delta - sample_len_ns > NSEC_PER_SEC)
> +		return;

Bah, Sashiko caught something obvious I missed:

https://sashiko.dev/#/patchset/cover.1774969692.git.calvin%40wbinvd.org

>> When the outer handler completes, its sample_len_ns (total execution
>> time) will be strictly greater than delta (time since the inner
>> handler finished). This guarantees delta < sample_len_ns, causing the
>> subtraction to underflow to a massive positive value.
>>
>> The condition > NSEC_PER_SEC will then evaluate to true, and the outer
>> handler will erroneously skip the perf throttling logic. Should this
>> check be rewritten to avoid subtraction, perhaps by using
>> if (delta > sample_len_ns + NSEC_PER_SEC)?

The solution it proposed makes sense to me.

> 	__report_avg = avg_len;
> 	__report_allowed = max_len;
> 
> --
> 2.47.3
> 
On Tuesday 03/31 at 10:22 -0700, Calvin Owens wrote:
> On Tuesday 03/31 at 08:25 -0700, Calvin Owens wrote:
> > @@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
> > 	if (avg_len <= max_len)
> > 		return;
> > 
> > +	/*
> > +	 * Very infrequent events like the perf counter hard watchdog
> > +	 * can trigger spurious throttling: skip throttling if the prior
> > +	 * NMI got here more than one second before this NMI began.
> > +	 */
> > +	now = local_clock();
> > +	delta = now - __this_cpu_read(last_throttle_clock);
> > +	__this_cpu_write(last_throttle_clock, now);
> > +	if (delta - sample_len_ns > NSEC_PER_SEC)
> > +		return;
> 
> Bah, Sashiko caught something obvious I missed:
> 
> https://sashiko.dev/#/patchset/cover.1774969692.git.calvin%40wbinvd.org
> 
> >> When the outer handler completes, its sample_len_ns (total execution
> >> time) will be strictly greater than delta (time since the inner
> >> handler finished). This guarantees delta < sample_len_ns, causing the
> >> subtraction to underflow to a massive positive value.
> >>
> >> The condition > NSEC_PER_SEC will then evaluate to true, and the outer
> >> handler will erroneously skip the perf throttling logic. Should this
> >> check be rewritten to avoid subtraction, perhaps by using
> >> if (delta > sample_len_ns + NSEC_PER_SEC)?
> 
> The solution it proposed makes sense to me.

I replied too quickly: I think Sashiko is actually wrong.

It is assuming that sample_len_ns includes the latency of
perf_sample_event_took(), but it does not. Nesting in the middle of the
RMW of the percpu value strictly makes last_throttle_clock appear to
have happened *sooner* to the outer NMI, so I think that case works.

Thanks, apologies again for all the noise here,
Calvin

> > 	__report_avg = avg_len;
> > 	__report_allowed = max_len;
> > 
> > --
> > 2.47.3
> > 
On Tuesday 03/31 at 11:10 -0700, Calvin Owens wrote:
> On Tuesday 03/31 at 10:22 -0700, Calvin Owens wrote:
> > On Tuesday 03/31 at 08:25 -0700, Calvin Owens wrote:
> > > @@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
> > > 	if (avg_len <= max_len)
> > > 		return;
> > > 
> > > +	/*
> > > +	 * Very infrequent events like the perf counter hard watchdog
> > > +	 * can trigger spurious throttling: skip throttling if the prior
> > > +	 * NMI got here more than one second before this NMI began.
> > > +	 */
> > > +	now = local_clock();
> > > +	delta = now - __this_cpu_read(last_throttle_clock);
> > > +	__this_cpu_write(last_throttle_clock, now);
> > > +	if (delta - sample_len_ns > NSEC_PER_SEC)
> > > +		return;
> > 
> > Bah, Sashiko caught something obvious I missed:
> > 
> > https://sashiko.dev/#/patchset/cover.1774969692.git.calvin%40wbinvd.org
> > 
> > >> When the outer handler completes, its sample_len_ns (total execution
> > >> time) will be strictly greater than delta (time since the inner
> > >> handler finished). This guarantees delta < sample_len_ns, causing the
> > >> subtraction to underflow to a massive positive value.
> > >>
> > >> The condition > NSEC_PER_SEC will then evaluate to true, and the outer
> > >> handler will erroneously skip the perf throttling logic. Should this
> > >> check be rewritten to avoid subtraction, perhaps by using
> > >> if (delta > sample_len_ns + NSEC_PER_SEC)?
> > 
> > The solution it proposed makes sense to me.
> 
> I replied too quickly: I think Sashiko is actually wrong.

Last time, I swear to god. I worked this out: nesting is indeed a real
problem. The relevant RMW is:

	now = local_clock();
	delta = now - last_throttle_clock;
	last_throttle_clock = now;

Assume last_throttle_clock starts at zero.

Normal case:

	NMI >>> sample_len_ns=1000ns
		now = 1010
		delta = 1010
		last_throttle_clock = 1010
		(1010 - 1000 > NSEC_PER_SEC) == false	// does not skip throttle

Nesting case 1 (the inner NMI lands before the outer reads delta):

	NMI >>> sample_len_ns=1000ns
		now = 1010
		NMI >>> sample_len_ns=1000ns
			now = 2020
			delta = 2020
			last_throttle_clock = 2020
			(2020 - 1000 > NSEC_PER_SEC) == false	// does not skip throttle
		delta = *underflow*			// 1010 - 2020
		last_throttle_clock = 1010
		(*underflow* - 1000 > NSEC_PER_SEC) == true	// skips throttle

Nesting case 2 (the inner NMI lands after the outer reads delta):

	NMI >>> sample_len_ns=1000ns
		now = 1010
		delta = 1010
		NMI >>> sample_len_ns=1000ns
			now = 2020
			delta = 2020
			last_throttle_clock = 2020
			(2020 - 1000 > NSEC_PER_SEC) == false	// does not skip throttle
		last_throttle_clock = 1010
		(1010 - 1000 > NSEC_PER_SEC) == false	// does not skip throttle,
							// but the percpu clock has
							// moved backwards past 2020

I think the below deals with it. But I will wait to hear back before
sending a V2.

Thanks,
Calvin

---
 kernel/events/core.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 89b40e439717..c51d61fbb03b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -623,6 +623,7 @@ core_initcall(init_events_core_sysctls);
  */
 #define NR_ACCUMULATED_SAMPLES 128
 static DEFINE_PER_CPU(u64, running_sample_length);
+static DEFINE_PER_CPU(u64, last_throttle_clock);
 
 static u64 __report_avg;
 static u64 __report_allowed;
@@ -643,6 +644,8 @@ void perf_sample_event_took(u64 sample_len_ns)
 	u64 max_len = READ_ONCE(perf_sample_allowed_ns);
 	u64 running_len;
 	u64 avg_len;
+	u64 last;
+	u64 now;
 	u32 max;
 
 	if (max_len == 0)
@@ -663,6 +666,18 @@ void perf_sample_event_took(u64 sample_len_ns)
 	if (avg_len <= max_len)
 		return;
 
+	/*
+	 * Very infrequent events like the perf counter hard watchdog
+	 * can trigger spurious throttling: skip throttling if the prior
+	 * NMI got here more than one second before this NMI began. But
+	 * if NMIs are nesting, never skip throttling.
+	 */
+	now = local_clock();
+	last = __this_cpu_read(last_throttle_clock);
+	if (this_cpu_try_cmpxchg(last_throttle_clock, last, now) &&
+	    now - last > NSEC_PER_SEC)
+		return;
+
 	__report_avg = avg_len;
 	__report_allowed = max_len;
 
--
2.47.3
On Tuesday 03/31 at 10:22 -0700, Calvin Owens wrote:
> On Tuesday 03/31 at 08:25 -0700, Calvin Owens wrote:
> > @@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
> > 	if (avg_len <= max_len)
> > 		return;
> > 
> > +	/*
> > +	 * Very infrequent events like the perf counter hard watchdog
> > +	 * can trigger spurious throttling: skip throttling if the prior
> > +	 * NMI got here more than one second before this NMI began.
> > +	 */
> > +	now = local_clock();
> > +	delta = now - __this_cpu_read(last_throttle_clock);
> > +	__this_cpu_write(last_throttle_clock, now);
> > +	if (delta - sample_len_ns > NSEC_PER_SEC)
> > +		return;

Apologies for replying twice in a row... Sashiko made a second useful
observation:

>> There appears to be no upper bound on sample_len_ns itself. If an
>> event takes 5 seconds to run but is configured to fire only once every
>> 7 seconds, the idle time will be 2 seconds.
>>
>> Because 2 seconds is > NSEC_PER_SEC, the throttling logic is skipped
>> entirely. This defeats the sysctl_perf_cpu_time_max_percent safeguard
>> and allows an event to monopolize the CPU in NMI/IRQ context for
>> seconds at a time without ever being throttled.

I'm skeptical that would ever actually happen, but I think I can
address it by adding:

	&& sample_len_ns < NSEC_PER_SEC

...to the skip throttle condition?

In fairness to the LLM skeptics, the feedback Sashiko gave on patch 1/2
is absolute nonsense.

Thanks,
Calvin