Hi all,

In the course of investigating [1], I set out to understand why this
sequence of messages is printed on every boot, even when nobody is using
perf at all:

perf: interrupt took too long (2516 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
perf: interrupt took too long (3156 > 3145), lowering kernel.perf_event_max_sample_rate to 63000
perf: interrupt took too long (4014 > 3945), lowering kernel.perf_event_max_sample_rate to 49000
perf: interrupt took too long (5035 > 5017), lowering kernel.perf_event_max_sample_rate to 39000
perf: interrupt took too long (6302 > 6293), lowering kernel.perf_event_max_sample_rate to 31000
perf: interrupt took too long (7879 > 7877), lowering kernel.perf_event_max_sample_rate to 25000
perf: interrupt took too long (9852 > 9848), lowering kernel.perf_event_max_sample_rate to 20000

It turns out this happens because of how the dynamic sample rate
throttling interacts with the perf hardware watchdog. Patch [2/2] is my
attempt to keep the dynamic throttling logic from acting solely on the
latency of the watchdog NMI.

That was enough on Intel CPUs, but AMD CPUs still printed the messages.
This happens because AMD CPUs have a second PMU facility (IBS) with its
own NMI handler, and both NMI handlers fold their latency into the
average, even when they don't actually handle the NMI.

Patch [1/2] fixes that. It is a correctness fix entirely independent of
patch [2/2], but it also happens to be required for patch [2/2] to
achieve its goal on AMD CPUs, so I'm sending them together.

Thanks,
Calvin

[1] https://lore.kernel.org/all/acMe-QZUel-bBYUh@mozart.vkv.me/

Calvin Owens (2):
  perf/x86: Avoid double accounting of PMU NMI latencies
  perf: Don't throttle based on NMI watchdog events

 arch/x86/events/amd/ibs.c |  6 +++---
 arch/x86/events/core.c    |  3 ++-
 kernel/events/core.c      | 14 ++++++++++++++
 3 files changed, 19 insertions(+), 4 deletions(-)

--
2.47.3