When intel_pmu_drain_pebs_icl() is called to drain PEBS records,
perf_event_overflow() may be called to process the last PEBS record.
perf_event_overflow() can trigger the interrupt throttle and stop all
events of the group, as the call chain below shows.
perf_event_overflow()
-> __perf_event_overflow()
-> __perf_event_account_interrupt()
-> perf_event_throttle_group()
-> perf_event_throttle()
-> event->pmu->stop()
-> x86_pmu_stop()
A side effect of stopping the events is that all the corresponding
event pointers in the cpuc->events[] array are cleared to NULL.
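For reference, the relevant part of x86_pmu_stop() looks roughly like
this (an abridged sketch of arch/x86/events/core.c from memory, not a
verbatim quote):

	void x86_pmu_stop(struct perf_event *event, int flags)
	{
		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
		struct hw_perf_event *hwc = &event->hw;

		if (test_bit(hwc->idx, cpuc->active_mask)) {
			static_call(x86_pmu_disable)(event);
			__clear_bit(hwc->idx, cpuc->active_mask);
			cpuc->events[hwc->idx] = NULL;	/* pointer cleared here */
			hwc->state |= PERF_HES_STOPPED;
		}
		/* ... PERF_EF_UPDATE handling elided ... */
	}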
Assume there are two PEBS events (event a and event b) in a group. When
intel_pmu_drain_pebs_icl() calls perf_event_overflow() to process the
last PEBS record of event a, the interrupt throttle is triggered and the
pointers of both event a and event b are cleared to NULL. When
intel_pmu_drain_pebs_icl() then tries to process the last PEBS record of
event b, it dereferences a NULL pointer.
Since the remaining PEBS records have already been processed when the
event was stopped, check cpuc->events[*] and skip processing the last
PEBS record if it is NULL.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202507042103.a15d2923-lkp@intel.com
Fixes: 9734e25fbf5a ("perf: Fix the throttle logic for a group")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
---
arch/x86/events/intel/ds.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index c0b7ac1c7594..dcf29c099ad2 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2663,6 +2663,16 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
continue;
event = cpuc->events[bit];
+	/*
+	 * perf_event_overflow(), called from __intel_pmu_pebs_last_event()
+	 * below, may trigger the interrupt throttle and clear all event
+	 * pointers of the group in cpuc->events[] to NULL. Re-check
+	 * cpuc->events[*]: if it is NULL, the event has been throttled
+	 * (stopped) and its remaining PEBS records were already processed
+	 * when the event was stopped, so don't process it again.
+	 */
+ if (!event)
+ continue;
__intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
counts[bit], setup_pebs_adaptive_sample_data);
--
2.34.1
On Wed, Aug 20, 2025 at 10:30:28AM +0800, Dapeng Mi wrote:
> [... patch quoted in full above, snipped ...]
>
> +	if (!event)
> +		continue;
>
>  	__intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
>  				    counts[bit], setup_pebs_adaptive_sample_data);

So if this is due to __intel_pmu_pebs_last_event() calling into
perf_event_overflow(); then isn't intel_pmu_drain_pebs_nhm() similarly
affected?

And worse, the _nhm() version would lose all events for that counter,
not just the last.

I'm really thinking this isn't the right thing to do.

How about we audit the entirety of arch/x86/events/ for cpuc->events[]
usage and see if we can get away with changing x86_pmu_stop() to simply
not clear that field.

Or perhaps move the setting and clearing into x86_pmu_{add,del}() rather
than x86_pmu_{start,stop}(). After all, the latter don't affect the
counter placement, they just stop/start the event.
On 8/21/2025 9:35 PM, Peter Zijlstra wrote:
> On Wed, Aug 20, 2025 at 10:30:28AM +0800, Dapeng Mi wrote:
>> [... patch quoted in full, snipped ...]
>
> So if this is due to __intel_pmu_pebs_last_event() calling into
> perf_event_overflow(); then isn't intel_pmu_drain_pebs_nhm() similarly
> affected?
>
> And worse, the _nhm() version would lose all events for that counter,
> not just the last.

hmm, yes. After double checking, I suppose the answer I gave Andi was
mistaken. There is indeed data loss, since "ds->pebs_index" is reset at
the head of the _nhm()/_icl() drain_pebs helpers instead of at their
end. :(

> I'm really thinking this isn't the right thing to do.
>
> How about we audit the entirety of arch/x86/events/ for cpuc->events[]
> usage and see if we can get away with changing x86_pmu_stop() to simply
> not clear that field.

Checking the current code, I suppose it's fine not to clear
cpuc->events[] in x86_pmu_stop(), since we already have another
variable, "cpuc->active_mask", which indicates whether the corresponding
cpuc->events[*] entry is active. But in the current code,
cpuc->active_mask is not always checked.

So if we choose not to clear cpuc->events[] in x86_pmu_stop(), then it
becomes mandatory to check cpuc->active_mask before actually accessing
the event behind cpuc->events[]. Maybe we can add an inline function to
check this:

static inline bool x86_pmu_cntr_event_active(int idx)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

	return cpuc->events[idx] && test_bit(idx, cpuc->active_mask);
}

> Or perhaps move the setting and clearing into x86_pmu_{add,del}() rather
> than x86_pmu_{start,stop}(). After all, the latter don't affect the
> counter placement, they just stop/start the event.

IIUC, we cannot move the setting from x86_pmu_start() into x86_pmu_add(),
since the counter index is not yet finalized when x86_pmu_add() is
called. The counter index can change each time a new event is added.
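As a usage sketch for the helper above (hypothetical, since neither the
helper nor this change has landed; the loop head is paraphrased from
ds.c), the tail loop in intel_pmu_drain_pebs_icl() would then look
something like:

	for_each_set_bit(bit, (unsigned long *)&mask, size) {
		if (!counts[bit])
			continue;

		/* Skip counters whose event was throttled (stopped) meanwhile. */
		if (!x86_pmu_cntr_event_active(bit))
			continue;

		event = cpuc->events[bit];

		__intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
					    counts[bit], setup_pebs_adaptive_sample_data);
	}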
On 8/22/2025 1:26 PM, Mi, Dapeng wrote:
> On 8/21/2025 9:35 PM, Peter Zijlstra wrote:
>> [... full quote snipped ...]
>>
>> How about we audit the entirety of arch/x86/events/ for cpuc->events[]
>> usage and see if we can get away with changing x86_pmu_stop() to simply
>> not clear that field.
>
> [...] Maybe we can add an inline function to check this:
>
> static inline bool x86_pmu_cntr_event_active(int idx)
> {
> 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>
> 	return cpuc->events[idx] && test_bit(idx, cpuc->active_mask);
> }

Thinking twice about this method: it breaks the original logic around
cpuc->events[] (set in x86_pmu_start(), cleared in x86_pmu_stop()) and
leaves cpuc->events[] never cleared, which is ambiguous. Besides, the
check of cpuc->events[idx] and cpuc->active_mask is not atomic, which
may bring potential risks, especially considering how broadly
cpuc->events[] is used in the x86/perf code.

I talked with Kan offline; he suggests taking a snapshot of
cpuc->events[] before calling perf_event_overflow() and then using that
snapshot to process all the remaining PEBS records. That seems a better
and safer way to fix this issue to me.

Peter, if you don't object to this method, I would follow Kan's
suggestion. Thanks.
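For illustration only, Kan's snapshot idea might look roughly like the
below in intel_pmu_drain_pebs_icl(). This is a sketch under assumed
names (the events[] local is illustrative), not the final patch:

	struct perf_event *event, *events[X86_PMC_IDX_MAX] = { NULL };
	int bit;

	/*
	 * Snapshot the event pointers up front: perf_event_overflow(),
	 * reached via __intel_pmu_pebs_last_event(), may throttle the
	 * group and clear cpuc->events[] to NULL underneath us.
	 */
	for_each_set_bit(bit, (unsigned long *)&mask, size)
		events[bit] = cpuc->events[bit];

	for_each_set_bit(bit, (unsigned long *)&mask, size) {
		if (!counts[bit])
			continue;

		event = events[bit];

		__intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
					    counts[bit], setup_pebs_adaptive_sample_data);
	}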
>  	event = cpuc->events[bit];
> +	[... NULL re-check comment snipped ...]
> +	if (!event)
> +		continue;

Then we silently ignore the overflow. Would be better to log at least an
overflow packet or something like that.

-Andi
On 8/20/2025 11:41 AM, Andi Kleen wrote:
>>  	event = cpuc->events[bit];
>> +	if (!event)
>> +		continue;
>
> Then we silently ignore the overflow. Would be better to log at least an
> overflow packet or something like that.

Andi, I didn't fully get the exact meaning of "log" here. When the
throttle is triggered, perf_event_throttle() has already called
perf_log_throttle() to log the throttle event, although only for the
group leader. Is that enough?
> Andi, I didn't fully get the exact meaning of "log" here. When the
> throttle is triggered, perf_event_throttle() has already called
> perf_log_throttle() to log the throttle event, although only for the
> group leader. Is that enough?

Throttling normally doesn't involve data loss, just fewer samples. But
this is data loss, so it's an overflow.

-Andi
On 8/20/2025 1:44 PM, Andi Kleen wrote:
>> Andi, I didn't fully get the exact meaning of "log" here. When the
>> throttle is triggered, perf_event_throttle() has already called
>> perf_log_throttle() to log the throttle event, although only for the
>> group leader. Is that enough?
>
> Throttling normally doesn't involve data loss, just fewer samples. But
> this is data loss, so it's an overflow.

IIUC, there should be no data loss: the unprocessed PEBS records of the
throttled events are still processed eventually, by the call to
intel_pmu_drain_pebs_buffer() when the event is stopped.
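For context, the drain-on-stop path being referred to looks roughly like
this (an abridged sketch of arch/x86/events/intel/ds.c from memory;
treat the details as assumptions, not an exact citation):

	static void intel_pmu_drain_large_pebs(struct cpu_hw_events *cpuc)
	{
		/* Only large-PEBS events can still have records buffered in the DS area. */
		if (cpuc->n_pebs == cpuc->n_large_pebs &&
		    cpuc->n_pebs != cpuc->n_pebs_via_pt)
			intel_pmu_drain_pebs_buffer();
	}

	static void intel_pmu_pebs_disable(struct perf_event *event)
	{
		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

		/* Reached via x86_pmu_stop() when the event is throttled/stopped. */
		intel_pmu_drain_large_pebs(cpuc);

		/* ... clear the PEBS enable bit for this counter, etc. ... */
	}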
On Wed, Aug 20, 2025 at 01:54:17PM +0800, Mi, Dapeng wrote:
> IIUC, there should be no data loss: the unprocessed PEBS records of the
> throttled events are still processed eventually, by the call to
> intel_pmu_drain_pebs_buffer() when the event is stopped.

Makes sense. Thanks,

-Andi