[PATCH V3 02/16] perf: Fix the throttle logic for a group

kan.liang@linux.intel.com posted 16 patches 7 months, 1 week ago
There is a newer version of this series
Posted by kan.liang@linux.intel.com 7 months, 1 week ago
From: Kan Liang <kan.liang@linux.intel.com>

The current throttle logic doesn't work well with a group, e.g., the
following sampling-read case.

$ perf record -e "{cycles,cycles}:S" ...

$ perf report -D | grep THROTTLE | tail -2
            THROTTLE events:        426  ( 9.0%)
          UNTHROTTLE events:        425  ( 9.0%)

$ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
... sample_read:
.... group nr 2
..... id 0000000000000327, value 000000000cbb993a, lost 0
..... id 0000000000000328, value 00000002211c26df, lost 0

The second cycles event has a much larger value than the first cycles
event in the same group.
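
To put a number on that gap, the two hex values from the sample_read
output can be compared directly. This is only a worked-arithmetic sketch;
the macro and helper names are made up for illustration and are not part
of perf.

```c
#include <stdint.h>
#include <stdio.h>

/* Counter values copied from the two sample_read lines above. */
#define FIRST_CYCLES  0x000000000cbb993aULL
#define SECOND_CYCLES 0x00000002211c26dfULL

/* Whole-number factor by which 'large' exceeds 'small'. */
static uint64_t count_ratio(uint64_t small, uint64_t large)
{
	return large / small;
}

/* Print both counts in decimal together with the factor between them. */
static void print_count_gap(void)
{
	printf("first=%llu second=%llu factor=%llu\n",
	       (unsigned long long)FIRST_CYCLES,
	       (unsigned long long)SECOND_CYCLES,
	       (unsigned long long)count_ratio(FIRST_CYCLES, SECOND_CYCLES));
}
```

The second count comes out at roughly 42 times the first, although both
events count the same thing in the same group.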

The current throttle logic in the generic code only logs the THROTTLE
event. It relies on the specific driver implementation to disable
events. The implementation is similar across all ARCHs: only the
overflowing event is disabled, rather than the whole group.

The logic to disable the group should be generic for all ARCHs. Add the
logic in the generic code. The following patch will remove the buggy
driver-specific implementation.

Throttling only happens when an event overflows. Stop the entire group
when any event in the group triggers the throttle, and set
MAX_INTERRUPTS on every throttled event.

Unthrottling can happen in 3 places.
- event/group sched. All events in the group are scheduled one by one.
  All of them will be unthrottled eventually. Nothing needs to be
  changed.
- perf_adjust_freq_unthr_events() on each tick. The whole group needs
  to be restarted together.
- __perf_event_period(). The whole group needs to be restarted
  together as well.
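
The effect of stopping one event versus the whole group can be sketched
with a small userspace model. Everything below is hypothetical (the gt_*
names, the cycle numbers, the two-event group); it only mimics the
interrupts/MAX_INTERRUPTS bookkeeping and is not kernel code.

```c
#include <stdbool.h>

#define GT_MAX_INTERRUPTS (~0U)	/* stand-in for the kernel's MAX_INTERRUPTS */
#define GT_GROUP_SIZE 2

struct gt_event {
	unsigned int interrupts;	/* mirrors event->hw.interrupts */
	bool running;			/* true between start and stop */
	unsigned long long count;	/* cycles counted while running */
};

/* A group; ev[0] plays the leader. */
struct gt_group {
	struct gt_event ev[GT_GROUP_SIZE];
};

static void gt_throttle(struct gt_event *e)
{
	e->running = false;
	e->interrupts = GT_MAX_INTERRUPTS;
}

static void gt_unthrottle(struct gt_event *e)
{
	e->interrupts = 0;
	e->running = true;
}

/* Advance time: every running event counts 'cycles'. */
static void gt_run(struct gt_group *g, unsigned long long cycles)
{
	for (int i = 0; i < GT_GROUP_SIZE; i++)
		if (g->ev[i].running)
			g->ev[i].count += cycles;
}

/*
 * Throttle after the leader overflows, stay throttled for a while, then
 * unthrottle. Returns how far the sibling's count ran ahead of the leader's.
 */
static unsigned long long gt_divergence(bool group_wide)
{
	struct gt_group g = { 0 };
	int n = group_wide ? GT_GROUP_SIZE : 1;	/* 1: leader only (old behaviour) */

	for (int i = 0; i < GT_GROUP_SIZE; i++)
		g.ev[i].running = true;
	gt_run(&g, 100);

	for (int i = 0; i < n; i++)
		gt_throttle(&g.ev[i]);
	gt_run(&g, 900);			/* throttled window */

	for (int i = 0; i < n; i++)
		gt_unthrottle(&g.ev[i]);
	gt_run(&g, 100);

	return g.ev[1].count - g.ev[0].count;
}
```

In this model gt_divergence(false) leaves the sibling ahead by the full
throttled window, while gt_divergence(true) keeps both counts equal.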

With the fix,
$ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
... sample_read:
.... group nr 2
..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 kernel/events/core.c | 60 ++++++++++++++++++++++++++++++++------------
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index af78ec118e8f..52490c2ce45b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2739,6 +2739,39 @@ void perf_event_disable_inatomic(struct perf_event *event)
 static void perf_log_throttle(struct perf_event *event, int enable);
 static void perf_log_itrace_start(struct perf_event *event);
 
+static void perf_event_unthrottle(struct perf_event *event, bool start)
+{
+	event->hw.interrupts = 0;
+	if (start)
+		event->pmu->start(event, 0);
+	perf_log_throttle(event, 1);
+}
+
+static void perf_event_throttle(struct perf_event *event)
+{
+	event->pmu->stop(event, 0);
+	event->hw.interrupts = MAX_INTERRUPTS;
+	perf_log_throttle(event, 0);
+}
+
+static void perf_event_unthrottle_group(struct perf_event *event, bool skip_start_event)
+{
+	struct perf_event *sibling, *leader = event->group_leader;
+
+	perf_event_unthrottle(leader, skip_start_event ? leader != event : true);
+	for_each_sibling_event(sibling, leader)
+		perf_event_unthrottle(sibling, skip_start_event ? sibling != event : true);
+}
+
+static void perf_event_throttle_group(struct perf_event *event)
+{
+	struct perf_event *sibling, *leader = event->group_leader;
+
+	perf_event_throttle(leader);
+	for_each_sibling_event(sibling, leader)
+		perf_event_throttle(sibling);
+}
+
 static int
 event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
 {
@@ -4393,12 +4426,8 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
 
 		hwc = &event->hw;
 
-		if (hwc->interrupts == MAX_INTERRUPTS) {
-			hwc->interrupts = 0;
-			perf_log_throttle(event, 1);
-			if (!is_event_in_freq_mode(event))
-				event->pmu->start(event, 0);
-		}
+		if (hwc->interrupts == MAX_INTERRUPTS)
+			perf_event_unthrottle_group(event, is_event_in_freq_mode(event));
 
 		if (!is_event_in_freq_mode(event))
 			continue;
@@ -6426,14 +6455,6 @@ static void __perf_event_period(struct perf_event *event,
 	active = (event->state == PERF_EVENT_STATE_ACTIVE);
 	if (active) {
 		perf_pmu_disable(event->pmu);
-		/*
-		 * We could be throttled; unthrottle now to avoid the tick
-		 * trying to unthrottle while we already re-started the event.
-		 */
-		if (event->hw.interrupts == MAX_INTERRUPTS) {
-			event->hw.interrupts = 0;
-			perf_log_throttle(event, 1);
-		}
 		event->pmu->stop(event, PERF_EF_UPDATE);
 	}
 
@@ -6441,6 +6462,14 @@ static void __perf_event_period(struct perf_event *event,
 
 	if (active) {
 		event->pmu->start(event, PERF_EF_RELOAD);
+		/*
+		 * Once the period is force-reset, the event starts immediately.
+		 * But the event/group could be throttled. Unthrottle the
+		 * event/group now to avoid the next tick trying to unthrottle
+		 * while we already re-started the event/group.
+		 */
+		if (event->hw.interrupts == MAX_INTERRUPTS)
+			perf_event_unthrottle_group(event, true);
 		perf_pmu_enable(event->pmu);
 	}
 }
@@ -10331,8 +10360,7 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
 	if (unlikely(throttle && hwc->interrupts >= max_samples_per_tick)) {
 		__this_cpu_inc(perf_throttled_count);
 		tick_dep_set_cpu(smp_processor_id(), TICK_DEP_BIT_PERF_EVENTS);
-		hwc->interrupts = MAX_INTERRUPTS;
-		perf_log_throttle(event, 0);
+		perf_event_throttle_group(event);
 		ret = 1;
 	}
 
-- 
2.38.1
Re: [PATCH V3 02/16] perf: Fix the throttle logic for a group
Posted by Namhyung Kim 7 months ago
Hi Kan,

On Fri, May 16, 2025 at 11:28:39AM -0700, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> The current throttle logic doesn't work well with a group, e.g., the
> following sampling-read case.
> 
> $ perf record -e "{cycles,cycles}:S" ...
> 
> $ perf report -D | grep THROTTLE | tail -2
>             THROTTLE events:        426  ( 9.0%)
>           UNTHROTTLE events:        425  ( 9.0%)
> 
> $ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
> 0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
> ... sample_read:
> .... group nr 2
> ..... id 0000000000000327, value 000000000cbb993a, lost 0
> ..... id 0000000000000328, value 00000002211c26df, lost 0
> 
> The second cycles event has a much larger value than the first cycles
> event in the same group.
> 
> The current throttle logic in the generic code only logs the THROTTLE
> event. It relies on the specific driver implementation to disable
> events. For all ARCHs, the implementation is similar. Only the event is
> disabled, rather than the group.
> 
> The logic to disable the group should be generic for all ARCHs. Add the
> logic in the generic code. The following patch will remove the buggy
> driver-specific implementation.
> 
> The throttle only happens when an event is overflowed. Stop the entire
> group when any event in the group triggers the throttle.
> The MAX_INTERRUPTS is set to all throttle events.
> 
> The unthrottled could happen in 3 places.
> - event/group sched. All events in the group are scheduled one by one.
>   All of them will be unthrottled eventually. Nothing needs to be
>   changed.
> - The perf_adjust_freq_unthr_events for each tick. Needs to restart the
>   group altogether.
> - The __perf_event_period(). The whole group needs to be restarted
>   altogether as well.
> 
> With the fix,
> $ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
> 0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
> ... sample_read:
> .... group nr 2
> ..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
> ..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0

Thanks for working on this!

> 
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  kernel/events/core.c | 60 ++++++++++++++++++++++++++++++++------------
>  1 file changed, 44 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index af78ec118e8f..52490c2ce45b 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2739,6 +2739,39 @@ void perf_event_disable_inatomic(struct perf_event *event)
>  static void perf_log_throttle(struct perf_event *event, int enable);
>  static void perf_log_itrace_start(struct perf_event *event);
>  
> +static void perf_event_unthrottle(struct perf_event *event, bool start)
> +{
> +	event->hw.interrupts = 0;
> +	if (start)
> +		event->pmu->start(event, 0);
> +	perf_log_throttle(event, 1);
> +}
> +
> +static void perf_event_throttle(struct perf_event *event)
> +{
> +	event->pmu->stop(event, 0);
> +	event->hw.interrupts = MAX_INTERRUPTS;
> +	perf_log_throttle(event, 0);
> +}
> +
> +static void perf_event_unthrottle_group(struct perf_event *event, bool skip_start_event)
> +{
> +	struct perf_event *sibling, *leader = event->group_leader;
> +
> +	perf_event_unthrottle(leader, skip_start_event ? leader != event : true);
> +	for_each_sibling_event(sibling, leader)
> +		perf_event_unthrottle(sibling, skip_start_event ? sibling != event : true);

This will add more PERF_RECORD_THROTTLE records for sibling events.
Maybe we can generate it for the actual target event only?

Also the condition for skip_start_event is whether it's a freq event.
I think we can skip pmu->start() if the sibling is also a freq event.
I remember the KVM folks being concerned about the number of PMU
accesses, as they can cause VM exits.

Thanks,
Namhyung

> +}
> +
> +static void perf_event_throttle_group(struct perf_event *event)
> +{
> +	struct perf_event *sibling, *leader = event->group_leader;
> +
> +	perf_event_throttle(leader);
> +	for_each_sibling_event(sibling, leader)
> +		perf_event_throttle(sibling);
> +}
> +
>  static int
>  event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
>  {
> @@ -4393,12 +4426,8 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>  
>  		hwc = &event->hw;
>  
> -		if (hwc->interrupts == MAX_INTERRUPTS) {
> -			hwc->interrupts = 0;
> -			perf_log_throttle(event, 1);
> -			if (!is_event_in_freq_mode(event))
> -				event->pmu->start(event, 0);
> -		}
> +		if (hwc->interrupts == MAX_INTERRUPTS)
> +			perf_event_unthrottle_group(event, is_event_in_freq_mode(event));
>  
>  		if (!is_event_in_freq_mode(event))
>  			continue;
> @@ -6426,14 +6455,6 @@ static void __perf_event_period(struct perf_event *event,
>  	active = (event->state == PERF_EVENT_STATE_ACTIVE);
>  	if (active) {
>  		perf_pmu_disable(event->pmu);
> -		/*
> -		 * We could be throttled; unthrottle now to avoid the tick
> -		 * trying to unthrottle while we already re-started the event.
> -		 */
> -		if (event->hw.interrupts == MAX_INTERRUPTS) {
> -			event->hw.interrupts = 0;
> -			perf_log_throttle(event, 1);
> -		}
>  		event->pmu->stop(event, PERF_EF_UPDATE);
>  	}
>  
> @@ -6441,6 +6462,14 @@ static void __perf_event_period(struct perf_event *event,
>  
>  	if (active) {
>  		event->pmu->start(event, PERF_EF_RELOAD);
> +		/*
> +		 * Once the period is force-reset, the event starts immediately.
> +		 * But the event/group could be throttled. Unthrottle the
> +		 * event/group now to avoid the next tick trying to unthrottle
> +		 * while we already re-started the event/group.
> +		 */
> +		if (event->hw.interrupts == MAX_INTERRUPTS)
> +			perf_event_unthrottle_group(event, true);
>  		perf_pmu_enable(event->pmu);
>  	}
>  }
> @@ -10331,8 +10360,7 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
>  	if (unlikely(throttle && hwc->interrupts >= max_samples_per_tick)) {
>  		__this_cpu_inc(perf_throttled_count);
>  		tick_dep_set_cpu(smp_processor_id(), TICK_DEP_BIT_PERF_EVENTS);
> -		hwc->interrupts = MAX_INTERRUPTS;
> -		perf_log_throttle(event, 0);
> +		perf_event_throttle_group(event);
>  		ret = 1;
>  	}
>  
> -- 
> 2.38.1
>
Re: [PATCH V3 02/16] perf: Fix the throttle logic for a group
Posted by Liang, Kan 7 months ago

On 2025-05-18 3:18 p.m., Namhyung Kim wrote:
> Hi Kan,
> 
> On Fri, May 16, 2025 at 11:28:39AM -0700, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> The current throttle logic doesn't work well with a group, e.g., the
>> following sampling-read case.
>>
>> $ perf record -e "{cycles,cycles}:S" ...
>>
>> $ perf report -D | grep THROTTLE | tail -2
>>             THROTTLE events:        426  ( 9.0%)
>>           UNTHROTTLE events:        425  ( 9.0%)
>>
>> $ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
>> 0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
>> ... sample_read:
>> .... group nr 2
>> ..... id 0000000000000327, value 000000000cbb993a, lost 0
>> ..... id 0000000000000328, value 00000002211c26df, lost 0
>>
>> The second cycles event has a much larger value than the first cycles
>> event in the same group.
>>
>> The current throttle logic in the generic code only logs the THROTTLE
>> event. It relies on the specific driver implementation to disable
>> events. For all ARCHs, the implementation is similar. Only the event is
>> disabled, rather than the group.
>>
>> The logic to disable the group should be generic for all ARCHs. Add the
>> logic in the generic code. The following patch will remove the buggy
>> driver-specific implementation.
>>
>> The throttle only happens when an event is overflowed. Stop the entire
>> group when any event in the group triggers the throttle.
>> The MAX_INTERRUPTS is set to all throttle events.
>>
>> The unthrottled could happen in 3 places.
>> - event/group sched. All events in the group are scheduled one by one.
>>   All of them will be unthrottled eventually. Nothing needs to be
>>   changed.
>> - The perf_adjust_freq_unthr_events for each tick. Needs to restart the
>>   group altogether.
>> - The __perf_event_period(). The whole group needs to be restarted
>>   altogether as well.
>>
>> With the fix,
>> $ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
>> 0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
>> ... sample_read:
>> .... group nr 2
>> ..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
>> ..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0
> 
> Thanks for working on this!
> 
>>
>> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>>  kernel/events/core.c | 60 ++++++++++++++++++++++++++++++++------------
>>  1 file changed, 44 insertions(+), 16 deletions(-)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index af78ec118e8f..52490c2ce45b 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -2739,6 +2739,39 @@ void perf_event_disable_inatomic(struct perf_event *event)
>>  static void perf_log_throttle(struct perf_event *event, int enable);
>>  static void perf_log_itrace_start(struct perf_event *event);
>>  
>> +static void perf_event_unthrottle(struct perf_event *event, bool start)
>> +{
>> +	event->hw.interrupts = 0;
>> +	if (start)
>> +		event->pmu->start(event, 0);
>> +	perf_log_throttle(event, 1);
>> +}
>> +
>> +static void perf_event_throttle(struct perf_event *event)
>> +{
>> +	event->pmu->stop(event, 0);
>> +	event->hw.interrupts = MAX_INTERRUPTS;
>> +	perf_log_throttle(event, 0);
>> +}
>> +
>> +static void perf_event_unthrottle_group(struct perf_event *event, bool skip_start_event)
>> +{
>> +	struct perf_event *sibling, *leader = event->group_leader;
>> +
>> +	perf_event_unthrottle(leader, skip_start_event ? leader != event : true);
>> +	for_each_sibling_event(sibling, leader)
>> +		perf_event_unthrottle(sibling, skip_start_event ? sibling != event : true);
> 
> This will add more PERF_RECORD_THROTTLE records for sibling events.

Yes

> Maybe we can generate it for the actual target event only?

The current code cannot track the actual target event for unthrottle,
because MAX_INTERRUPTS is set for all events when the group is throttled.

But I think we can add a PERF_RECORD_THROTTLE record for the leader
event only, which reduces the number of THROTTLE records.

The sample right after the THROTTLE record must be generated by the
actual target event. I think that should be good enough for the perf tool
to locate the event.

I will add the below patch as a separate improvement in V4.

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 52490c2ce45b..cd559501cfbd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2744,14 +2744,16 @@ static void perf_event_unthrottle(struct perf_event *event, bool start)
  	event->hw.interrupts = 0;
  	if (start)
  		event->pmu->start(event, 0);
-	perf_log_throttle(event, 1);
+	if (event == event->group_leader)
+		perf_log_throttle(event, 1);
  }

  static void perf_event_throttle(struct perf_event *event)
  {
  	event->pmu->stop(event, 0);
  	event->hw.interrupts = MAX_INTERRUPTS;
-	perf_log_throttle(event, 0);
+	if (event == event->group_leader)
+		perf_log_throttle(event, 0);
  }
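
With THROTTLE logged for the leader only, a consumer could use the next
sample to find the event that actually overflowed, as described above. A
minimal sketch of that lookup, assuming a hypothetical, already-decoded
record array (the tr_* types are made up here and are not the perf tool's
structures):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record stream; not the perf tool's real structures. */
enum tr_type { TR_SAMPLE, TR_THROTTLE, TR_UNTHROTTLE };

struct tr_record {
	enum tr_type type;
	uint64_t id;	/* event id carried by the record */
};

/*
 * Given the index of a THROTTLE record, return the id of the first SAMPLE
 * that follows it, or 0 if the stream ends first.
 */
static uint64_t tr_throttled_event_id(const struct tr_record *recs, size_t n,
				      size_t throttle_idx)
{
	for (size_t i = throttle_idx + 1; i < n; i++)
		if (recs[i].type == TR_SAMPLE)
			return recs[i].id;
	return 0;
}
```

For a stream where a THROTTLE record tagged with the leader id is followed
by a sample from the sibling, the helper returns the sibling id.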


> 
> Also the condition for skip_start_event is if it's a freq event.
> I think we can skip pmu->start() if the sibling is also a freq event.

skip_start_event is set when the event will be started separately later.
It is intended to avoid a double start.

perf_adjust_freq_unthr_events() only adjusts and starts the leader
event, not the group. If we skip pmu->start() for a freq sibling
event, it will not start until the next context switch.
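
A small userspace model of that rule, to show that every group member
ends up started exactly once. The ss_* names are hypothetical, and the
period adjustment of a freq leader is reduced to a single start call, so
treat it as a sketch rather than the kernel's logic.

```c
#include <stdbool.h>

struct ss_event {
	bool freq;		/* frequency mode, cf. is_event_in_freq_mode() */
	bool running;
	unsigned int starts;	/* how many times start was called */
};

static void ss_start(struct ss_event *e)
{
	e->running = true;
	e->starts++;
}

/*
 * Same selection rule as perf_event_unthrottle_group() in the patch: start
 * every member, except 'self' when the caller will start it on its own.
 * grp[0] plays the leader.
 */
static void ss_unthrottle_group(struct ss_event *grp, int n, int self,
				bool skip_start_event)
{
	for (int i = 0; i < n; i++)
		if (!skip_start_event || i != self)
			ss_start(&grp[i]);
}

/*
 * Tick path for a throttled leader. A freq-mode leader is restarted by the
 * period adjustment that follows, modelled here as one ss_start() call.
 */
static void ss_tick(struct ss_event *grp, int n)
{
	ss_unthrottle_group(grp, n, 0, grp[0].freq);
	if (grp[0].freq)
		ss_start(&grp[0]);
}
```

With a freq leader and a freq sibling, ss_tick() starts each of them once.
Skipping the start for the freq sibling as well would leave it stopped in
this model, since nothing else restarts it on the tick.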

Thanks,
Kan

> I remember KVM folks concern about the number of PMU accesses as it
> can cause VM exits.
> 
> Thanks,
> Namhyung
> 
>> +}
>> +
>> +static void perf_event_throttle_group(struct perf_event *event)
>> +{
>> +	struct perf_event *sibling, *leader = event->group_leader;
>> +
>> +	perf_event_throttle(leader);
>> +	for_each_sibling_event(sibling, leader)
>> +		perf_event_throttle(sibling);
>> +}
>> +
>>  static int
>>  event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
>>  {
>> @@ -4393,12 +4426,8 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>  
>>  		hwc = &event->hw;
>>  
>> -		if (hwc->interrupts == MAX_INTERRUPTS) {
>> -			hwc->interrupts = 0;
>> -			perf_log_throttle(event, 1);
>> -			if (!is_event_in_freq_mode(event))
>> -				event->pmu->start(event, 0);
>> -		}
>> +		if (hwc->interrupts == MAX_INTERRUPTS)
>> +			perf_event_unthrottle_group(event, is_event_in_freq_mode(event));
>>  
>>  		if (!is_event_in_freq_mode(event))
>>  			continue;
>> @@ -6426,14 +6455,6 @@ static void __perf_event_period(struct perf_event *event,
>>  	active = (event->state == PERF_EVENT_STATE_ACTIVE);
>>  	if (active) {
>>  		perf_pmu_disable(event->pmu);
>> -		/*
>> -		 * We could be throttled; unthrottle now to avoid the tick
>> -		 * trying to unthrottle while we already re-started the event.
>> -		 */
>> -		if (event->hw.interrupts == MAX_INTERRUPTS) {
>> -			event->hw.interrupts = 0;
>> -			perf_log_throttle(event, 1);
>> -		}
>>  		event->pmu->stop(event, PERF_EF_UPDATE);
>>  	}
>>  
>> @@ -6441,6 +6462,14 @@ static void __perf_event_period(struct perf_event *event,
>>  
>>  	if (active) {
>>  		event->pmu->start(event, PERF_EF_RELOAD);
>> +		/*
>> +		 * Once the period is force-reset, the event starts immediately.
>> +		 * But the event/group could be throttled. Unthrottle the
>> +		 * event/group now to avoid the next tick trying to unthrottle
>> +		 * while we already re-started the event/group.
>> +		 */
>> +		if (event->hw.interrupts == MAX_INTERRUPTS)
>> +			perf_event_unthrottle_group(event, true);
>>  		perf_pmu_enable(event->pmu);
>>  	}
>>  }
>> @@ -10331,8 +10360,7 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
>>  	if (unlikely(throttle && hwc->interrupts >= max_samples_per_tick)) {
>>  		__this_cpu_inc(perf_throttled_count);
>>  		tick_dep_set_cpu(smp_processor_id(), TICK_DEP_BIT_PERF_EVENTS);
>> -		hwc->interrupts = MAX_INTERRUPTS;
>> -		perf_log_throttle(event, 0);
>> +		perf_event_throttle_group(event);
>>  		ret = 1;
>>  	}
>>  
>> -- 
>> 2.38.1
>>
>
Re: [PATCH V3 02/16] perf: Fix the throttle logic for a group
Posted by Namhyung Kim 7 months ago
On Tue, May 20, 2025 at 10:47:21AM -0400, Liang, Kan wrote:
> 
> 
> On 2025-05-18 3:18 p.m., Namhyung Kim wrote:
> > Hi Kan,
> > 
> > On Fri, May 16, 2025 at 11:28:39AM -0700, kan.liang@linux.intel.com wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> The current throttle logic doesn't work well with a group, e.g., the
> >> following sampling-read case.
> >>
> >> $ perf record -e "{cycles,cycles}:S" ...
> >>
> >> $ perf report -D | grep THROTTLE | tail -2
> >>             THROTTLE events:        426  ( 9.0%)
> >>           UNTHROTTLE events:        425  ( 9.0%)
> >>
> >> $ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
> >> 0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
> >> ... sample_read:
> >> .... group nr 2
> >> ..... id 0000000000000327, value 000000000cbb993a, lost 0
> >> ..... id 0000000000000328, value 00000002211c26df, lost 0
> >>
> >> The second cycles event has a much larger value than the first cycles
> >> event in the same group.
> >>
> >> The current throttle logic in the generic code only logs the THROTTLE
> >> event. It relies on the specific driver implementation to disable
> >> events. For all ARCHs, the implementation is similar. Only the event is
> >> disabled, rather than the group.
> >>
> >> The logic to disable the group should be generic for all ARCHs. Add the
> >> logic in the generic code. The following patch will remove the buggy
> >> driver-specific implementation.
> >>
> >> The throttle only happens when an event is overflowed. Stop the entire
> >> group when any event in the group triggers the throttle.
> >> The MAX_INTERRUPTS is set to all throttle events.
> >>
> >> The unthrottled could happen in 3 places.
> >> - event/group sched. All events in the group are scheduled one by one.
> >>   All of them will be unthrottled eventually. Nothing needs to be
> >>   changed.
> >> - The perf_adjust_freq_unthr_events for each tick. Needs to restart the
> >>   group altogether.
> >> - The __perf_event_period(). The whole group needs to be restarted
> >>   altogether as well.
> >>
> >> With the fix,
> >> $ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
> >> 0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
> >> ... sample_read:
> >> .... group nr 2
> >> ..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
> >> ..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0
> > 
> > Thanks for working on this!
> > 
> >>
> >> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> ---
> >>  kernel/events/core.c | 60 ++++++++++++++++++++++++++++++++------------
> >>  1 file changed, 44 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/kernel/events/core.c b/kernel/events/core.c
> >> index af78ec118e8f..52490c2ce45b 100644
> >> --- a/kernel/events/core.c
> >> +++ b/kernel/events/core.c
> >> @@ -2739,6 +2739,39 @@ void perf_event_disable_inatomic(struct perf_event *event)
> >>  static void perf_log_throttle(struct perf_event *event, int enable);
> >>  static void perf_log_itrace_start(struct perf_event *event);
> >>  
> >> +static void perf_event_unthrottle(struct perf_event *event, bool start)
> >> +{
> >> +	event->hw.interrupts = 0;
> >> +	if (start)
> >> +		event->pmu->start(event, 0);
> >> +	perf_log_throttle(event, 1);
> >> +}
> >> +
> >> +static void perf_event_throttle(struct perf_event *event)
> >> +{
> >> +	event->pmu->stop(event, 0);
> >> +	event->hw.interrupts = MAX_INTERRUPTS;
> >> +	perf_log_throttle(event, 0);
> >> +}
> >> +
> >> +static void perf_event_unthrottle_group(struct perf_event *event, bool skip_start_event)
> >> +{
> >> +	struct perf_event *sibling, *leader = event->group_leader;
> >> +
> >> +	perf_event_unthrottle(leader, skip_start_event ? leader != event : true);
> >> +	for_each_sibling_event(sibling, leader)
> >> +		perf_event_unthrottle(sibling, skip_start_event ? sibling != event : true);
> > 
> > This will add more PERF_RECORD_THROTTLE records for sibling events.
> 
> Yes
> 
> > Maybe we can generate it for the actual target event only?
> 
> The current code cannot track the actual target event for unthrottle.
> Because the MAX_INTERRUPTS are set for all events when event_throttle.

Right.

> 
> But I think we can only add a PERF_RECORD_THROTTLE record for the leader
> event, which can reduce the number of THROTTLE records.

Sounds good.

> 
> The sample right after the THROTTLE record must be generated by the
> actual target event. I think it should be good enough for the perf tool
> to locate the event.

IIRC perf tool doesn't track which event is throttled, but yeah, it'd be
possible to use the next sample to locate it.

> 
> I will add the below patch as a separate improvement in V4.
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 52490c2ce45b..cd559501cfbd 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2744,14 +2744,16 @@ static void perf_event_unthrottle(struct perf_event *event, bool start)
>   	event->hw.interrupts = 0;
>   	if (start)
>   		event->pmu->start(event, 0);
> -	perf_log_throttle(event, 1);
> +	if (event == event->group_leader)
> +		perf_log_throttle(event, 1);
>   }
> 
>   static void perf_event_throttle(struct perf_event *event)
>   {
>   	event->pmu->stop(event, 0);
>   	event->hw.interrupts = MAX_INTERRUPTS;
> -	perf_log_throttle(event, 0);
> +	if (event == event->group_leader)
> +		perf_log_throttle(event, 0);
>   }

Looks good.

> 
> 
> > 
> > Also the condition for skip_start_event is if it's a freq event.
> > I think we can skip pmu->start() if the sibling is also a freq event.
> 
> The skip_start_event is if it will be start later separately. It intends
> to avoid the double start.
> 
> In the perf_adjust_freq_unthr_events(), it will only adjust and start
> the leader event, not group. If we skip pmu->start() for a freq sibling
> event, it will not start until the next context switch.

Oh, I missed that it only has leaders in the active list.

Thanks,
Namhyung
Re: [PATCH V3 02/16] perf: Fix the throttle logic for a group
Posted by Ingo Molnar 7 months ago
* kan.liang@linux.intel.com <kan.liang@linux.intel.com> wrote:

> The throttle only happens when an event is overflowed. Stop the entire
> group when any event in the group triggers the throttle.
> The MAX_INTERRUPTS is set to all throttle events.

Since this is a relatively long series with a healthy dose of 
breakage-risk, I'm wondering about bisectability:

 - patch #2 auto-throttles groups, ie. stops the PMU

 - patches #3-#16 remove explicit PMU-stop calls.

In the interim commits, will the double PMU-stop in drivers not updated 
yet do anything noticeable, such as generate warnings, etc?

Thanks,

	Ingo
Re: [PATCH V3 02/16] perf: Fix the throttle logic for a group
Posted by Liang, Kan 7 months ago

On 2025-05-17 4:22 a.m., Ingo Molnar wrote:
> 
> * kan.liang@linux.intel.com <kan.liang@linux.intel.com> wrote:
> 
>> The throttle only happens when an event is overflowed. Stop the entire
>> group when any event in the group triggers the throttle.
>> The MAX_INTERRUPTS is set to all throttle events.
> 
> Since this is a relatively long series with a healthy dose of 
> breakage-risk, I'm wondering about bisectability:
> 
>  - patch #2 auto-throttles groups, ie. stops the PMU
> 
>  - patches #3-#16 removes explicit PMU-stop calls.
> 
> In the interim commits, will the double PMU-stop in drivers not updated 
> yet do anything noticeable, such as generate warnings, etc?
> 

The short answer is no.

Here are the details for different ARCHs.

There is an active_mask to track the active counters/events on X86. The
current implementation checks the corresponding bit first. If it is
already cleared, it does nothing, which avoids the double PMU-stop. I've
tested this on my machine.
AMD and Zhaoxin share the same x86_pmu_stop() as Intel, so they are OK
as well.

powerpc, S390, ARC, sparc and xtensa use the PERF_HES_STOPPED flag
instead. If the flag has already been set, they do nothing. This also
avoids the double PMU-stop.

ARM, apple m1, csky, loongarch and mips invoke disable_event, rather
than PMU stop. disable_event unconditionally disables the counter
register. It doesn't check whether the register is already disabled. But
I don't think writing a register twice can trigger any issue.

Alpha uses the PERF_HES_STOPPED flag, but it seems to still write the
counter register even if it's already disabled, because cpuc->enabled
is used to decide whether to write to the register, and it's not updated
in alpha_pmu_stop(). But again, I don't think writing a register twice
can trigger any issue.
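
The three behaviours above can be contrasted in a userspace sketch. The
ds_* names and the hw_writes counter are made up for illustration, and
DS_HES_STOPPED only stands in for PERF_HES_STOPPED; none of this is taken
from the drivers themselves.

```c
#define DS_HES_STOPPED 0x01	/* stand-in for PERF_HES_STOPPED */

struct ds_counter {
	unsigned long active_mask;	/* one bit per active counter */
	int idx;			/* counter index of this event */
	unsigned int state;		/* mirrors hw.state flags */
	unsigned int hw_writes;		/* pretend register writes */
};

/* active_mask guard: a second stop finds the bit cleared and returns. */
static void ds_stop_active_mask(struct ds_counter *c)
{
	unsigned long bit = 1UL << c->idx;

	if (!(c->active_mask & bit))
		return;
	c->active_mask &= ~bit;
	c->hw_writes++;
}

/* Flag guard: a second stop sees the STOPPED flag and returns. */
static void ds_stop_hes_flag(struct ds_counter *c)
{
	if (c->state & DS_HES_STOPPED)
		return;
	c->hw_writes++;
	c->state |= DS_HES_STOPPED;
}

/* No guard: every stop writes the register again. */
static void ds_stop_unconditional(struct ds_counter *c)
{
	c->hw_writes++;
}
```

Calling each variant twice gives one register write for the two guarded
versions and two writes for the unguarded one, which is the redundant but
harmless case described above.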

Thanks,
Kan