From: Kan Liang <kan.liang@linux.intel.com>
Changes since V1:
- Add a check that the reload value cannot exceed the max period.
- Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
- Update the comments to explain the case in which event->attr.config2
  exceeds the group size.
The relative rates among two or more events are useful for performance
analysis, e.g., a high branch miss rate may indicate a performance
issue. Usually, the samples with a relative rate that exceeds some
threshold are the more useful ones. However, traditional sampling
samples each event separately. To get the relative rates among two or
more events, a high sample rate is required, which can bring high
overhead. Many samples taken in non-hotspot areas are also dropped as
useless during post-processing.
Auto Counter Reload (ACR) provides a means for software to specify that,
for each supported counter, the hardware should automatically reload the
counter to a specified initial value upon overflow of chosen counters.
This mechanism enables software to sample based on the relative rate of
two (or more) events, such that a sample (PMI or PEBS) is taken only if
the rate of one event exceeds some threshold relative to the rate of
another event. Taking a PMI or PEBS only when the relative rate of
perfmon events crosses a threshold can have significantly less
performance overhead than other techniques.
The details can be found in the Intel Architecture Instruction Set
Extensions and Future Features document (053), section 8.7 AUTO COUNTER
RELOAD.
Examples:
Here is a snippet of mispredict.c. Since the array holds random
numbers, the jumps are random and often mispredicted.
The misprediction rate depends on the compared value.
For Loop 1, ~11% of all branches are mispredicted.
For Loop 2, ~21% of all branches are mispredicted.
main()
{
	...
	for (i = 0; i < N; i++)
		data[i] = rand() % 256;
	...
	/* Loop 1 */
	for (k = 0; k < 50; k++)
		for (i = 0; i < N; i++)
			if (data[i] >= 64)
				sum += data[i];
	...

	...
	/* Loop 2 */
	for (k = 0; k < 50; k++)
		for (i = 0; i < N; i++)
			if (data[i] >= 128)
				sum += data[i];
	...
}
Usually, code with a high branch miss rate performs poorly.
To understand the branch miss rate of the code, the traditional method
usually samples both the branches and the branch-misses events, e.g.,
perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
-c 1000000 -- ./mispredict
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
The 5106 samples come from both events and are spread across both loops.
In the post-processing stage, a user can learn that Loop 2 has a 21%
branch miss rate. Then they can focus on the samples of the
branch-misses event for Loop 2.
With this patch, the user can generate samples only when the branch
miss rate is > 20%.
perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
-- ./mispredict
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]
$perf report
Percent │154: movl $0x0,-0x14(%rbp)
│ ↓ jmp 1af
│ for (i = j; i < N; i++)
│15d: mov -0x10(%rbp),%eax
│ mov %eax,-0x18(%rbp)
│ ↓ jmp 1a2
│ if (data[i] >= 128)
│165: mov -0x18(%rbp),%eax
│ cltq
│ lea 0x0(,%rax,4),%rdx
│ mov -0x8(%rbp),%rax
│ add %rdx,%rax
│ mov (%rax),%eax
│ ┌──cmp $0x7f,%eax
100.00 0.00 │ ├──jle 19e
│ │sum += data[i];
The 2498 samples are all from the branch-misses event for Loop 2.
The number of samples and the overhead are significantly reduced without
losing any information.
Kan Liang (3):
perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
perf/x86/intel: Add the enumeration and flag for the auto counter
reload
perf/x86/intel: Support auto counter reload
arch/x86/events/intel/core.c | 262 ++++++++++++++++++++++++++++-
arch/x86/events/perf_event.h | 21 +++
arch/x86/events/perf_event_flags.h | 2 +-
arch/x86/include/asm/msr-index.h | 4 +
arch/x86/include/asm/perf_event.h | 4 +-
include/linux/perf_event.h | 2 +
6 files changed, 288 insertions(+), 7 deletions(-)
--
2.38.1
* kan.liang@linux.intel.com <kan.liang@linux.intel.com> wrote:

> From: Kan Liang <kan.liang@linux.intel.com>
>
> Changes since V1:
> - Add a check to the reload value which cannot exceeds the max period
> - Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
> - Update comments explain to case which the event->attr.config2 exceeds
>   the group size

> The 2498 samples are all from the branch-misses events for the Loop 2.
>
> The number of samples and overhead is significantly reduced without
> losing any information.

Ok, that looks like a pretty sweet PMU feature.

What is the hardware support range of this auto count reload feature,
how recent CPU does one have to have?

The series has aged a bit though, while a variant of patch #1 has been
merged already under:

  47a973fd7563 perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF

... but #2 and #3 don't apply cleanly anymore.

Mind sending a refreshed series perhaps?

Thanks,

	Ingo
On 2025-03-14 5:51 a.m., Ingo Molnar wrote:
>
> * kan.liang@linux.intel.com <kan.liang@linux.intel.com> wrote:
>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Changes since V1:
>> - Add a check to the reload value which cannot exceeds the max period
>> - Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
>> - Update comments explain to case which the event->attr.config2 exceeds
>>   the group size
>
>> The 2498 samples are all from the branch-misses events for the Loop 2.
>>
>> The number of samples and overhead is significantly reduced without
>> losing any information.
>
> Ok, that looks like a pretty sweet PMU feature.
>

Thanks for the review.

> What is the hardware support range of this auto count reload feature,
> how recent CPU does one have to have?

The feature was first introduced in the Sierra Forest server, which was
launched last year.
https://en.wikipedia.org/wiki/Sierra_Forest
All future platforms should support it as well.

>
> The series has aged a bit though, while a variant of patch #1 has been
> merged already under:
>
>   47a973fd7563 perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
>
> ... but #2 and #3 don't apply cleanly anymore.
>
> Mind sending a refreshed series perhaps?

No problem. I saw that Peter just gave some feedback. I will also
address his concerns in the new series.

Thanks,
Kan
Hi Peter,
Ping. Could you please let me know if you have any comments.
Thanks,
Kan
On 2024-10-10 3:28 p.m., kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Changes since V1:
> - Add a check to the reload value which cannot exceeds the max period
> - Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
> - Update comments explain to case which the event->attr.config2 exceeds
> the group size
>
> The relative rates among two or more events are useful for performance
> analysis, e.g., a high branch miss rate may indicate a performance
> issue. Usually, the samples with a relative rate that exceeds some
> threshold are more useful. However, the traditional sampling takes
> samples of events separately. To get the relative rates among two or
> more events, a high sample rate is required, which can bring high
> overhead. Many samples taken in the non-hotspot area are also dropped
> (useless) in the post-process.
>
> Auto Counter Reload (ACR) provides a means for software to specify that,
> for each supported counter, the hardware should automatically reload the
> counter to a specified initial value upon overflow of chosen counters.
> This mechanism enables software to sample based on the relative rate of
> two (or more) events, such that a sample (PMI or PEBS) is taken only if
> the rate of one event exceeds some threshold relative to the rate of
> another event. Taking a PMI or PEBS only when the relative rate of
> perfmon events crosses a threshold can have significantly less
> performance overhead than other techniques.
>
> The details can be found at Intel Architecture Instruction Set
> Extensions and Future Features (053) 8.7 AUTO COUNTER RELOAD.
>
> Examples:
>
> Here is the snippet of the mispredict.c. Since the array has random
> numbers, jumps are random and often mispredicted.
> The mispredicted rate depends on the compared value.
>
> For the Loop1, ~11% of all branches are mispredicted.
> For the Loop2, ~21% of all branches are mispredicted.
>
> main()
> {
> ...
> for (i = 0; i < N; i++)
> data[i] = rand() % 256;
> ...
> /* Loop 1 */
> for (k = 0; k < 50; k++)
> for (i = 0; i < N; i++)
> if (data[i] >= 64)
> sum += data[i];
> ...
>
> ...
> /* Loop 2 */
> for (k = 0; k < 50; k++)
> for (i = 0; i < N; i++)
> if (data[i] >= 128)
> sum += data[i];
> ...
> }
>
> Usually, a code with a high branch miss rate means a bad performance.
> To understand the branch miss rate of the codes, the traditional method
> usually sample both branches and branch-misses events. E.g.,
> perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
> -c 1000000 -- ./mispredict
>
> [ perf record: Woken up 4 times to write data ]
> [ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
> The 5106 samples are from both events and spread in both Loops.
> In the post process stage, a user can know that the Loop 2 has a 21%
> branch miss rate. Then they can focus on the samples of branch-misses
> events for the Loop 2.
>
> With this patch, the user can generate the samples only when the branch
> miss rate > 20%.
> perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
> cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
> -- ./mispredict
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]
>
> $perf report
>
> Percent │154: movl $0x0,-0x14(%rbp)
> │ ↓ jmp 1af
> │ for (i = j; i < N; i++)
> │15d: mov -0x10(%rbp),%eax
> │ mov %eax,-0x18(%rbp)
> │ ↓ jmp 1a2
> │ if (data[i] >= 128)
> │165: mov -0x18(%rbp),%eax
> │ cltq
> │ lea 0x0(,%rax,4),%rdx
> │ mov -0x8(%rbp),%rax
> │ add %rdx,%rax
> │ mov (%rax),%eax
> │ ┌──cmp $0x7f,%eax
> 100.00 0.00 │ ├──jle 19e
> │ │sum += data[i];
>
> The 2498 samples are all from the branch-misses events for the Loop 2.
>
> The number of samples and overhead is significantly reduced without
> losing any information.
>
> Kan Liang (3):
> perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
> perf/x86/intel: Add the enumeration and flag for the auto counter
> reload
> perf/x86/intel: Support auto counter reload
>
> arch/x86/events/intel/core.c | 262 ++++++++++++++++++++++++++++-
> arch/x86/events/perf_event.h | 21 +++
> arch/x86/events/perf_event_flags.h | 2 +-
> arch/x86/include/asm/msr-index.h | 4 +
> arch/x86/include/asm/perf_event.h | 4 +-
> include/linux/perf_event.h | 2 +
> 6 files changed, 288 insertions(+), 7 deletions(-)
>