From: Kan Liang <kan.liang@linux.intel.com>

Changes since V1:
- Add a check that the reload value cannot exceed the max period
- Avoid invoking intel_pmu_enable_acr() for the perf metrics event
- Update the comments to explain the case in which event->attr.config2
  exceeds the group size

The relative rates among two or more events are useful for performance
analysis, e.g., a high branch miss rate may indicate a performance
issue. Usually, the samples with a relative rate that exceeds some
threshold are the most useful. However, traditional sampling takes
samples of each event separately. To get the relative rates among two
or more events, a high sample rate is required, which brings high
overhead. Many samples taken in non-hotspot areas are also dropped
(useless) in post-processing.

Auto Counter Reload (ACR) provides a means for software to specify that,
for each supported counter, the hardware should automatically reload the
counter to a specified initial value upon overflow of chosen counters.
This mechanism enables software to sample based on the relative rate of
two (or more) events, such that a sample (PMI or PEBS) is taken only if
the rate of one event exceeds some threshold relative to the rate of
another event. Taking a PMI or PEBS only when the relative rate of
perfmon events crosses a threshold can have significantly less
performance overhead than other techniques.

The details can be found in Intel Architecture Instruction Set
Extensions and Future Features (053), Section 8.7, "AUTO COUNTER
RELOAD".

Examples:

Here is a snippet of mispredict.c. Since the array holds random
numbers, the jumps are random and often mispredicted. The misprediction
rate depends on the compared value.

For Loop 1, ~11% of all branches are mispredicted.
For Loop 2, ~21% of all branches are mispredicted.

main()
{
	...
	for (i = 0; i < N; i++)
		data[i] = rand() % 256;
	...
	/* Loop 1 */
	for (k = 0; k < 50; k++)
		for (i = 0; i < N; i++)
			if (data[i] >= 64)
				sum += data[i];
	...

	...
	/* Loop 2 */
	for (k = 0; k < 50; k++)
		for (i = 0; i < N; i++)
			if (data[i] >= 128)
				sum += data[i];
	...
}

Usually, code with a high branch miss rate performs badly. To understand
the branch miss rate of the code, the traditional method is to sample
both the branches and branch-misses events. E.g.,

perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
            -c 1000000 -- ./mispredict

[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]

The 5106 samples are from both events and spread across both loops. In
the post-processing stage, a user can see that Loop 2 has a 21% branch
miss rate. They can then focus on the branch-misses samples for Loop 2.

With this patch, the user can generate samples only when the branch
miss rate > 20%.

perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
                cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
            -- ./mispredict

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]

$ perf report

Percent        │154:   movl   $0x0,-0x14(%rbp)
               │     ↓ jmp    1af
               │       for (i = j; i < N; i++)
               │15d:   mov    -0x10(%rbp),%eax
               │       mov    %eax,-0x18(%rbp)
               │     ↓ jmp    1a2
               │       if (data[i] >= 128)
               │165:   mov    -0x18(%rbp),%eax
               │       cltq
               │       lea    0x0(,%rax,4),%rdx
               │       mov    -0x8(%rbp),%rax
               │       add    %rdx,%rax
               │       mov    (%rax),%eax
               │     ┌──cmp    $0x7f,%eax
 100.00   0.00 │     ├──jle    19e
               │     │  sum += data[i];

The 2498 samples are all from the branch-misses event for Loop 2. The
number of samples and the overhead are significantly reduced without
losing any information.
Kan Liang (3):
  perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
  perf/x86/intel: Add the enumeration and flag for the auto counter
    reload
  perf/x86/intel: Support auto counter reload

 arch/x86/events/intel/core.c       | 262 ++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h       |  21 +++
 arch/x86/events/perf_event_flags.h |   2 +-
 arch/x86/include/asm/msr-index.h   |   4 +
 arch/x86/include/asm/perf_event.h  |   4 +-
 include/linux/perf_event.h         |   2 +
 6 files changed, 288 insertions(+), 7 deletions(-)

--
2.38.1
Hi Peter,

Ping. Could you please let me know if you have any comments?

Thanks,
Kan

On 2024-10-10 3:28 p.m., kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> [...]