[PATCH 0/3] perf vendor events amd: Address event errata

Sandipan Das posted 3 patches 7 months, 1 week ago
.../arch/x86/amdzen5/inst-cache.json          | 24 -------------------
.../arch/x86/amdzen5/load-store.json          |  6 -----
.../arch/x86/amdzen5/recommended.json         | 13 ----------
3 files changed, 43 deletions(-)
[PATCH 0/3] perf vendor events amd: Address event errata
Posted by Sandipan Das 7 months, 1 week ago
Remove unreliable Zen 5 events and metrics. The following errata from
the Revision Guide for AMD Family 1Ah Models 00h-0Fh Processors have
been addressed.
#1569 PMCx078 Counts Incorrectly in Unpredictable Ways
#1583 PMCx18E May Overcount Instruction Cache Accesses
#1587 PMCx188 May Undercount IBS (Instruction Based Sampling) Fetch Events

The document can be downloaded from
https://bugzilla.kernel.org/attachment.cgi?id=308095

Sandipan Das (3):
  perf vendor events amd: Remove Zen 5 instruction cache events
  perf vendor events amd: Remove Zen 5 TLB flush event
  perf vendor events amd: Remove Zen 5 IBS fetch event

 .../arch/x86/amdzen5/inst-cache.json          | 24 -------------------
 .../arch/x86/amdzen5/load-store.json          |  6 -----
 .../arch/x86/amdzen5/recommended.json         | 13 ----------
 3 files changed, 43 deletions(-)

-- 
2.43.0
Re: [PATCH 0/3] perf vendor events amd: Address event errata
Posted by Ian Rogers 7 months, 1 week ago
On Wed, May 7, 2025 at 7:28 AM Sandipan Das <sandipan.das@amd.com> wrote:
>
> Remove unreliable Zen 5 events and metrics. The following errata from
> the Revision Guide for AMD Family 1Ah Models 00h-0Fh Processors have
> been addressed.
> #1569 PMCx078 Counts Incorrectly in Unpredictable Ways
> #1583 PMCx18E May Overcount Instruction Cache Accesses
> #1587 PMCx188 May Undercount IBS (Instruction Based Sampling) Fetch Events
>
> The document can be downloaded from
> https://bugzilla.kernel.org/attachment.cgi?id=308095

Hi Sandipan,

the document is somewhat brief, for example:
```
1583 PMCx18E May Overcount Instruction Cache Accesses

Description
If PMCx18E[IcAccessTypes] is programmed to 18x (Instruction Cache
Miss) or 1Fx (All Instruction Cache Accesses) then the performance
counter may overcount.

Potential Effect on System
Inaccuracies in performance monitoring software may be experienced.

Suggested Workaround
None

Fix Planned
No fix planned
```
Given being able to count instruction cache accesses (for example) is
a useful feature, would it be possible to change:
```
-  {
-    "EventName": "ic_tag_hit_miss.instruction_cache_hit",
-    "EventCode": "0x18e",
-    "BriefDescription": "Instruction cache hits.",
-    "UMask": "0x07"
-  },
...
```
to be say:
```
  {
    "EventName": "ic_tag_hit_miss.instruction_cache_hit",
    "EventCode": "0x18e",
    "BriefDescription": "Instruction cache hits. Note, this counter is
affected by errata 1583.",
    "UMask": "0x07",
    "Experimental": "1"
  },
```
That is rather than remove the event, the event is tagged as
experimental (taken to mean accuracy isn't guaranteed) and the errata
is explicitly noted in the description. Currently the Experimental tag
has no impact on what happens in the perf tool, for example, the
"Deprecated" tag hides events in the `perf list` command and is
commonly used when an event is renamed.

Thanks,
Ian
> Sandipan Das (3):
>   perf vendor events amd: Remove Zen 5 instruction cache events
>   perf vendor events amd: Remove Zen 5 TLB flush event
>   perf vendor events amd: Remove Zen 5 IBS fetch event
>
>  .../arch/x86/amdzen5/inst-cache.json          | 24 -------------------
>  .../arch/x86/amdzen5/load-store.json          |  6 -----
>  .../arch/x86/amdzen5/recommended.json         | 13 ----------
>  3 files changed, 43 deletions(-)
>
> --
> 2.43.0
>
Re: [PATCH 0/3] perf vendor events amd: Address event errata
Posted by Sandipan Das 7 months, 1 week ago
On 5/7/2025 9:26 PM, Ian Rogers wrote:
> On Wed, May 7, 2025 at 7:28 AM Sandipan Das <sandipan.das@amd.com> wrote:
>>
>> Remove unreliable Zen 5 events and metrics. The following errata from
>> the Revision Guide for AMD Family 1Ah Models 00h-0Fh Processors have
>> been addressed.
>> #1569 PMCx078 Counts Incorrectly in Unpredictable Ways
>> #1583 PMCx18E May Overcount Instruction Cache Accesses
>> #1587 PMCx188 May Undercount IBS (Instruction Based Sampling) Fetch Events
>>
>> The document can be downloaded from
>> https://bugzilla.kernel.org/attachment.cgi?id=308095
> 
> Hi Sandipan,
> 
> the document is somewhat brief, for example:
> ```
> 1583 PMCx18E May Overcount Instruction Cache Accesses
> 
> Description
> If PMCx18E[IcAccessTypes] is programmed to 18x (Instruction Cache
> Miss) or 1Fx (All Instruction Cache Accesses) then the performance
> counter may overcount.
> 
> Potential Effect on System
> Inaccuracies in performance monitoring software may be experienced.
> 
> Suggested Workaround
> None
> 
> Fix Planned
> No fix planned
> ```
> Given being able to count instruction cache accesses (for example) is
> a useful feature, would it be possible to change:
> ```
> -  {
> -    "EventName": "ic_tag_hit_miss.instruction_cache_hit",
> -    "EventCode": "0x18e",
> -    "BriefDescription": "Instruction cache hits.",
> -    "UMask": "0x07"
> -  },
> ...
> ```
> to be say:
> ```
>   {
>     "EventName": "ic_tag_hit_miss.instruction_cache_hit",
>     "EventCode": "0x18e",
>     "BriefDescription": "Instruction cache hits. Note, this counter is
> affected by errata 1583.",
>     "UMask": "0x07",
>     "Experimental": "1"
>   },
> ```
> That is rather than remove the event, the event is tagged as
> experimental (taken to mean accuracy isn't guaranteed) and the errata
> is explicitly noted in the description. Currently the Experimental tag
> has no impact on what happens in the perf tool, for example, the
> "Deprecated" tag hides events in the `perf list` command and is
> commonly used when an event is renamed.
> 

I agree that events like IC hits and misses are generally useful and am
fine with the idea of keeping them but my concern is that unless users
read the event description, there is no way for them to know if the
perf output that they are seeing may be unreliable. There is also no
guarantee that such events will be fixed in a future uarch. From a
quick glance, I couldn't find a mechanism that makes perf stat/report
show a warning for named events with known issues.
Re: [PATCH 0/3] perf vendor events amd: Address event errata
Posted by Ian Rogers 7 months, 1 week ago
On Thu, May 8, 2025 at 3:56 AM Sandipan Das <sandipan.das@amd.com> wrote:
>
> On 5/7/2025 9:26 PM, Ian Rogers wrote:
> > On Wed, May 7, 2025 at 7:28 AM Sandipan Das <sandipan.das@amd.com> wrote:
> >>
> >> Remove unreliable Zen 5 events and metrics. The following errata from
> >> the Revision Guide for AMD Family 1Ah Models 00h-0Fh Processors have
> >> been addressed.
> >> #1569 PMCx078 Counts Incorrectly in Unpredictable Ways
> >> #1583 PMCx18E May Overcount Instruction Cache Accesses
> >> #1587 PMCx188 May Undercount IBS (Instruction Based Sampling) Fetch Events
> >>
> >> The document can be downloaded from
> >> https://bugzilla.kernel.org/attachment.cgi?id=308095
> >
> > Hi Sandipan,
> >
> > the document is somewhat brief, for example:
> > ```
> > 1583 PMCx18E May Overcount Instruction Cache Accesses
> >
> > Description
> > If PMCx18E[IcAccessTypes] is programmed to 18x (Instruction Cache
> > Miss) or 1Fx (All Instruction Cache Accesses) then the performance
> > counter may overcount.
> >
> > Potential Effect on System
> > Inaccuracies in performance monitoring software may be experienced.
> >
> > Suggested Workaround
> > None
> >
> > Fix Planned
> > No fix planned
> > ```
> > Given being able to count instruction cache accesses (for example) is
> > a useful feature, would it be possible to change:
> > ```
> > -  {
> > -    "EventName": "ic_tag_hit_miss.instruction_cache_hit",
> > -    "EventCode": "0x18e",
> > -    "BriefDescription": "Instruction cache hits.",
> > -    "UMask": "0x07"
> > -  },
> > ...
> > ```
> > to be say:
> > ```
> >   {
> >     "EventName": "ic_tag_hit_miss.instruction_cache_hit",
> >     "EventCode": "0x18e",
> >     "BriefDescription": "Instruction cache hits. Note, this counter is
> > affected by errata 1583.",
> >     "UMask": "0x07",
> >     "Experimental": "1"
> >   },
> > ```
> > That is rather than remove the event, the event is tagged as
> > experimental (taken to mean accuracy isn't guaranteed) and the errata
> > is explicitly noted in the description. Currently the Experimental tag
> > has no impact on what happens in the perf tool, for example, the
> > "Deprecated" tag hides events in the `perf list` command and is
> > commonly used when an event is renamed.
> >
>
> I agree that events like IC hits and misses are generally useful and am
> fine with the idea of keeping them but my concern is that unless users
> read the event description, there is no way for them to know if the
> perf output that they are seeing may be unreliable. There is also no
> guarantee that such events will be fixed in a future uarch. From a
> quick glance, I couldn't find a mechanism that makes perf stat/report
> show a warning for named events with known issues.

So I'm forgetting the flow, but rediscovering it. We do have an Errata
json value as shown in:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/pmu-events/arch/arm64/ampere/ampereone/memory.json?h=perf-tools-next#n2
```
    {
       "ArchStdEvent": "LD_RETIRED",
       "Errata": "Errata AC03_CPU_52",
       "BriefDescription": "Instruction architecturally executed,
condition code check pass, load.
Impacted by errata -"
   },
```
It doesn't impact perf stat/record but it does get added to the event
description for perf list:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/pmu-events/jevents.py?h=perf-tools-next#n340
```
    if 'Errata' in jd:
      extra_desc += '  Spec update: ' + jd['Errata']
```
which means the perf list description ends up as "Instruction
architecturally executed, condition code check pass, load. Impacted by
errata -  Spec update: Errata AC03_CPU_52". We could change this so
that the Errata is distinct in the encoded in perf json and then we
could display the errata when perf stat/record parses the event. I'd
be a little worried about this breaking things that parse perf's text
output, but the impact would be limited to Zen5, Ampere and older
Intel CPUs. We could also make the errata output conditional on
passing a verbose flag to perf. Would just `perf list` support work
for you or would the perf stat/record changes be a requirement for
keeping these events?

Thanks,
Ian
Re: [PATCH 0/3] perf vendor events amd: Address event errata
Posted by Sandipan Das 7 months, 1 week ago
On 5/8/2025 9:04 PM, Ian Rogers wrote:
> On Thu, May 8, 2025 at 3:56 AM Sandipan Das <sandipan.das@amd.com> wrote:
>>
>> On 5/7/2025 9:26 PM, Ian Rogers wrote:
>>> On Wed, May 7, 2025 at 7:28 AM Sandipan Das <sandipan.das@amd.com> wrote:
>>>>
>>>> Remove unreliable Zen 5 events and metrics. The following errata from
>>>> the Revision Guide for AMD Family 1Ah Models 00h-0Fh Processors have
>>>> been addressed.
>>>> #1569 PMCx078 Counts Incorrectly in Unpredictable Ways
>>>> #1583 PMCx18E May Overcount Instruction Cache Accesses
>>>> #1587 PMCx188 May Undercount IBS (Instruction Based Sampling) Fetch Events
>>>>
>>>> The document can be downloaded from
>>>> https://bugzilla.kernel.org/attachment.cgi?id=308095
>>>
>>> Hi Sandipan,
>>>
>>> the document is somewhat brief, for example:
>>> ```
>>> 1583 PMCx18E May Overcount Instruction Cache Accesses
>>>
>>> Description
>>> If PMCx18E[IcAccessTypes] is programmed to 18x (Instruction Cache
>>> Miss) or 1Fx (All Instruction Cache Accesses) then the performance
>>> counter may overcount.
>>>
>>> Potential Effect on System
>>> Inaccuracies in performance monitoring software may be experienced.
>>>
>>> Suggested Workaround
>>> None
>>>
>>> Fix Planned
>>> No fix planned
>>> ```
>>> Given being able to count instruction cache accesses (for example) is
>>> a useful feature, would it be possible to change:
>>> ```
>>> -  {
>>> -    "EventName": "ic_tag_hit_miss.instruction_cache_hit",
>>> -    "EventCode": "0x18e",
>>> -    "BriefDescription": "Instruction cache hits.",
>>> -    "UMask": "0x07"
>>> -  },
>>> ...
>>> ```
>>> to be say:
>>> ```
>>>   {
>>>     "EventName": "ic_tag_hit_miss.instruction_cache_hit",
>>>     "EventCode": "0x18e",
>>>     "BriefDescription": "Instruction cache hits. Note, this counter is
>>> affected by errata 1583.",
>>>     "UMask": "0x07",
>>>     "Experimental": "1"
>>>   },
>>> ```
>>> That is rather than remove the event, the event is tagged as
>>> experimental (taken to mean accuracy isn't guaranteed) and the errata
>>> is explicitly noted in the description. Currently the Experimental tag
>>> has no impact on what happens in the perf tool, for example, the
>>> "Deprecated" tag hides events in the `perf list` command and is
>>> commonly used when an event is renamed.
>>>
>>
>> I agree that events like IC hits and misses are generally useful and am
>> fine with the idea of keeping them but my concern is that unless users
>> read the event description, there is no way for them to know if the
>> perf output that they are seeing may be unreliable. There is also no
>> guarantee that such events will be fixed in a future uarch. From a
>> quick glance, I couldn't find a mechanism that makes perf stat/report
>> show a warning for named events with known issues.
> 
> So I'm forgetting the flow, but rediscovering it. We do have an Errata
> json value as shown in:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/pmu-events/arch/arm64/ampere/ampereone/memory.json?h=perf-tools-next#n2
> ```
>     {
>        "ArchStdEvent": "LD_RETIRED",
>        "Errata": "Errata AC03_CPU_52",
>        "BriefDescription": "Instruction architecturally executed,
> condition code check pass, load.
> Impacted by errata -"
>    },
> ```
> It doesn't impact perf stat/record but it does get added to the event
> description for perf list:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/pmu-events/jevents.py?h=perf-tools-next#n340
> ```
>     if 'Errata' in jd:
>       extra_desc += '  Spec update: ' + jd['Errata']
> ```
> which means the perf list description ends up as "Instruction
> architecturally executed, condition code check pass, load. Impacted by
> errata -  Spec update: Errata AC03_CPU_52". We could change this so
> that the Errata is distinct in the encoded in perf json and then we
> could display the errata when perf stat/record parses the event. I'd
> be a little worried about this breaking things that parse perf's text
> output, but the impact would be limited to Zen5, Ampere and older
> Intel CPUs. We could also make the errata output conditional on
> passing a verbose flag to perf. Would just `perf list` support work
> for you or would the perf stat/record changes be a requirement for
> keeping these events?
> 

Adding "Errata" tags to the affected events is a good start.
We can sort out the perf stat/record changes eventually.