[PATCH v4 09/22] perf jevents: Add ports metric group giving utilization on Intel

Posted by Ian Rogers 2 months ago
The ports metric group contains a metric for each port giving its
utilization as a ratio of cycles. The metrics are created by looking
for UOPS_DISPATCHED.PORT events.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/pmu-events/intel_metrics.py | 33 ++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/tools/perf/pmu-events/intel_metrics.py b/tools/perf/pmu-events/intel_metrics.py
index f4707e964f75..3ef4eb868580 100755
--- a/tools/perf/pmu-events/intel_metrics.py
+++ b/tools/perf/pmu-events/intel_metrics.py
@@ -1,12 +1,13 @@
 #!/usr/bin/env python3
 # SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
 from metric import (d_ratio, has_event, max, CheckPmu, Event, JsonEncodeMetric,
-                    JsonEncodeMetricGroupDescriptions, LoadEvents, Metric,
-                    MetricGroup, MetricRef, Select)
+                    JsonEncodeMetricGroupDescriptions, Literal, LoadEvents,
+                    Metric, MetricGroup, MetricRef, Select)
 import argparse
 import json
 import math
 import os
+import re
 from typing import Optional
 
 # Global command line arguments.
@@ -260,6 +261,33 @@ def IntelBr():
                      description="breakdown of retired branch instructions")
 
 
+def IntelPorts() -> Optional[MetricGroup]:
+  pipeline_events = json.load(open(f"{_args.events_path}/x86/{_args.model}/pipeline.json"))
+
+  core_cycles = Event("CPU_CLK_UNHALTED.THREAD_P_ANY",
+                      "CPU_CLK_UNHALTED.DISTRIBUTED",
+                      "cycles")
+  # Number of CPU cycles scaled for SMT.
+  smt_cycles = Select(core_cycles / 2, Literal("#smt_on"), core_cycles)
+
+  metrics = []
+  for x in pipeline_events:
+    if "EventName" in x and re.search("^UOPS_DISPATCHED.PORT", x["EventName"]):
+      name = x["EventName"]
+      port = re.search(r"(PORT_[0-9].*)", name).group(0).lower()
+      if name.endswith("_CORE"):
+        cyc = core_cycles
+      else:
+        cyc = smt_cycles
+      metrics.append(Metric(port, f"{port} utilization (higher is better)",
+                            d_ratio(Event(name), cyc), "100%"))
+  if len(metrics) == 0:
+    return None
+
+  return MetricGroup("ports", metrics, "functional unit (port) utilization -- "
+                     "fraction of cycles each port is utilized (higher is better)")
+
+
 def IntelSwpf() -> Optional[MetricGroup]:
   ins = Event("instructions")
   try:
@@ -352,6 +380,7 @@ def main() -> None:
       Smi(),
       Tsx(),
       IntelBr(),
+      IntelPorts(),
       IntelSwpf(),
   ])
 
-- 
2.46.1.824.gd892dcdcdd-goog
Re: [PATCH v4 09/22] perf jevents: Add ports metric group giving utilization on Intel
Posted by Liang, Kan 3 weeks ago

On 2024-09-26 1:50 p.m., Ian Rogers wrote:
> The ports metric group contains a metric for each port giving its
> utilization as a ratio of cycles. The metrics are created by looking
> for UOPS_DISPATCHED.PORT events.
> 
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/pmu-events/intel_metrics.py | 33 ++++++++++++++++++++++++--
>  1 file changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/pmu-events/intel_metrics.py b/tools/perf/pmu-events/intel_metrics.py
> index f4707e964f75..3ef4eb868580 100755
> --- a/tools/perf/pmu-events/intel_metrics.py
> +++ b/tools/perf/pmu-events/intel_metrics.py
> @@ -1,12 +1,13 @@
>  #!/usr/bin/env python3
>  # SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
>  from metric import (d_ratio, has_event, max, CheckPmu, Event, JsonEncodeMetric,
> -                    JsonEncodeMetricGroupDescriptions, LoadEvents, Metric,
> -                    MetricGroup, MetricRef, Select)
> +                    JsonEncodeMetricGroupDescriptions, Literal, LoadEvents,
> +                    Metric, MetricGroup, MetricRef, Select)
>  import argparse
>  import json
>  import math
>  import os
> +import re
>  from typing import Optional
>  
>  # Global command line arguments.
> @@ -260,6 +261,33 @@ def IntelBr():
>                       description="breakdown of retired branch instructions")
>  
>  
> +def IntelPorts() -> Optional[MetricGroup]:
> +  pipeline_events = json.load(open(f"{_args.events_path}/x86/{_args.model}/pipeline.json"))
> +
> +  core_cycles = Event("CPU_CLK_UNHALTED.THREAD_P_ANY",
> +                      "CPU_CLK_UNHALTED.DISTRIBUTED",
> +                      "cycles")
> +  # Number of CPU cycles scaled for SMT.
> +  smt_cycles = Select(core_cycles / 2, Literal("#smt_on"), core_cycles)
> +
> +  metrics = []
> +  for x in pipeline_events:
> +    if "EventName" in x and re.search("^UOPS_DISPATCHED.PORT", x["EventName"]):
> +      name = x["EventName"]
> +      port = re.search(r"(PORT_[0-9].*)", name).group(0).lower()
> +      if name.endswith("_CORE"):
> +        cyc = core_cycles
> +      else:
> +        cyc = smt_cycles
> +      metrics.append(Metric(port, f"{port} utilization (higher is better)",
> +                            d_ratio(Event(name), cyc), "100%"))


The generated metrics depend heavily on the event name, which is very
fragile. We will probably have the same event in a new generation, but
with a different name, so long-term maintenance could be a problem.
Is there an idea regarding how to keep the event names in sync for new
generations?

Maybe we should improve the event generation script and add an automatic
check that tells us which metrics are missing. Then we can decide whether
to update to the new event name, drop the metric, or add a different
metric.
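
For example (only a rough sketch: "newmodel" is made up, and collecting
the generated metric names from two runs of intel_metrics.py is left
out here):

def report_missing_metrics(prev_model: str, new_model: str,
                           metrics_by_model: dict[str, set[str]]) -> None:
  # Report metrics generated for prev_model but not for new_model, so
  # we can decide what to do about each one.
  missing = metrics_by_model[prev_model] - metrics_by_model[new_model]
  for name in sorted(missing):
    print(f"{new_model}: '{name}' was generated for {prev_model} "
          "but is missing here")

# e.g. with metric names collected from two runs of the script:
report_missing_metrics("icelakex", "newmodel",
                       {"icelakex": {"port_0", "port_1"},
                        "newmodel": {"port_0"}})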

Thanks,
Kan

> +  if len(metrics) == 0:
> +    return None
> +
> +  return MetricGroup("ports", metrics, "functional unit (port) utilization -- "
> +                     "fraction of cycles each port is utilized (higher is better)")
> +
> +
>  def IntelSwpf() -> Optional[MetricGroup]:
>    ins = Event("instructions")
>    try:
> @@ -352,6 +380,7 @@ def main() -> None:
>        Smi(),
>        Tsx(),
>        IntelBr(),
> +      IntelPorts(),
>        IntelSwpf(),
>    ])
>
Re: [PATCH v4 09/22] perf jevents: Add ports metric group giving utilization on Intel
Posted by Ian Rogers 3 weeks ago
On Thu, Nov 7, 2024 at 7:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2024-09-26 1:50 p.m., Ian Rogers wrote:
> > The ports metric group contains a metric for each port giving its
> > utilization as a ratio of cycles. The metrics are created by looking
> > for UOPS_DISPATCHED.PORT events.
> >
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/pmu-events/intel_metrics.py | 33 ++++++++++++++++++++++++--
> >  1 file changed, 31 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/perf/pmu-events/intel_metrics.py b/tools/perf/pmu-events/intel_metrics.py
> > index f4707e964f75..3ef4eb868580 100755
> > --- a/tools/perf/pmu-events/intel_metrics.py
> > +++ b/tools/perf/pmu-events/intel_metrics.py
> > @@ -1,12 +1,13 @@
> >  #!/usr/bin/env python3
> >  # SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
> >  from metric import (d_ratio, has_event, max, CheckPmu, Event, JsonEncodeMetric,
> > -                    JsonEncodeMetricGroupDescriptions, LoadEvents, Metric,
> > -                    MetricGroup, MetricRef, Select)
> > +                    JsonEncodeMetricGroupDescriptions, Literal, LoadEvents,
> > +                    Metric, MetricGroup, MetricRef, Select)
> >  import argparse
> >  import json
> >  import math
> >  import os
> > +import re
> >  from typing import Optional
> >
> >  # Global command line arguments.
> > @@ -260,6 +261,33 @@ def IntelBr():
> >                       description="breakdown of retired branch instructions")
> >
> >
> > +def IntelPorts() -> Optional[MetricGroup]:
> > +  pipeline_events = json.load(open(f"{_args.events_path}/x86/{_args.model}/pipeline.json"))
> > +
> > +  core_cycles = Event("CPU_CLK_UNHALTED.THREAD_P_ANY",
> > +                      "CPU_CLK_UNHALTED.DISTRIBUTED",
> > +                      "cycles")
> > +  # Number of CPU cycles scaled for SMT.
> > +  smt_cycles = Select(core_cycles / 2, Literal("#smt_on"), core_cycles)
> > +
> > +  metrics = []
> > +  for x in pipeline_events:
> > +    if "EventName" in x and re.search("^UOPS_DISPATCHED.PORT", x["EventName"]):
> > +      name = x["EventName"]
> > +      port = re.search(r"(PORT_[0-9].*)", name).group(0).lower()
> > +      if name.endswith("_CORE"):
> > +        cyc = core_cycles
> > +      else:
> > +        cyc = smt_cycles
> > +      metrics.append(Metric(port, f"{port} utilization (higher is better)",
> > +                            d_ratio(Event(name), cyc), "100%"))
>
>
> The generated metric highly depends on the event name, which is very
> fragile. We will probably have the same event in a new generation, but
> with a different name. Long-term maintenance could be a problem.
> Is there an idea regarding how to sync the event names for new generations?

I agree that it is fragile, but it is also strangely robust: as you say,
new generations will gain support if they follow the same naming
convention. We have tests that load-bearing metrics exist on our
platforms, so maybe the appropriate place to test for existence is in
Weilin's metrics test.


> Maybe we should improve the event generation script and do an automatic
> check to tell which metrics are missed. Then we may decide if updating
> the new event name, dropping the metric or adding a different metric.

So I'm not sure it is a bug to not have the metric; if it were, we
could just throw rather than return None. We're going to run the
script for every model, including old models like Nehalem, so I've
generally kept returning None. I think doing future work on testing is
probably best. People noticing a metric is missing would also indicate
that the metric is used (not that the script aims for that :-) ).
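
For what an existence check could look like, independent of any
particular test harness (rough sketch only: the model names are just
examples and the path assumes the pmu-events arch directory):

import json
import os
import re

def ports_group_expected(events_path: str, model: str) -> bool:
  # Mirror the condition in IntelPorts(): the group is only generated
  # when pipeline.json carries at least one UOPS_DISPATCHED.PORT event.
  pipeline = os.path.join(events_path, "x86", model, "pipeline.json")
  if not os.path.isfile(pipeline):
    return False
  with open(pipeline) as f:
    events = json.load(f)
  return any("EventName" in e and
             re.search("^UOPS_DISPATCHED.PORT", e["EventName"])
             for e in events)

# Models where we'd expect the group to exist; the list is illustrative.
for model in ("skylake", "icelakex"):
  assert ports_group_expected("tools/perf/pmu-events/arch", model), \
      f"ports metric group unexpectedly missing for {model}"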

Thanks,
Ian

> Thanks,
> Kan
>
> > +  if len(metrics) == 0:
> > +    return None
> > +
> > +  return MetricGroup("ports", metrics, "functional unit (port) utilization -- "
> > +                     "fraction of cycles each port is utilized (higher is better)")
> > +
> > +
> >  def IntelSwpf() -> Optional[MetricGroup]:
> >    ins = Event("instructions")
> >    try:
> > @@ -352,6 +380,7 @@ def main() -> None:
> >        Smi(),
> >        Tsx(),
> >        IntelBr(),
> > +      IntelPorts(),
> >        IntelSwpf(),
> >    ])
> >
>
Re: [PATCH v4 09/22] perf jevents: Add ports metric group giving utilization on Intel
Posted by Liang, Kan 3 weeks ago

On 2024-11-07 12:12 p.m., Ian Rogers wrote:
> On Thu, Nov 7, 2024 at 7:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>> On 2024-09-26 1:50 p.m., Ian Rogers wrote:
>>> The ports metric group contains a metric for each port giving its
>>> utilization as a ratio of cycles. The metrics are created by looking
>>> for UOPS_DISPATCHED.PORT events.
>>>
>>> Signed-off-by: Ian Rogers <irogers@google.com>
>>> ---
>>>  tools/perf/pmu-events/intel_metrics.py | 33 ++++++++++++++++++++++++--
>>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/tools/perf/pmu-events/intel_metrics.py b/tools/perf/pmu-events/intel_metrics.py
>>> index f4707e964f75..3ef4eb868580 100755
>>> --- a/tools/perf/pmu-events/intel_metrics.py
>>> +++ b/tools/perf/pmu-events/intel_metrics.py
>>> @@ -1,12 +1,13 @@
>>>  #!/usr/bin/env python3
>>>  # SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
>>>  from metric import (d_ratio, has_event, max, CheckPmu, Event, JsonEncodeMetric,
>>> -                    JsonEncodeMetricGroupDescriptions, LoadEvents, Metric,
>>> -                    MetricGroup, MetricRef, Select)
>>> +                    JsonEncodeMetricGroupDescriptions, Literal, LoadEvents,
>>> +                    Metric, MetricGroup, MetricRef, Select)
>>>  import argparse
>>>  import json
>>>  import math
>>>  import os
>>> +import re
>>>  from typing import Optional
>>>
>>>  # Global command line arguments.
>>> @@ -260,6 +261,33 @@ def IntelBr():
>>>                       description="breakdown of retired branch instructions")
>>>
>>>
>>> +def IntelPorts() -> Optional[MetricGroup]:
>>> +  pipeline_events = json.load(open(f"{_args.events_path}/x86/{_args.model}/pipeline.json"))
>>> +
>>> +  core_cycles = Event("CPU_CLK_UNHALTED.THREAD_P_ANY",
>>> +                      "CPU_CLK_UNHALTED.DISTRIBUTED",
>>> +                      "cycles")
>>> +  # Number of CPU cycles scaled for SMT.
>>> +  smt_cycles = Select(core_cycles / 2, Literal("#smt_on"), core_cycles)
>>> +
>>> +  metrics = []
>>> +  for x in pipeline_events:
>>> +    if "EventName" in x and re.search("^UOPS_DISPATCHED.PORT", x["EventName"]):
>>> +      name = x["EventName"]
>>> +      port = re.search(r"(PORT_[0-9].*)", name).group(0).lower()
>>> +      if name.endswith("_CORE"):
>>> +        cyc = core_cycles
>>> +      else:
>>> +        cyc = smt_cycles
>>> +      metrics.append(Metric(port, f"{port} utilization (higher is better)",
>>> +                            d_ratio(Event(name), cyc), "100%"))
>>
>> The generated metric highly depends on the event name, which is very
>> fragile. We will probably have the same event in a new generation, but
>> with a different name. Long-term maintenance could be a problem.
>> Is there an idea regarding how to sync the event names for new generations?
> I agree with the idea that it is fragile, it is also strangely robust
> as you say, new generations will gain support if they follow the same
> naming convention. We have tests that load bearing metrics exists on
> our platforms so maybe the appropriate place to test for existence is
> in Weilin's metrics test.
> 
> 
>> Maybe we should improve the event generation script and do an automatic
>> check to tell which metrics are missed. Then we may decide if updating
>> the new event name, dropping the metric or adding a different metric.
> So I'm not sure it is a bug to not have the metric, if it were we
> could just throw rather than return None. We're going to run the
> script for every model including old models like nehalem, so I've
> generally kept it as None. I think doing future work on testing is
> probably best. It would also indicate use of the metric if people
> notice it missing (not that the script aims for that 🙂 ).

Maintenance is still a concern, even if we have a way to test it.
There are already "official" metrics published on GitHub, which are
maintained by Intel. To be honest, I don't think there is enough energy
to also maintain these "non-official" metrics.

I don't think missing these metrics should be treated as a bug, so it's
very likely that such an issue will not be addressed right away. If we
cannot keep these metrics updated for future platforms, I can't find a
reason to have them.

Thanks,
Kan
Re: [PATCH v4 09/22] perf jevents: Add ports metric group giving utilization on Intel
Posted by Ian Rogers 3 weeks ago
On Thu, Nov 7, 2024 at 11:36 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
> On 2024-11-07 12:12 p.m., Ian Rogers wrote:
> > On Thu, Nov 7, 2024 at 7:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>
> >>
> >> On 2024-09-26 1:50 p.m., Ian Rogers wrote:
> >>> The ports metric group contains a metric for each port giving its
> >>> utilization as a ratio of cycles. The metrics are created by looking
> >>> for UOPS_DISPATCHED.PORT events.
> >>>
> >>> Signed-off-by: Ian Rogers <irogers@google.com>
> >>> ---
> >>>  tools/perf/pmu-events/intel_metrics.py | 33 ++++++++++++++++++++++++--
> >>>  1 file changed, 31 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/tools/perf/pmu-events/intel_metrics.py b/tools/perf/pmu-events/intel_metrics.py
> >>> index f4707e964f75..3ef4eb868580 100755
> >>> --- a/tools/perf/pmu-events/intel_metrics.py
> >>> +++ b/tools/perf/pmu-events/intel_metrics.py
> >>> @@ -1,12 +1,13 @@
> >>>  #!/usr/bin/env python3
> >>>  # SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
> >>>  from metric import (d_ratio, has_event, max, CheckPmu, Event, JsonEncodeMetric,
> >>> -                    JsonEncodeMetricGroupDescriptions, LoadEvents, Metric,
> >>> -                    MetricGroup, MetricRef, Select)
> >>> +                    JsonEncodeMetricGroupDescriptions, Literal, LoadEvents,
> >>> +                    Metric, MetricGroup, MetricRef, Select)
> >>>  import argparse
> >>>  import json
> >>>  import math
> >>>  import os
> >>> +import re
> >>>  from typing import Optional
> >>>
> >>>  # Global command line arguments.
> >>> @@ -260,6 +261,33 @@ def IntelBr():
> >>>                       description="breakdown of retired branch instructions")
> >>>
> >>>
> >>> +def IntelPorts() -> Optional[MetricGroup]:
> >>> +  pipeline_events = json.load(open(f"{_args.events_path}/x86/{_args.model}/pipeline.json"))
> >>> +
> >>> +  core_cycles = Event("CPU_CLK_UNHALTED.THREAD_P_ANY",
> >>> +                      "CPU_CLK_UNHALTED.DISTRIBUTED",
> >>> +                      "cycles")
> >>> +  # Number of CPU cycles scaled for SMT.
> >>> +  smt_cycles = Select(core_cycles / 2, Literal("#smt_on"), core_cycles)
> >>> +
> >>> +  metrics = []
> >>> +  for x in pipeline_events:
> >>> +    if "EventName" in x and re.search("^UOPS_DISPATCHED.PORT", x["EventName"]):
> >>> +      name = x["EventName"]
> >>> +      port = re.search(r"(PORT_[0-9].*)", name).group(0).lower()
> >>> +      if name.endswith("_CORE"):
> >>> +        cyc = core_cycles
> >>> +      else:
> >>> +        cyc = smt_cycles
> >>> +      metrics.append(Metric(port, f"{port} utilization (higher is better)",
> >>> +                            d_ratio(Event(name), cyc), "100%"))
> >>
> >> The generated metric highly depends on the event name, which is very
> >> fragile. We will probably have the same event in a new generation, but
> >> with a different name. Long-term maintenance could be a problem.
> >> Is there an idea regarding how to sync the event names for new generations?
> > I agree with the idea that it is fragile, it is also strangely robust
> > as you say, new generations will gain support if they follow the same
> > naming convention. We have tests that load bearing metrics exists on
> > our platforms so maybe the appropriate place to test for existence is
> > in Weilin's metrics test.
> >
> >
> >> Maybe we should improve the event generation script and do an automatic
> >> check to tell which metrics are missed. Then we may decide if updating
> >> the new event name, dropping the metric or adding a different metric.
> > So I'm not sure it is a bug to not have the metric, if it were we
> > could just throw rather than return None. We're going to run the
> > script for every model including old models like nehalem, so I've
> > generally kept it as None. I think doing future work on testing is
> > probably best. It would also indicate use of the metric if people
> > notice it missing (not that the script aims for that 🙂 ).
>
> The maintenance is still a concern, even if we have a way to test it
> out. There is already an "official" metric published in GitHub, which is
> maintained by Intel. To be honest, I don't think there is more energy to
> maintain these "non-official" metrics.
>
> I don't think it should be a bug without these metrics. So it's very
> likely that the issue will not be addressed right away. If we cannot
> keep these metrics updated for the future platforms, I couldn't find a
> reason to have them.

So I think there are a few things:
1) I'd like there to be a non-JSON infrastructure for events that can
handle multiple models. Some failings of JSON are its inability to
validate events, long lines, lack of comments, metric expression
strings that aren't inherently sound, etc. I'd like to make it so we
can have JSON metrics for everything, i.e. remove the hardcoded metrics
that play badly with event sharing, etc. Doing this by updating every
JSON file by hand would be tedious and excessively noisy.
2) There are "official" metrics from Intel, and I've worked towards
establishing those. That doesn't mean every Intel metric is in the
official metrics: servers are better served than client machines, and
core TMA metrics are well served but uncore less so.
3) Are perf metrics perfect, with some kind of warranty? Well no; as
your reviews in this thread have shown, the SMI cost metric on hybrid
is likely broken. We don't intentionally have broken metrics, and we
fix them asap when they come up. GPLv2 has an explicit "no warranty"
section. Now that Intel has experimental and non-experimental events,
we update the comments of metrics that use experimental events to
highlight that the underlying events are experimental:
https://github.com/googleprodkernel/linux-perf/blob/google_tools_master/tools/perf/pmu-events/metric.py#L598
If there are bugs in the metrics, then open source, sharing and fixing
benefits everyone.
4) Am I looking for energy from Intel to maintain these metrics? No.
I'm trying to stop carrying these patches just inside Google's tree.

Thanks,
Ian
Re: [PATCH v4 09/22] perf jevents: Add ports metric group giving utilization on Intel
Posted by Liang, Kan 2 weeks, 6 days ago

On 2024-11-07 4:00 p.m., Ian Rogers wrote:
> On Thu, Nov 7, 2024 at 11:36 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2024-11-07 12:12 p.m., Ian Rogers wrote:
>>> On Thu, Nov 7, 2024 at 7:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>
>>>> On 2024-09-26 1:50 p.m., Ian Rogers wrote:
>>>>> The ports metric group contains a metric for each port giving its
>>>>> utilization as a ratio of cycles. The metrics are created by looking
>>>>> for UOPS_DISPATCHED.PORT events.
>>>>>
>>>>> Signed-off-by: Ian Rogers <irogers@google.com>
>>>>> ---
>>>>>  tools/perf/pmu-events/intel_metrics.py | 33 ++++++++++++++++++++++++--
>>>>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/tools/perf/pmu-events/intel_metrics.py b/tools/perf/pmu-events/intel_metrics.py
>>>>> index f4707e964f75..3ef4eb868580 100755
>>>>> --- a/tools/perf/pmu-events/intel_metrics.py
>>>>> +++ b/tools/perf/pmu-events/intel_metrics.py
>>>>> @@ -1,12 +1,13 @@
>>>>>  #!/usr/bin/env python3
>>>>>  # SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
>>>>>  from metric import (d_ratio, has_event, max, CheckPmu, Event, JsonEncodeMetric,
>>>>> -                    JsonEncodeMetricGroupDescriptions, LoadEvents, Metric,
>>>>> -                    MetricGroup, MetricRef, Select)
>>>>> +                    JsonEncodeMetricGroupDescriptions, Literal, LoadEvents,
>>>>> +                    Metric, MetricGroup, MetricRef, Select)
>>>>>  import argparse
>>>>>  import json
>>>>>  import math
>>>>>  import os
>>>>> +import re
>>>>>  from typing import Optional
>>>>>
>>>>>  # Global command line arguments.
>>>>> @@ -260,6 +261,33 @@ def IntelBr():
>>>>>                       description="breakdown of retired branch instructions")
>>>>>
>>>>>
>>>>> +def IntelPorts() -> Optional[MetricGroup]:
>>>>> +  pipeline_events = json.load(open(f"{_args.events_path}/x86/{_args.model}/pipeline.json"))
>>>>> +
>>>>> +  core_cycles = Event("CPU_CLK_UNHALTED.THREAD_P_ANY",
>>>>> +                      "CPU_CLK_UNHALTED.DISTRIBUTED",
>>>>> +                      "cycles")
>>>>> +  # Number of CPU cycles scaled for SMT.
>>>>> +  smt_cycles = Select(core_cycles / 2, Literal("#smt_on"), core_cycles)
>>>>> +
>>>>> +  metrics = []
>>>>> +  for x in pipeline_events:
>>>>> +    if "EventName" in x and re.search("^UOPS_DISPATCHED.PORT", x["EventName"]):
>>>>> +      name = x["EventName"]
>>>>> +      port = re.search(r"(PORT_[0-9].*)", name).group(0).lower()
>>>>> +      if name.endswith("_CORE"):
>>>>> +        cyc = core_cycles
>>>>> +      else:
>>>>> +        cyc = smt_cycles
>>>>> +      metrics.append(Metric(port, f"{port} utilization (higher is better)",
>>>>> +                            d_ratio(Event(name), cyc), "100%"))
>>>> The generated metric highly depends on the event name, which is very
>>>> fragile. We will probably have the same event in a new generation, but
>>>> with a different name. Long-term maintenance could be a problem.
>>>> Is there an idea regarding how to sync the event names for new generations?
>>> I agree with the idea that it is fragile, it is also strangely robust
>>> as you say, new generations will gain support if they follow the same
>>> naming convention. We have tests that load bearing metrics exists on
>>> our platforms so maybe the appropriate place to test for existence is
>>> in Weilin's metrics test.
>>>
>>>
>>>> Maybe we should improve the event generation script and do an automatic
>>>> check to tell which metrics are missed. Then we may decide if updating
>>>> the new event name, dropping the metric or adding a different metric.
>>> So I'm not sure it is a bug to not have the metric, if it were we
>>> could just throw rather than return None. We're going to run the
>>> script for every model including old models like nehalem, so I've
>>> generally kept it as None. I think doing future work on testing is
>>> probably best. It would also indicate use of the metric if people
>>> notice it missing (not that the script aims for that 🙂 ).
>> The maintenance is still a concern, even if we have a way to test it
>> out. There is already an "official" metric published in GitHub, which is
>> maintained by Intel. To be honest, I don't think there is more energy to
>> maintain these "non-official" metrics.
>>
>> I don't think it should be a bug without these metrics. So it's very
>> likely that the issue will not be addressed right away. If we cannot
>> keep these metrics updated for the future platforms, I couldn't find a
>> reason to have them.
> So I think there are a few things:
> 1) I'd like there to be a non-json infrastructure for events that can
> handle multiple models. Some failings of json are its inability to
> validate events, long lines, lack of comments, metric expression
> strings that aren't inherently sound, etc.  I'd like to make it so we
> can have json metrics for everything, ie remove the hardcoded metrics
> that play badly with event sharing, etc. Doing this by updating every
> json would be tedious and excessively noisy.
> 2) There are "official" metrics from Intel and I've worked for the
> establishment of that. That doesn't mean every Intel metric is in the
> official metrics. Servers are better served than client machines. Core
> TMA metrics are well served but uncore less so.
> 3) Are perf metrics perfect with some kind of warranty? Well no, as
> your reviews in this thread have shown SMI cost on hybrid is likely
> broken. We don't intentionally try to have broken metrics and fix them
> asap when they come up. GPLv2 has an explicit "no warranty" section.
> Now Intel have experimental and non-experimental events, we update the
> comments of metrics using those to highlight that the underlying
> events are experimental:
> https://github.com/googleprodkernel/linux-perf/blob/google_tools_master/tools/perf/pmu-events/metric.py#L598
> If there are bugs in the metrics then open source, sharing and fixing
> benefits everyone.
> 4) Am I looking for energy from Intel to maintain these metrics? No.
> I'm trying to stop carrying the patches just inside of Google's tree.

Got it. Thanks for the clarification.

IIUC, the generated metrics are based on the contributors' best
knowledge, the initial source being Google's tree, and it should be the
contributors' responsibility to maintain or update the metrics.
If so, I agree that they are a good supplement.

I have some general comments.
- I think we need a way to distinguish these metrics, e.g. by adding a
dedicated prefix. I will not be surprised if some customers bring the
metrics to Intel's or another vendor's customer service and ask why
they don't work on some platforms. I don't think they will get any
useful information there. The best way is to report any issues here, so
we can fix and update the metrics.
- All events, no matter whether they come from the JSON files or are
exposed by the kernel, have to be checked before showing the metrics,
because we cannot guarantee the availability of an event, even for
architectural events. We may introduce a wrapper for Metric() that
checks all the involved events, so we don't need to add the try/except
boilerplate in each patch (a rough sketch of what I mean is below).
- In the perf test, it's better to ignore errors from these metrics so
they don't block things, but we could show a warning for them with
-vvv. Issues can still be found and fixed that way.
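
What I have in mind is roughly the following, inside intel_metrics.py
(only a sketch; it assumes Event() keeps raising when none of its
candidate names resolve for the model, which is what the existing
try/except blocks rely on):

def TryMetric(name, desc, make_expr, scale_unit):
  # Hypothetical wrapper: build the expression lazily so that a missing
  # event (Event() raising) drops just this metric, instead of needing
  # try/except in every metric function.
  try:
    return Metric(name, desc, make_expr(), scale_unit)
  except Exception:
    return None

# A metric function would then only collect metrics whose events exist:
#   m = TryMetric(port, f"{port} utilization (higher is better)",
#                 lambda: d_ratio(Event(name), cyc), "100%")
#   if m:
#     metrics.append(m)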

Thanks,
Kan