[RFC PATCH v10 0/8] TPEBS counting mode support

weilin.wang@intel.com posted 8 patches 1 year, 6 months ago
There is a newer version of this series
tools/perf/Documentation/perf-list.txt        |    1 +
tools/perf/Documentation/topdown.txt          |   30 +
tools/perf/arch/x86/util/evlist.c             |    6 +
tools/perf/builtin-stat.c                     |    8 +
.../arch/x86/meteorlake/metricgroups.json     |  140 +
.../arch/x86/meteorlake/mtl-metrics.json      | 2595 +++++++++++++++++
.../perf/tests/shell/test_stat_intel_tpebs.sh |   19 +
tools/perf/util/Build                         |    1 +
tools/perf/util/data.c                        |    7 +-
tools/perf/util/evsel.c                       |   26 +
tools/perf/util/evsel.h                       |    6 +
tools/perf/util/intel-tpebs.c                 |  397 +++
tools/perf/util/intel-tpebs.h                 |   48 +
tools/perf/util/parse-events.c                |    2 +
tools/perf/util/parse-events.h                |    1 +
tools/perf/util/parse-events.l                |    3 +-
16 files changed, 3288 insertions(+), 2 deletions(-)
create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json
create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json
create mode 100755 tools/perf/tests/shell/test_stat_intel_tpebs.sh
create mode 100644 tools/perf/util/intel-tpebs.c
create mode 100644 tools/perf/util/intel-tpebs.h
[RFC PATCH v10 0/8] TPEBS counting mode support
Posted by weilin.wang@intel.com 1 year, 6 months ago
From: Weilin Wang <weilin.wang@intel.com>

I have tried not to count retire_latency events but did not succeed.
In particular, I tried the following methods:
 - Convert retire_latency event to dummy event in event parser.
 - Early bail out in evsel__open_cpu() and store_evsel_ids().

The first method fails and causes non-retire_latency events with the same event
name return 0 count.

The second method fails and causes all the events in the same group returning
"<not counted>" results.

Because of above, the retire_latency event will still run in counting mode.

Other changes in v10:
 - Change perf record fork from perf stat to evsel. All the major operations
 like tpebs start, stop, read_evsel should directly work through evsel.
 - Make intel-tpebs x86_64 only. This change is cross-compiled to arm64.
 - Put tpebs code to intel-tepbs and simplify intel-tpebs APIs to minimum number
of functions and variables. Update funtion name and variable names to use
consistent prefix. Also improve error handling.
 - Integrate code patch from Ian for the :R parser.
 - Update MTL metrics to TMA 4.8.

V9: https://lore.kernel.org/all/20240521173952.3397644-1-weilin.wang@intel.com/

Changes in v9:
 - Update the retire_latency result print and metric calculation method. Plugin
the value to evsel so that no special code is required.
 - Update --control:fifo to use pipe instead of named pipe.
 - Add test for TPEBS counting mode.
 - Update Document with more details.

Changes in v8:
- In this revision, the code is updated to base on Ian's patch on R modifier
parser https://lore.kernel.org/lkml/20240428053616.1125891-3-irogers@google.com/
After this change, there is no special code required for R modifier in
metricgroup.c and metricgroup.h files.

Caveat of this change:
  Ideally, we will need to add special handling to skip counting events with R
modifier in evsel. Currently, this is not implemented so the event with :R will
be both counted and sampled. Usually, in a metric formula that uses retire_latency,
it would already require to count the event. As a result, we will endup count the
same event twice. This should be able to be handled properly when we finalize our
design on evsel R modifier support.

- Move TPEBS specific code out from main perf stat code to separate files in
util/intel-tpebs.c and util/intel-tpebs.h. [Namhyung]
- Use --control:fifo to ack perf stat from forked perf record instead of sleep(2) [Namhyung]
- Add introductions about TPEBS and R modifier in Documents. [Namhyung]


Changes in v7:
- Update code and comments for better code quality [Namhyung]
- Add a separate commit for perf data [Namhyung]
- Update retire latency print function to improve alignment [Namhyung]

Changes in v6:
- Update code and add comments for better code quality [Namhyung]
- Remove the added fd var and directly pass the opened fd to data.file.fd [Namhyung]
- Add kill() to stop perf record when perf stat exists early [Namhyung]
- Add command opt check to ensure only start perf record when -a/-C given [Namhyung]
- Squash commits [Namhyung]

Changes in v5:
- Update code and add comments for better code quality [Ian]

Changes in v4:
- Remove uncessary debug print and update code and comments for better
readability and quality [Namhyung]
- Update mtl metric json file with consistent TmaL1 and TopdownL1 metricgroup

Changes in v3:
- Remove ':' when event name has '@' [Ian]
- Use 'R' as the modifier instead of "retire_latency" [Ian]

Changes in v2:
- Add MTL metric file
- Add more descriptions and example to the patch [Arnaldo]

Here is an example of running perf stat to collect a metric that uses
retire_latency value of event MEM_INST_RETIRED.STLB_HIT_STORES on a MTL system.

In this simple example, there is no MEM_INST_RETIRED.STLB_HIT_STORES sample.
Therefore, the MEM_INST_RETIRED.STLB_HIT_STORES:p count and retire_latency value
are all 0.

./perf stat -M tma_dtlb_store -a -- sleep 1

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB - ]

 Performance counter stats for 'system wide':

       181,047,168      cpu_core/TOPDOWN.SLOTS/          #      0.6 %  tma_dtlb_store
         3,195,608      cpu_core/topdown-retiring/
        40,156,649      cpu_core/topdown-mem-bound/
         3,550,925      cpu_core/topdown-bad-spec/
       117,571,818      cpu_core/topdown-fe-bound/
        57,118,087      cpu_core/topdown-be-bound/
            69,179      cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/
             4,582      cpu_core/MEM_INST_RETIRED.STLB_HIT_STORES/
        30,183,104      cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/
        30,556,790      cpu_core/CPU_CLK_UNHALTED.THREAD/
           168,486      cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/
              0.00 MEM_INST_RETIRED.STLB_HIT_STORES:p       0        0

       1.003105924 seconds time elapsed

v1:
TPEBS is one of the features provided by the next generation of Intel PMU.
Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set Extensions
Programming Reference" [1] for more details about this feature.

This set of patches supports TPEBS in counting mode. The code works in the
following way: it forks a perf record process from perf stat when retire_latency
of one or more events are used in a metric formula. Perf stat would send a
SIGTERM signal to perf record before it needs the retire latency value for
metric calculation. Perf stat will then process sample data to extract the
retire latency data for metric calculations. Currently, the code uses the
arithmetic average of retire latency values.

[1] https://www.intel.com/content/www/us/en/content-details/812218/intel-architecture-instruction-set-extensions-programming-reference.html?wapkw=future%20features


Ian Rogers (1):
  perf parse-events: Add a retirement latency modifier

Weilin Wang (7):
  perf data: Allow to use given fd in data->file.fd
  perf stat: Fork and launch perf record when perf stat needs to get
    retire latency value for a metric.
  perf stat: Plugin retire_lat value from sampled data to evsel
  perf vendor events intel: Add MTL metric json files
  perf stat: Add command line option for enabling tpebs recording
  perf Document: Add TPEBS to Documents
  perf test: Add test for Intel TPEBS counting mode

 tools/perf/Documentation/perf-list.txt        |    1 +
 tools/perf/Documentation/topdown.txt          |   30 +
 tools/perf/arch/x86/util/evlist.c             |    6 +
 tools/perf/builtin-stat.c                     |    8 +
 .../arch/x86/meteorlake/metricgroups.json     |  140 +
 .../arch/x86/meteorlake/mtl-metrics.json      | 2595 +++++++++++++++++
 .../perf/tests/shell/test_stat_intel_tpebs.sh |   19 +
 tools/perf/util/Build                         |    1 +
 tools/perf/util/data.c                        |    7 +-
 tools/perf/util/evsel.c                       |   26 +
 tools/perf/util/evsel.h                       |    6 +
 tools/perf/util/intel-tpebs.c                 |  397 +++
 tools/perf/util/intel-tpebs.h                 |   48 +
 tools/perf/util/parse-events.c                |    2 +
 tools/perf/util/parse-events.h                |    1 +
 tools/perf/util/parse-events.l                |    3 +-
 16 files changed, 3288 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json
 create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json
 create mode 100755 tools/perf/tests/shell/test_stat_intel_tpebs.sh
 create mode 100644 tools/perf/util/intel-tpebs.c
 create mode 100644 tools/perf/util/intel-tpebs.h

--
2.43.0

Re: [RFC PATCH v10 0/8] TPEBS counting mode support
Posted by Namhyung Kim 1 year, 6 months ago
Hello,

On Tue, May 28, 2024 at 11:43 PM <weilin.wang@intel.com> wrote:
>
> From: Weilin Wang <weilin.wang@intel.com>
>
> I have tried not to count retire_latency events but did not succeed.
> In particular, I tried the following methods:
>  - Convert retire_latency event to dummy event in event parser.
>  - Early bail out in evsel__open_cpu() and store_evsel_ids().
>
> The first method fails and causes non-retire_latency events with the same event
> name return 0 count.
>
> The second method fails and causes all the events in the same group returning
> "<not counted>" results.

Can you please describe where it fails?  Is it failing on other events
because the tpebs event is a leader of the group?  I think you wanted
to avoid having it in the leader position.  If we can skip any actual
operations (open/close/enable/disable/read) for the tpebs events, then
it could be fine..

Thanks,
Namhyung

>
> Because of above, the retire_latency event will still run in counting mode.
>
> Other changes in v10:
>  - Change perf record fork from perf stat to evsel. All the major operations
>  like tpebs start, stop, read_evsel should directly work through evsel.
>  - Make intel-tpebs x86_64 only. This change is cross-compiled to arm64.
>  - Put tpebs code to intel-tepbs and simplify intel-tpebs APIs to minimum number
> of functions and variables. Update funtion name and variable names to use
> consistent prefix. Also improve error handling.
>  - Integrate code patch from Ian for the :R parser.
>  - Update MTL metrics to TMA 4.8.
>
> V9: https://lore.kernel.org/all/20240521173952.3397644-1-weilin.wang@intel.com/
>
> Changes in v9:
>  - Update the retire_latency result print and metric calculation method. Plugin
> the value to evsel so that no special code is required.
>  - Update --control:fifo to use pipe instead of named pipe.
>  - Add test for TPEBS counting mode.
>  - Update Document with more details.
>
> Changes in v8:
> - In this revision, the code is updated to base on Ian's patch on R modifier
> parser https://lore.kernel.org/lkml/20240428053616.1125891-3-irogers@google.com/
> After this change, there is no special code required for R modifier in
> metricgroup.c and metricgroup.h files.
>
> Caveat of this change:
>   Ideally, we will need to add special handling to skip counting events with R
> modifier in evsel. Currently, this is not implemented so the event with :R will
> be both counted and sampled. Usually, in a metric formula that uses retire_latency,
> it would already require to count the event. As a result, we will endup count the
> same event twice. This should be able to be handled properly when we finalize our
> design on evsel R modifier support.
>
> - Move TPEBS specific code out from main perf stat code to separate files in
> util/intel-tpebs.c and util/intel-tpebs.h. [Namhyung]
> - Use --control:fifo to ack perf stat from forked perf record instead of sleep(2) [Namhyung]
> - Add introductions about TPEBS and R modifier in Documents. [Namhyung]
>
>
> Changes in v7:
> - Update code and comments for better code quality [Namhyung]
> - Add a separate commit for perf data [Namhyung]
> - Update retire latency print function to improve alignment [Namhyung]
>
> Changes in v6:
> - Update code and add comments for better code quality [Namhyung]
> - Remove the added fd var and directly pass the opened fd to data.file.fd [Namhyung]
> - Add kill() to stop perf record when perf stat exists early [Namhyung]
> - Add command opt check to ensure only start perf record when -a/-C given [Namhyung]
> - Squash commits [Namhyung]
>
> Changes in v5:
> - Update code and add comments for better code quality [Ian]
>
> Changes in v4:
> - Remove uncessary debug print and update code and comments for better
> readability and quality [Namhyung]
> - Update mtl metric json file with consistent TmaL1 and TopdownL1 metricgroup
>
> Changes in v3:
> - Remove ':' when event name has '@' [Ian]
> - Use 'R' as the modifier instead of "retire_latency" [Ian]
>
> Changes in v2:
> - Add MTL metric file
> - Add more descriptions and example to the patch [Arnaldo]
>
> Here is an example of running perf stat to collect a metric that uses
> retire_latency value of event MEM_INST_RETIRED.STLB_HIT_STORES on a MTL system.
>
> In this simple example, there is no MEM_INST_RETIRED.STLB_HIT_STORES sample.
> Therefore, the MEM_INST_RETIRED.STLB_HIT_STORES:p count and retire_latency value
> are all 0.
>
> ./perf stat -M tma_dtlb_store -a -- sleep 1
>
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.000 MB - ]
>
>  Performance counter stats for 'system wide':
>
>        181,047,168      cpu_core/TOPDOWN.SLOTS/          #      0.6 %  tma_dtlb_store
>          3,195,608      cpu_core/topdown-retiring/
>         40,156,649      cpu_core/topdown-mem-bound/
>          3,550,925      cpu_core/topdown-bad-spec/
>        117,571,818      cpu_core/topdown-fe-bound/
>         57,118,087      cpu_core/topdown-be-bound/
>             69,179      cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/
>              4,582      cpu_core/MEM_INST_RETIRED.STLB_HIT_STORES/
>         30,183,104      cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/
>         30,556,790      cpu_core/CPU_CLK_UNHALTED.THREAD/
>            168,486      cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/
>               0.00 MEM_INST_RETIRED.STLB_HIT_STORES:p       0        0
>
>        1.003105924 seconds time elapsed
>
> v1:
> TPEBS is one of the features provided by the next generation of Intel PMU.
> Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set Extensions
> Programming Reference" [1] for more details about this feature.
>
> This set of patches supports TPEBS in counting mode. The code works in the
> following way: it forks a perf record process from perf stat when retire_latency
> of one or more events are used in a metric formula. Perf stat would send a
> SIGTERM signal to perf record before it needs the retire latency value for
> metric calculation. Perf stat will then process sample data to extract the
> retire latency data for metric calculations. Currently, the code uses the
> arithmetic average of retire latency values.
>
> [1] https://www.intel.com/content/www/us/en/content-details/812218/intel-architecture-instruction-set-extensions-programming-reference.html?wapkw=future%20features
>
>
> Ian Rogers (1):
>   perf parse-events: Add a retirement latency modifier
>
> Weilin Wang (7):
>   perf data: Allow to use given fd in data->file.fd
>   perf stat: Fork and launch perf record when perf stat needs to get
>     retire latency value for a metric.
>   perf stat: Plugin retire_lat value from sampled data to evsel
>   perf vendor events intel: Add MTL metric json files
>   perf stat: Add command line option for enabling tpebs recording
>   perf Document: Add TPEBS to Documents
>   perf test: Add test for Intel TPEBS counting mode
>
>  tools/perf/Documentation/perf-list.txt        |    1 +
>  tools/perf/Documentation/topdown.txt          |   30 +
>  tools/perf/arch/x86/util/evlist.c             |    6 +
>  tools/perf/builtin-stat.c                     |    8 +
>  .../arch/x86/meteorlake/metricgroups.json     |  140 +
>  .../arch/x86/meteorlake/mtl-metrics.json      | 2595 +++++++++++++++++
>  .../perf/tests/shell/test_stat_intel_tpebs.sh |   19 +
>  tools/perf/util/Build                         |    1 +
>  tools/perf/util/data.c                        |    7 +-
>  tools/perf/util/evsel.c                       |   26 +
>  tools/perf/util/evsel.h                       |    6 +
>  tools/perf/util/intel-tpebs.c                 |  397 +++
>  tools/perf/util/intel-tpebs.h                 |   48 +
>  tools/perf/util/parse-events.c                |    2 +
>  tools/perf/util/parse-events.h                |    1 +
>  tools/perf/util/parse-events.l                |    3 +-
>  16 files changed, 3288 insertions(+), 2 deletions(-)
>  create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json
>  create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json
>  create mode 100755 tools/perf/tests/shell/test_stat_intel_tpebs.sh
>  create mode 100644 tools/perf/util/intel-tpebs.c
>  create mode 100644 tools/perf/util/intel-tpebs.h
>
> --
> 2.43.0
>
RE: [RFC PATCH v10 0/8] TPEBS counting mode support
Posted by Wang, Weilin 1 year, 6 months ago

> -----Original Message-----
> From: Namhyung Kim <namhyung@kernel.org>
> Sent: Thursday, May 30, 2024 11:37 PM
> To: Wang, Weilin <weilin.wang@intel.com>
> Cc: Ian Rogers <irogers@google.com>; Arnaldo Carvalho de Melo
> <acme@kernel.org>; Peter Zijlstra <peterz@infradead.org>; Ingo Molnar
> <mingo@redhat.com>; Alexander Shishkin
> <alexander.shishkin@linux.intel.com>; Jiri Olsa <jolsa@kernel.org>; Hunter,
> Adrian <adrian.hunter@intel.com>; Kan Liang <kan.liang@linux.intel.com>;
> linux-perf-users@vger.kernel.org; linux-kernel@vger.kernel.org; Taylor, Perry
> <perry.taylor@intel.com>; Alt, Samantha <samantha.alt@intel.com>; Biggers,
> Caleb <caleb.biggers@intel.com>
> Subject: Re: [RFC PATCH v10 0/8] TPEBS counting mode support
> 
> Hello,
> 
> On Tue, May 28, 2024 at 11:43 PM <weilin.wang@intel.com> wrote:
> >
> > From: Weilin Wang <weilin.wang@intel.com>
> >
> > I have tried not to count retire_latency events but did not succeed.
> > In particular, I tried the following methods:
> >  - Convert retire_latency event to dummy event in event parser.
> >  - Early bail out in evsel__open_cpu() and store_evsel_ids().
> >
> > The first method fails and causes non-retire_latency events with the same
> event
> > name return 0 count.
> >
> > The second method fails and causes all the events in the same group
> returning
> > "<not counted>" results.
> 
> Can you please describe where it fails?  Is it failing on other events
> because the tpebs event is a leader of the group?  I think you wanted
> to avoid having it in the leader position.  If we can skip any actual
> operations (open/close/enable/disable/read) for the tpebs events, then
> it could be fine..

It does not fail with the code in this patch set. But if I make it return directly 
from tpebs_start() in evsel__open_cpu(), it will cause segfault. The segfault is
caused by store_evsel_id(). I could add another early return from store_evsel_id()
if the evsel->retire_lat is true. 

After this change, it will eventually run and give me <not counted> results 
like below:

	<not counted> event1
	<not counted> event2
	xx event1:R

In a different case, it may seem to work (xxxxxx stands for some valid value):

	xxxxxxx event1
	xxxxxxx event2
	xxxxxxx event3
	xx event1:R

In the first case, the event1, event2 and event1:R are scheduled in the same 
group. On the other hand, in the second case, event1, event2 and event3 are
in one group, while event1:R is in a different group. 

Based on these two different type of results, I believe the failure happens in
the group that include a :R event. I've added the change to arch_evlist__cmp()
so that a :R event would not be a leader of the group. 

I think I've made evsel__open_cpu() return before it create fd and make
store_evsel_id() not to read and store fd. I'm not sure where I'm missing. Please 
let me know if you have any suggestions.

Thanks,
Weilin

> 
> Thanks,
> Namhyung
> 
> >
> > Because of above, the retire_latency event will still run in counting mode.
> >
> > Other changes in v10:
> >  - Change perf record fork from perf stat to evsel. All the major operations
> >  like tpebs start, stop, read_evsel should directly work through evsel.
> >  - Make intel-tpebs x86_64 only. This change is cross-compiled to arm64.
> >  - Put tpebs code to intel-tepbs and simplify intel-tpebs APIs to minimum
> number
> > of functions and variables. Update funtion name and variable names to use
> > consistent prefix. Also improve error handling.
> >  - Integrate code patch from Ian for the :R parser.
> >  - Update MTL metrics to TMA 4.8.
> >
> > V9: https://lore.kernel.org/all/20240521173952.3397644-1-
> weilin.wang@intel.com/
> >
> > Changes in v9:
> >  - Update the retire_latency result print and metric calculation method.
> Plugin
> > the value to evsel so that no special code is required.
> >  - Update --control:fifo to use pipe instead of named pipe.
> >  - Add test for TPEBS counting mode.
> >  - Update Document with more details.
> >
> > Changes in v8:
> > - In this revision, the code is updated to base on Ian's patch on R modifier
> > parser https://lore.kernel.org/lkml/20240428053616.1125891-3-
> irogers@google.com/
> > After this change, there is no special code required for R modifier in
> > metricgroup.c and metricgroup.h files.
> >
> > Caveat of this change:
> >   Ideally, we will need to add special handling to skip counting events with R
> > modifier in evsel. Currently, this is not implemented so the event with :R will
> > be both counted and sampled. Usually, in a metric formula that uses
> retire_latency,
> > it would already require to count the event. As a result, we will endup count
> the
> > same event twice. This should be able to be handled properly when we
> finalize our
> > design on evsel R modifier support.
> >
> > - Move TPEBS specific code out from main perf stat code to separate files in
> > util/intel-tpebs.c and util/intel-tpebs.h. [Namhyung]
> > - Use --control:fifo to ack perf stat from forked perf record instead of
> sleep(2) [Namhyung]
> > - Add introductions about TPEBS and R modifier in Documents. [Namhyung]
> >
> >
> > Changes in v7:
> > - Update code and comments for better code quality [Namhyung]
> > - Add a separate commit for perf data [Namhyung]
> > - Update retire latency print function to improve alignment [Namhyung]
> >
> > Changes in v6:
> > - Update code and add comments for better code quality [Namhyung]
> > - Remove the added fd var and directly pass the opened fd to data.file.fd
> [Namhyung]
> > - Add kill() to stop perf record when perf stat exists early [Namhyung]
> > - Add command opt check to ensure only start perf record when -a/-C given
> [Namhyung]
> > - Squash commits [Namhyung]
> >
> > Changes in v5:
> > - Update code and add comments for better code quality [Ian]
> >
> > Changes in v4:
> > - Remove uncessary debug print and update code and comments for better
> > readability and quality [Namhyung]
> > - Update mtl metric json file with consistent TmaL1 and TopdownL1
> metricgroup
> >
> > Changes in v3:
> > - Remove ':' when event name has '@' [Ian]
> > - Use 'R' as the modifier instead of "retire_latency" [Ian]
> >
> > Changes in v2:
> > - Add MTL metric file
> > - Add more descriptions and example to the patch [Arnaldo]
> >
> > Here is an example of running perf stat to collect a metric that uses
> > retire_latency value of event MEM_INST_RETIRED.STLB_HIT_STORES on a
> MTL system.
> >
> > In this simple example, there is no MEM_INST_RETIRED.STLB_HIT_STORES
> sample.
> > Therefore, the MEM_INST_RETIRED.STLB_HIT_STORES:p count and
> retire_latency value
> > are all 0.
> >
> > ./perf stat -M tma_dtlb_store -a -- sleep 1
> >
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.000 MB - ]
> >
> >  Performance counter stats for 'system wide':
> >
> >        181,047,168      cpu_core/TOPDOWN.SLOTS/          #      0.6 %
> tma_dtlb_store
> >          3,195,608      cpu_core/topdown-retiring/
> >         40,156,649      cpu_core/topdown-mem-bound/
> >          3,550,925      cpu_core/topdown-bad-spec/
> >        117,571,818      cpu_core/topdown-fe-bound/
> >         57,118,087      cpu_core/topdown-be-bound/
> >             69,179      cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/
> >              4,582      cpu_core/MEM_INST_RETIRED.STLB_HIT_STORES/
> >         30,183,104      cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/
> >         30,556,790      cpu_core/CPU_CLK_UNHALTED.THREAD/
> >            168,486      cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/
> >               0.00 MEM_INST_RETIRED.STLB_HIT_STORES:p       0        0
> >
> >        1.003105924 seconds time elapsed
> >
> > v1:
> > TPEBS is one of the features provided by the next generation of Intel PMU.
> > Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set
> Extensions
> > Programming Reference" [1] for more details about this feature.
> >
> > This set of patches supports TPEBS in counting mode. The code works in the
> > following way: it forks a perf record process from perf stat when
> retire_latency
> > of one or more events are used in a metric formula. Perf stat would send a
> > SIGTERM signal to perf record before it needs the retire latency value for
> > metric calculation. Perf stat will then process sample data to extract the
> > retire latency data for metric calculations. Currently, the code uses the
> > arithmetic average of retire latency values.
> >
> > [1] https://www.intel.com/content/www/us/en/content-
> details/812218/intel-architecture-instruction-set-extensions-programming-
> reference.html?wapkw=future%20features
> >
> >
> > Ian Rogers (1):
> >   perf parse-events: Add a retirement latency modifier
> >
> > Weilin Wang (7):
> >   perf data: Allow to use given fd in data->file.fd
> >   perf stat: Fork and launch perf record when perf stat needs to get
> >     retire latency value for a metric.
> >   perf stat: Plugin retire_lat value from sampled data to evsel
> >   perf vendor events intel: Add MTL metric json files
> >   perf stat: Add command line option for enabling tpebs recording
> >   perf Document: Add TPEBS to Documents
> >   perf test: Add test for Intel TPEBS counting mode
> >
> >  tools/perf/Documentation/perf-list.txt        |    1 +
> >  tools/perf/Documentation/topdown.txt          |   30 +
> >  tools/perf/arch/x86/util/evlist.c             |    6 +
> >  tools/perf/builtin-stat.c                     |    8 +
> >  .../arch/x86/meteorlake/metricgroups.json     |  140 +
> >  .../arch/x86/meteorlake/mtl-metrics.json      | 2595 +++++++++++++++++
> >  .../perf/tests/shell/test_stat_intel_tpebs.sh |   19 +
> >  tools/perf/util/Build                         |    1 +
> >  tools/perf/util/data.c                        |    7 +-
> >  tools/perf/util/evsel.c                       |   26 +
> >  tools/perf/util/evsel.h                       |    6 +
> >  tools/perf/util/intel-tpebs.c                 |  397 +++
> >  tools/perf/util/intel-tpebs.h                 |   48 +
> >  tools/perf/util/parse-events.c                |    2 +
> >  tools/perf/util/parse-events.h                |    1 +
> >  tools/perf/util/parse-events.l                |    3 +-
> >  16 files changed, 3288 insertions(+), 2 deletions(-)
> >  create mode 100644 tools/perf/pmu-
> events/arch/x86/meteorlake/metricgroups.json
> >  create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/mtl-
> metrics.json
> >  create mode 100755 tools/perf/tests/shell/test_stat_intel_tpebs.sh
> >  create mode 100644 tools/perf/util/intel-tpebs.c
> >  create mode 100644 tools/perf/util/intel-tpebs.h
> >
> > --
> > 2.43.0
> >
Re: [RFC PATCH v10 0/8] TPEBS counting mode support
Posted by Namhyung Kim 1 year, 6 months ago
On Fri, May 31, 2024 at 12:00 AM Wang, Weilin <weilin.wang@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Namhyung Kim <namhyung@kernel.org>
> > Sent: Thursday, May 30, 2024 11:37 PM
> > To: Wang, Weilin <weilin.wang@intel.com>
> > Cc: Ian Rogers <irogers@google.com>; Arnaldo Carvalho de Melo
> > <acme@kernel.org>; Peter Zijlstra <peterz@infradead.org>; Ingo Molnar
> > <mingo@redhat.com>; Alexander Shishkin
> > <alexander.shishkin@linux.intel.com>; Jiri Olsa <jolsa@kernel.org>; Hunter,
> > Adrian <adrian.hunter@intel.com>; Kan Liang <kan.liang@linux.intel.com>;
> > linux-perf-users@vger.kernel.org; linux-kernel@vger.kernel.org; Taylor, Perry
> > <perry.taylor@intel.com>; Alt, Samantha <samantha.alt@intel.com>; Biggers,
> > Caleb <caleb.biggers@intel.com>
> > Subject: Re: [RFC PATCH v10 0/8] TPEBS counting mode support
> >
> > Hello,
> >
> > On Tue, May 28, 2024 at 11:43 PM <weilin.wang@intel.com> wrote:
> > >
> > > From: Weilin Wang <weilin.wang@intel.com>
> > >
> > > I have tried not to count retire_latency events but did not succeed.
> > > In particular, I tried the following methods:
> > >  - Convert retire_latency event to dummy event in event parser.
> > >  - Early bail out in evsel__open_cpu() and store_evsel_ids().
> > >
> > > The first method fails and causes non-retire_latency events with the same
> > event
> > > name return 0 count.
> > >
> > > The second method fails and causes all the events in the same group
> > returning
> > > "<not counted>" results.
> >
> > Can you please describe where it fails?  Is it failing on other events
> > because the tpebs event is a leader of the group?  I think you wanted
> > to avoid having it in the leader position.  If we can skip any actual
> > operations (open/close/enable/disable/read) for the tpebs events, then
> > it could be fine..
>
> It does not fail with the code in this patch set. But if I make it return directly
> from tpebs_start() in evsel__open_cpu(), it will cause segfault. The segfault is
> caused by store_evsel_id(). I could add another early return from store_evsel_id()
> if the evsel->retire_lat is true.

Yeah, I think event:R should not go to kernel from perf stat and you need to
handle that in the tools.

>
> After this change, it will eventually run and give me <not counted> results
> like below:
>
>         <not counted> event1
>         <not counted> event2
>         xx event1:R
>
> In a different case, it may seem to work (xxxxxx stands for some valid value):
>
>         xxxxxxx event1
>         xxxxxxx event2
>         xxxxxxx event3
>         xx event1:R
>
> In the first case, the event1, event2 and event1:R are scheduled in the same
> group. On the other hand, in the second case, event1, event2 and event3 are
> in one group, while event1:R is in a different group.

If you don't open event1:R then the kernel only sees event1 and event2.

>
> Based on these two different type of results, I believe the failure happens in
> the group that include a :R event. I've added the change to arch_evlist__cmp()
> so that a :R event would not be a leader of the group.
>
> I think I've made evsel__open_cpu() return before it create fd and make
> store_evsel_id() not to read and store fd. I'm not sure where I'm missing. Please
> let me know if you have any suggestions.

As I said, please don't open event1:R (for perf stat) and let tpebs_stop() set
the value using the data from perf record in background.

Thanks,
Namhyung
RE: [RFC PATCH v10 0/8] TPEBS counting mode support
Posted by Wang, Weilin 1 year, 6 months ago

> -----Original Message-----
> From: Namhyung Kim <namhyung@kernel.org>
> Sent: Friday, May 31, 2024 2:30 PM
> To: Wang, Weilin <weilin.wang@intel.com>
> Cc: Ian Rogers <irogers@google.com>; Arnaldo Carvalho de Melo
> <acme@kernel.org>; Peter Zijlstra <peterz@infradead.org>; Ingo Molnar
> <mingo@redhat.com>; Alexander Shishkin
> <alexander.shishkin@linux.intel.com>; Jiri Olsa <jolsa@kernel.org>; Hunter,
> Adrian <adrian.hunter@intel.com>; Kan Liang <kan.liang@linux.intel.com>;
> linux-perf-users@vger.kernel.org; linux-kernel@vger.kernel.org; Taylor, Perry
> <perry.taylor@intel.com>; Alt, Samantha <samantha.alt@intel.com>; Biggers,
> Caleb <caleb.biggers@intel.com>
> Subject: Re: [RFC PATCH v10 0/8] TPEBS counting mode support
> 
> On Fri, May 31, 2024 at 12:00 AM Wang, Weilin <weilin.wang@intel.com>
> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Namhyung Kim <namhyung@kernel.org>
> > > Sent: Thursday, May 30, 2024 11:37 PM
> > > To: Wang, Weilin <weilin.wang@intel.com>
> > > Cc: Ian Rogers <irogers@google.com>; Arnaldo Carvalho de Melo
> > > <acme@kernel.org>; Peter Zijlstra <peterz@infradead.org>; Ingo Molnar
> > > <mingo@redhat.com>; Alexander Shishkin
> > > <alexander.shishkin@linux.intel.com>; Jiri Olsa <jolsa@kernel.org>; Hunter,
> > > Adrian <adrian.hunter@intel.com>; Kan Liang <kan.liang@linux.intel.com>;
> > > linux-perf-users@vger.kernel.org; linux-kernel@vger.kernel.org; Taylor,
> Perry
> > > <perry.taylor@intel.com>; Alt, Samantha <samantha.alt@intel.com>;
> Biggers,
> > > Caleb <caleb.biggers@intel.com>
> > > Subject: Re: [RFC PATCH v10 0/8] TPEBS counting mode support
> > >
> > > Hello,
> > >
> > > On Tue, May 28, 2024 at 11:43 PM <weilin.wang@intel.com> wrote:
> > > >
> > > > From: Weilin Wang <weilin.wang@intel.com>
> > > >
> > > > I have tried not to count retire_latency events but did not succeed.
> > > > In particular, I tried the following methods:
> > > >  - Convert retire_latency event to dummy event in event parser.
> > > >  - Early bail out in evsel__open_cpu() and store_evsel_ids().
> > > >
> > > > The first method fails and causes non-retire_latency events with the same
> > > event
> > > > name return 0 count.
> > > >
> > > > The second method fails and causes all the events in the same group
> > > returning
> > > > "<not counted>" results.
> > >
> > > Can you please describe where it fails?  Is it failing on other events
> > > because the tpebs event is a leader of the group?  I think you wanted
> > > to avoid having it in the leader position.  If we can skip any actual
> > > operations (open/close/enable/disable/read) for the tpebs events, then
> > > it could be fine..
> >
> > It does not fail with the code in this patch set. But if I make it return directly
> > from tpebs_start() in evsel__open_cpu(), it will cause segfault. The segfault is
> > caused by store_evsel_id(). I could add another early return from
> store_evsel_id()
> > if the evsel->retire_lat is true.
> 
> Yeah, I think event:R should not go to kernel from perf stat and you need to
> handle that in the tools.
> 
> >
> > After this change, it will eventually run and give me <not counted> results
> > like below:
> >
> >         <not counted> event1
> >         <not counted> event2
> >         xx event1:R
> >
> > In a different case, it may seem to work (xxxxxx stands for some valid value):
> >
> >         xxxxxxx event1
> >         xxxxxxx event2
> >         xxxxxxx event3
> >         xx event1:R
> >
> > In the first case, the event1, event2 and event1:R are scheduled in the same
> > group. On the other hand, in the second case, event1, event2 and event3
> are
> > in one group, while event1:R is in a different group.
> 
> If you don't open event1:R then the kernel only sees event1 and event2.
> 
> >
> > Based on these two different type of results, I believe the failure happens in
> > the group that include a :R event. I've added the change to
> arch_evlist__cmp()
> > so that a :R event would not be a leader of the group.
> >
> > I think I've made evsel__open_cpu() return before it create fd and make
> > store_evsel_id() not to read and store fd. I'm not sure where I'm missing.
> Please
> > let me know if you have any suggestions.
> 
> As I said, please don't open event1:R (for perf stat) and let tpebs_stop() set
> the value using the data from perf record in background.

I think this is exactly what I'm trying to achieve, not open event1:R for perf stat. 
The problem is that I'm not sure how to do it properly in the code. Please give 
me some suggestion here. 

Thanks,
Weilin

> 
> Thanks,
> Namhyung
Re: [RFC PATCH v10 0/8] TPEBS counting mode support
Posted by Namhyung Kim 1 year, 6 months ago
On Fri, May 31, 2024 at 11:03:25PM +0000, Wang, Weilin wrote:
> 
> > -----Original Message-----
> > From: Namhyung Kim <namhyung@kernel.org>
> > As I said, please don't open event1:R (for perf stat) and let tpebs_stop() set
> > the value using the data from perf record in background.
> 
> I think this is exactly what I'm trying to achieve, not open event1:R for perf stat. 
> The problem is that I'm not sure how to do it properly in the code. Please give 
> me some suggestion here. 

Ok, I think the problem is in the read code.  It requires the number of
entries and the data size to match with what it calculates from the
member events.  It should not count TPEBS events as we don't want to
open them.

Something like below might work (on top of your series).  Probably
libperf should be aware of this..

Thanks,
Namhyung

---8<---

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index ac7a98ff6b19..7913db4a99e0 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1559,6 +1559,60 @@ static void evsel__set_count(struct evsel *counter, int cpu_map_idx, int thread,
 	perf_counts__set_loaded(counter->counts, cpu_map_idx, thread, true);
 }
 
+static bool evsel__group_has_tpebs(struct evsel *leader)
+{
+	struct evsel *evsel;
+
+	for_each_group_evsel(evsel, leader) {
+		if (evsel__is_retire_lat(evsel))
+			return true;
+	}
+	return false;
+}
+
+static u64 evsel__group_read_nr_members(struct evsel *leader)
+{
+	u64 nr = leader->core.nr_members;
+	struct evsel *evsel;
+
+	for_each_group_evsel(evsel, leader) {
+		if (evsel__is_retire_lat(evsel))
+			nr--;
+	}
+	return nr;
+}
+
+static u64 evsel__group_read_size(struct evsel *leader)
+{
+	u64 read_format = leader->core.attr.read_format;
+	int entry = sizeof(u64); /* value */
+	int size = 0;
+	int nr = 1;
+
+	if (!evsel__group_has_tpebs(leader))
+		return perf_evsel__read_size(&leader->core);
+
+	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
+		size += sizeof(u64);
+
+	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
+		size += sizeof(u64);
+
+	if (read_format & PERF_FORMAT_ID)
+		entry += sizeof(u64);
+
+	if (read_format & PERF_FORMAT_LOST)
+		entry += sizeof(u64);
+
+	if (read_format & PERF_FORMAT_GROUP) {
+		nr = evsel__group_read_nr_members(leader);
+		size += sizeof(u64);
+	}
+
+	size += entry * nr;
+	return size;
+}
+
 static int evsel__process_group_data(struct evsel *leader, int cpu_map_idx, int thread, u64 *data)
 {
 	u64 read_format = leader->core.attr.read_format;
@@ -1567,7 +1621,7 @@ static int evsel__process_group_data(struct evsel *leader, int cpu_map_idx, int
 
 	nr = *data++;
 
-	if (nr != (u64) leader->core.nr_members)
+	if (nr != evsel__group_read_nr_members(leader))
 		return -EINVAL;
 
 	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
@@ -1597,7 +1651,7 @@ static int evsel__read_group(struct evsel *leader, int cpu_map_idx, int thread)
 {
 	struct perf_stat_evsel *ps = leader->stats;
 	u64 read_format = leader->core.attr.read_format;
-	int size = perf_evsel__read_size(&leader->core);
+	int size = evsel__group_read_size(leader);
 	u64 *data = ps->group_data;
 
 	if (!(read_format & PERF_FORMAT_ID))
@@ -2210,8 +2264,7 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
 
 	if (evsel__is_retire_lat(evsel)) {
 		err = tpebs_start(evsel->evlist, cpus);
-		if (err)
-			return err;
+		return err;
 	}
 
 	err = __evsel__prepare_open(evsel, cpus, threads);
diff --git a/tools/perf/util/intel-tpebs.c b/tools/perf/util/intel-tpebs.c
index d099fc8080e1..a3857e88af96 100644
--- a/tools/perf/util/intel-tpebs.c
+++ b/tools/perf/util/intel-tpebs.c
@@ -354,10 +354,11 @@ int tpebs_set_evsel(struct evsel *evsel, int cpu_map_idx, int thread)
 	 */
 	if (cpu_map_idx == 0 && thread == 0) {
 	/* Lost precision when casting from double to __u64. Any improvement? */
-		val = t->val;
+		val = t->val * 1000;
 	} else
 		val = 0;
 
+	evsel->scale = 1e-3;
 	count->val = val;
 	return 0;
 }
RE: [RFC PATCH v10 0/8] TPEBS counting mode support
Posted by Wang, Weilin 1 year, 6 months ago

> -----Original Message-----
> From: Namhyung Kim <namhyung@kernel.org>
> Sent: Sunday, June 2, 2024 2:18 PM
> To: Wang, Weilin <weilin.wang@intel.com>
> Cc: Ian Rogers <irogers@google.com>; Arnaldo Carvalho de Melo
> <acme@kernel.org>; Peter Zijlstra <peterz@infradead.org>; Ingo Molnar
> <mingo@redhat.com>; Alexander Shishkin
> <alexander.shishkin@linux.intel.com>; Jiri Olsa <jolsa@kernel.org>; Hunter,
> Adrian <adrian.hunter@intel.com>; Kan Liang <kan.liang@linux.intel.com>;
> linux-perf-users@vger.kernel.org; linux-kernel@vger.kernel.org; Taylor, Perry
> <perry.taylor@intel.com>; Alt, Samantha <samantha.alt@intel.com>; Biggers,
> Caleb <caleb.biggers@intel.com>
> Subject: Re: [RFC PATCH v10 0/8] TPEBS counting mode support
> 
> On Fri, May 31, 2024 at 11:03:25PM +0000, Wang, Weilin wrote:
> >
> > > -----Original Message-----
> > > From: Namhyung Kim <namhyung@kernel.org>
> > > As I said, please don't open event1:R (for perf stat) and let tpebs_stop() set
> > > the value using the data from perf record in background.
> >
> > I think this is exactly what I'm trying to achieve, not open event1:R for perf
> stat.
> > The problem is that I'm not sure how to do it properly in the code. Please
> give
> > me some suggestion here.
> 
> Ok, I think the problem is in the read code.  It requires the number of
> entries and the data size to match with what it calculates from the
> member events.  It should not count TPEBS events as we don't want to
> open them.
> 
> Something like below might work (on top of your series).  Probably
> libperf should be aware of this..
> 
Thanks Namhyung!  I will integrate this patch and test it out. 

Thanks,
Weilin

> Thanks,
> Namhyung
> 
> ---8<---
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index ac7a98ff6b19..7913db4a99e0 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1559,6 +1559,60 @@ static void evsel__set_count(struct evsel
> *counter, int cpu_map_idx, int thread,
>  	perf_counts__set_loaded(counter->counts, cpu_map_idx, thread,
> true);
>  }
> 
> +static bool evsel__group_has_tpebs(struct evsel *leader)
> +{
> +	struct evsel *evsel;
> +
> +	for_each_group_evsel(evsel, leader) {
> +		if (evsel__is_retire_lat(evsel))
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static u64 evsel__group_read_nr_members(struct evsel *leader)
> +{
> +	u64 nr = leader->core.nr_members;
> +	struct evsel *evsel;
> +
> +	for_each_group_evsel(evsel, leader) {
> +		if (evsel__is_retire_lat(evsel))
> +			nr--;
> +	}
> +	return nr;
> +}
> +
> +static u64 evsel__group_read_size(struct evsel *leader)
> +{
> +	u64 read_format = leader->core.attr.read_format;
> +	int entry = sizeof(u64); /* value */
> +	int size = 0;
> +	int nr = 1;
> +
> +	if (!evsel__group_has_tpebs(leader))
> +		return perf_evsel__read_size(&leader->core);
> +
> +	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> +		size += sizeof(u64);
> +
> +	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> +		size += sizeof(u64);
> +
> +	if (read_format & PERF_FORMAT_ID)
> +		entry += sizeof(u64);
> +
> +	if (read_format & PERF_FORMAT_LOST)
> +		entry += sizeof(u64);
> +
> +	if (read_format & PERF_FORMAT_GROUP) {
> +		nr = evsel__group_read_nr_members(leader);
> +		size += sizeof(u64);
> +	}
> +
> +	size += entry * nr;
> +	return size;
> +}
> +
>  static int evsel__process_group_data(struct evsel *leader, int cpu_map_idx,
> int thread, u64 *data)
>  {
>  	u64 read_format = leader->core.attr.read_format;
> @@ -1567,7 +1621,7 @@ static int evsel__process_group_data(struct evsel
> *leader, int cpu_map_idx, int
> 
>  	nr = *data++;
> 
> -	if (nr != (u64) leader->core.nr_members)
> +	if (nr != evsel__group_read_nr_members(leader))
>  		return -EINVAL;
> 
>  	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> @@ -1597,7 +1651,7 @@ static int evsel__read_group(struct evsel *leader,
> int cpu_map_idx, int thread)
>  {
>  	struct perf_stat_evsel *ps = leader->stats;
>  	u64 read_format = leader->core.attr.read_format;
> -	int size = perf_evsel__read_size(&leader->core);
> +	int size = evsel__group_read_size(leader);
>  	u64 *data = ps->group_data;
> 
>  	if (!(read_format & PERF_FORMAT_ID))
> @@ -2210,8 +2264,7 @@ static int evsel__open_cpu(struct evsel *evsel,
> struct perf_cpu_map *cpus,
> 
>  	if (evsel__is_retire_lat(evsel)) {
>  		err = tpebs_start(evsel->evlist, cpus);
> -		if (err)
> -			return err;
> +		return err;
>  	}
> 
>  	err = __evsel__prepare_open(evsel, cpus, threads);
> diff --git a/tools/perf/util/intel-tpebs.c b/tools/perf/util/intel-tpebs.c
> index d099fc8080e1..a3857e88af96 100644
> --- a/tools/perf/util/intel-tpebs.c
> +++ b/tools/perf/util/intel-tpebs.c
> @@ -354,10 +354,11 @@ int tpebs_set_evsel(struct evsel *evsel, int
> cpu_map_idx, int thread)
>  	 */
>  	if (cpu_map_idx == 0 && thread == 0) {
>  	/* Lost precision when casting from double to __u64. Any
> improvement? */
> -		val = t->val;
> +		val = t->val * 1000;
>  	} else
>  		val = 0;
> 
> +	evsel->scale = 1e-3;
>  	count->val = val;
>  	return 0;
>  }