[RFC PATCH v18 0/8] TPEBS counting mode support

weilin.wang@intel.com posted 8 patches 1 year, 4 months ago
tools/perf/Documentation/perf-list.txt        |    1 +
tools/perf/Documentation/perf-stat.txt        |    8 +
tools/perf/Documentation/topdown.txt          |   30 +
tools/perf/arch/x86/util/evlist.c             |    6 +
tools/perf/builtin-stat.c                     |    8 +
.../arch/x86/meteorlake/metricgroups.json     |  142 +
.../arch/x86/meteorlake/mtl-metrics.json      | 2535 +++++++++++++++++
.../perf/tests/shell/test_stat_intel_tpebs.sh |   19 +
tools/perf/util/Build                         |    1 +
tools/perf/util/data.c                        |    7 +-
tools/perf/util/evlist.c                      |    2 +
tools/perf/util/evsel.c                       |   81 +-
tools/perf/util/evsel.h                       |    6 +
tools/perf/util/intel-tpebs.c                 |  431 +++
tools/perf/util/intel-tpebs.h                 |   35 +
tools/perf/util/parse-events.c                |    2 +
tools/perf/util/parse-events.h                |    1 +
tools/perf/util/parse-events.l                |    3 +-
18 files changed, 3314 insertions(+), 4 deletions(-)
create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json
create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json
create mode 100755 tools/perf/tests/shell/test_stat_intel_tpebs.sh
create mode 100644 tools/perf/util/intel-tpebs.c
create mode 100644 tools/perf/util/intel-tpebs.h
[RFC PATCH v18 0/8] TPEBS counting mode support
Posted by weilin.wang@intel.com 1 year, 4 months ago
From: Weilin Wang <weilin.wang@intel.com>

Change in v18:
 - Update to exit 2 in TPEBS shell test when not on Intel platform.
 - Several updates to use EVLIST_CTL_CMD_ENABLE_TAG, EVLIST_CTL_CMD_ACK_TAG, and
 etc.

Changes in v17:
 - Add a poll on control fifo ack_fd to ensure program returns successfully when
 perf record failed for any reason.
 - Add a check in the tpebs test to only run on Intel platforms.

Changes in v16:
 - Update tpebs bash test code and variable name.
 - Add a check to ensure only use "-C" option when cpumap is not "-1" when
 forking "perf record".

Changes in v15:
 - Revert changes added for python import test failure in v14 because the code
 works correctly with the new python build code.
 - Update the command line option to --record-tpebs.

Changes in v14:
 - Fix the python import test failure. We cannot support PYTHON_PERF because it
 will trigger a chain of dependency issues if we add intel-tpebs.c to it. So,
 only enable tpebs functions in evsel and evlist when PYTHON_PERF is not
 defined.
 - Fix shellcheck warning for the tpebs test.

Changes in v13:
 - Add document for the command line option and fix build error in non-x86_64.
 - Update example with non-zero retire_latency value when tpebs recording is
 enabled.
 - Add tpebs_stop() back to tpebs_set_evsel() because in hybrid platform, when
 the forked perf record is not killed, the reader thread does not get any
 sampled data from the PIPE. As a result, tpebs_set_evesel() will always return
 zero on retire_latency values. This does not happen on my test GNR machine.
 Since -I is not supported yet, I think we should add tpebs_stop() to ensure
 correctness for now. More investigation is required here when we work on
 supporting -I mode.
 - Rebase code on top of perf-tools-next.

Changes in v12:
 - Update MTL metric JSON file to include E-Core TMA3.6 changes.
 - Update comments and code for better code quality. Keep tpebs_start() and
 tpebs_delete() at evsel level for now and add comments on these functions with
 more details about potential future changes. Remove tpebs_stop() call from
 tpebs_set_evsel(). Simplify the tpebs_start() and tpebs_stop()/tpebs_delete()
 interfaces. Also fixed the bugs on not freed "new" pointer and the incorrect
 variable value assignment to val instead of counter->val.

Changes in v11:
 - Make retire_latency evsels not opened for counting. This works correctly now
 with the code Namhyung suggested that adjusting group read size when
 retire_latency evsels included in the group.
 - Update retire_latency value assignment using rint() to reduce precision loss
 while keeping code generic.
 - Fix the build error caused by not used variable in the test.

Other changes in v10:
 - Change perf record fork from perf stat to evsel. All the major operations
 like tpebs start, stop, read_evsel should directly work through evsel.
 - Make intel-tpebs x86_64 only. This change is cross-compiled to arm64.
 - Put tpebs code to intel-tepbs and simplify intel-tpebs APIs to minimum number
of functions and variables. Update funtion name and variable names to use
consistent prefix. Also improve error handling.
 - Integrate code patch from Ian for the :R parser.
 - Update MTL metrics to TMA 4.8.

V9: https://lore.kernel.org/all/20240521173952.3397644-1-weilin.wang@intel.com/

Changes in v9:
 - Update the retire_latency result print and metric calculation method. Plugin
the value to evsel so that no special code is required.
 - Update --control:fifo to use pipe instead of named pipe.
 - Add test for TPEBS counting mode.
 - Update Document with more details.

Changes in v8:
 - In this revision, the code is updated to base on Ian's patch on R modifier
parser https://lore.kernel.org/lkml/20240428053616.1125891-3-irogers@google.com/
After this change, there is no special code required for R modifier in
metricgroup.c and metricgroup.h files.

Caveat of this change:
  Ideally, we will need to add special handling to skip counting events with R
modifier in evsel. Currently, this is not implemented so the event with :R will
be both counted and sampled. Usually, in a metric formula that uses retire_latency,
it would already require to count the event. As a result, we will endup count the
same event twice. This should be able to be handled properly when we finalize our
design on evsel R modifier support.

 - Move TPEBS specific code out from main perf stat code to separate files in
util/intel-tpebs.c and util/intel-tpebs.h. [Namhyung]
 - Use --control:fifo to ack perf stat from forked perf record instead of sleep(2) [Namhyung]
 - Add introductions about TPEBS and R modifier in Documents. [Namhyung]


Changes in v7:
 - Update code and comments for better code quality [Namhyung]
 - Add a separate commit for perf data [Namhyung]
 - Update retire latency print function to improve alignment [Namhyung]

Changes in v6:
 - Update code and add comments for better code quality [Namhyung]
 - Remove the added fd var and directly pass the opened fd to data.file.fd [Namhyung]
 - Add kill() to stop perf record when perf stat exists early [Namhyung]
 - Add command opt check to ensure only start perf record when -a/-C given [Namhyung]
 - Squash commits [Namhyung]

Changes in v5:
 - Update code and add comments for better code quality [Ian]

Changes in v4:
 - Remove uncessary debug print and update code and comments for better
readability and quality [Namhyung]
 - Update mtl metric json file with consistent TmaL1 and TopdownL1 metricgroup

Changes in v3:
 - Remove ':' when event name has '@' [Ian]
 - Use 'R' as the modifier instead of "retire_latency" [Ian]

Changes in v2:
 - Add MTL metric file
 - Add more descriptions and example to the patch [Arnaldo]

Here is an example of running perf stat to collect a metric that uses
retire_latency value of event MEM_INST_RETIRED.STLB_HIT_STORES on a MTL system.

In this simple example, there is no MEM_INST_RETIRED.STLB_HIT_STORES sample.
Therefore, the MEM_INST_RETIRED.STLB_HIT_STORES:p count and retire_latency value
are all 0.

./perf stat -M tma_dtlb_store -a -- sleep 1

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB - ]

 Performance counter stats for 'system wide':

       181,047,168      cpu_core/TOPDOWN.SLOTS/          #      0.6 %  tma_dtlb_store
         3,195,608      cpu_core/topdown-retiring/
        40,156,649      cpu_core/topdown-mem-bound/
         3,550,925      cpu_core/topdown-bad-spec/
       117,571,818      cpu_core/topdown-fe-bound/
        57,118,087      cpu_core/topdown-be-bound/
            69,179      cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/
             4,582      cpu_core/MEM_INST_RETIRED.STLB_HIT_STORES/
        30,183,104      cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/
        30,556,790      cpu_core/CPU_CLK_UNHALTED.THREAD/
           168,486      cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/
              0.00 MEM_INST_RETIRED.STLB_HIT_STORES:p       0        0

       1.003105924 seconds time elapsed

v1:
TPEBS is one of the features provided by the next generation of Intel PMU.
Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set Extensions
Programming Reference" [1] for more details about this feature.

This set of patches supports TPEBS in counting mode. The code works in the
following way: it forks a perf record process from perf stat when retire_latency
of one or more events are used in a metric formula. Perf stat would send a
SIGTERM signal to perf record before it needs the retire latency value for
metric calculation. Perf stat will then process sample data to extract the
retire latency data for metric calculations. Currently, the code uses the
arithmetic average of retire latency values.

[1] https://www.intel.com/content/www/us/en/content-details/812218/intel-architecture-instruction-set-extensions-programming-reference.html?wapkw=future%20features




Ian Rogers (1):
  perf parse-events: Add a retirement latency modifier

Weilin Wang (7):
  perf data: Allow to use given fd in data->file.fd
  perf stat: Fork and launch perf record when perf stat needs to get
    retire latency value for a metric.
  perf stat: Plugin retire_lat value from sampled data to evsel
  perf vendor events intel: Add MTL metric json files
  perf stat: Add command line option for enabling tpebs recording
  perf Document: Add TPEBS to Documents
  perf test: Add test for Intel TPEBS counting mode

 tools/perf/Documentation/perf-list.txt        |    1 +
 tools/perf/Documentation/perf-stat.txt        |    8 +
 tools/perf/Documentation/topdown.txt          |   30 +
 tools/perf/arch/x86/util/evlist.c             |    6 +
 tools/perf/builtin-stat.c                     |    8 +
 .../arch/x86/meteorlake/metricgroups.json     |  142 +
 .../arch/x86/meteorlake/mtl-metrics.json      | 2535 +++++++++++++++++
 .../perf/tests/shell/test_stat_intel_tpebs.sh |   19 +
 tools/perf/util/Build                         |    1 +
 tools/perf/util/data.c                        |    7 +-
 tools/perf/util/evlist.c                      |    2 +
 tools/perf/util/evsel.c                       |   81 +-
 tools/perf/util/evsel.h                       |    6 +
 tools/perf/util/intel-tpebs.c                 |  431 +++
 tools/perf/util/intel-tpebs.h                 |   35 +
 tools/perf/util/parse-events.c                |    2 +
 tools/perf/util/parse-events.h                |    1 +
 tools/perf/util/parse-events.l                |    3 +-
 18 files changed, 3314 insertions(+), 4 deletions(-)
 create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json
 create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json
 create mode 100755 tools/perf/tests/shell/test_stat_intel_tpebs.sh
 create mode 100644 tools/perf/util/intel-tpebs.c
 create mode 100644 tools/perf/util/intel-tpebs.h

--
2.43.0

Re: [RFC PATCH v18 0/8] TPEBS counting mode support
Posted by Namhyung Kim 1 year, 4 months ago
Hello Weilin,

On Fri, Jul 19, 2024 at 11:21 PM <weilin.wang@intel.com> wrote:
>
> From: Weilin Wang <weilin.wang@intel.com>
>
> Change in v18:
>  - Update to exit 2 in TPEBS shell test when not on Intel platform.
>  - Several updates to use EVLIST_CTL_CMD_ENABLE_TAG, EVLIST_CTL_CMD_ACK_TAG, and
>  etc.
>
> Changes in v17:
>  - Add a poll on control fifo ack_fd to ensure program returns successfully when
>  perf record failed for any reason.
>  - Add a check in the tpebs test to only run on Intel platforms.
>
> Changes in v16:
>  - Update tpebs bash test code and variable name.
>  - Add a check to ensure only use "-C" option when cpumap is not "-1" when
>  forking "perf record".
>
> Changes in v15:
>  - Revert changes added for python import test failure in v14 because the code
>  works correctly with the new python build code.
>  - Update the command line option to --record-tpebs.
>
> Changes in v14:
>  - Fix the python import test failure. We cannot support PYTHON_PERF because it
>  will trigger a chain of dependency issues if we add intel-tpebs.c to it. So,
>  only enable tpebs functions in evsel and evlist when PYTHON_PERF is not
>  defined.
>  - Fix shellcheck warning for the tpebs test.
>
> Changes in v13:
>  - Add document for the command line option and fix build error in non-x86_64.
>  - Update example with non-zero retire_latency value when tpebs recording is
>  enabled.
>  - Add tpebs_stop() back to tpebs_set_evsel() because in hybrid platform, when
>  the forked perf record is not killed, the reader thread does not get any
>  sampled data from the PIPE. As a result, tpebs_set_evesel() will always return
>  zero on retire_latency values. This does not happen on my test GNR machine.
>  Since -I is not supported yet, I think we should add tpebs_stop() to ensure
>  correctness for now. More investigation is required here when we work on
>  supporting -I mode.
>  - Rebase code on top of perf-tools-next.
>
> Changes in v12:
>  - Update MTL metric JSON file to include E-Core TMA3.6 changes.
>  - Update comments and code for better code quality. Keep tpebs_start() and
>  tpebs_delete() at evsel level for now and add comments on these functions with
>  more details about potential future changes. Remove tpebs_stop() call from
>  tpebs_set_evsel(). Simplify the tpebs_start() and tpebs_stop()/tpebs_delete()
>  interfaces. Also fixed the bugs on not freed "new" pointer and the incorrect
>  variable value assignment to val instead of counter->val.
>
> Changes in v11:
>  - Make retire_latency evsels not opened for counting. This works correctly now
>  with the code Namhyung suggested that adjusting group read size when
>  retire_latency evsels included in the group.
>  - Update retire_latency value assignment using rint() to reduce precision loss
>  while keeping code generic.
>  - Fix the build error caused by not used variable in the test.
>
> Other changes in v10:
>  - Change perf record fork from perf stat to evsel. All the major operations
>  like tpebs start, stop, read_evsel should directly work through evsel.
>  - Make intel-tpebs x86_64 only. This change is cross-compiled to arm64.
>  - Put tpebs code to intel-tepbs and simplify intel-tpebs APIs to minimum number
> of functions and variables. Update funtion name and variable names to use
> consistent prefix. Also improve error handling.
>  - Integrate code patch from Ian for the :R parser.
>  - Update MTL metrics to TMA 4.8.
>
> V9: https://lore.kernel.org/all/20240521173952.3397644-1-weilin.wang@intel.com/
>
> Changes in v9:
>  - Update the retire_latency result print and metric calculation method. Plugin
> the value to evsel so that no special code is required.
>  - Update --control:fifo to use pipe instead of named pipe.
>  - Add test for TPEBS counting mode.
>  - Update Document with more details.
>
> Changes in v8:
>  - In this revision, the code is updated to base on Ian's patch on R modifier
> parser https://lore.kernel.org/lkml/20240428053616.1125891-3-irogers@google.com/
> After this change, there is no special code required for R modifier in
> metricgroup.c and metricgroup.h files.
>
> Caveat of this change:
>   Ideally, we will need to add special handling to skip counting events with R
> modifier in evsel. Currently, this is not implemented so the event with :R will
> be both counted and sampled. Usually, in a metric formula that uses retire_latency,
> it would already require to count the event. As a result, we will endup count the
> same event twice. This should be able to be handled properly when we finalize our
> design on evsel R modifier support.
>
>  - Move TPEBS specific code out from main perf stat code to separate files in
> util/intel-tpebs.c and util/intel-tpebs.h. [Namhyung]
>  - Use --control:fifo to ack perf stat from forked perf record instead of sleep(2) [Namhyung]
>  - Add introductions about TPEBS and R modifier in Documents. [Namhyung]
>
>
> Changes in v7:
>  - Update code and comments for better code quality [Namhyung]
>  - Add a separate commit for perf data [Namhyung]
>  - Update retire latency print function to improve alignment [Namhyung]
>
> Changes in v6:
>  - Update code and add comments for better code quality [Namhyung]
>  - Remove the added fd var and directly pass the opened fd to data.file.fd [Namhyung]
>  - Add kill() to stop perf record when perf stat exists early [Namhyung]
>  - Add command opt check to ensure only start perf record when -a/-C given [Namhyung]
>  - Squash commits [Namhyung]
>
> Changes in v5:
>  - Update code and add comments for better code quality [Ian]
>
> Changes in v4:
>  - Remove uncessary debug print and update code and comments for better
> readability and quality [Namhyung]
>  - Update mtl metric json file with consistent TmaL1 and TopdownL1 metricgroup
>
> Changes in v3:
>  - Remove ':' when event name has '@' [Ian]
>  - Use 'R' as the modifier instead of "retire_latency" [Ian]
>
> Changes in v2:
>  - Add MTL metric file
>  - Add more descriptions and example to the patch [Arnaldo]
>
> Here is an example of running perf stat to collect a metric that uses
> retire_latency value of event MEM_INST_RETIRED.STLB_HIT_STORES on a MTL system.
>
> In this simple example, there is no MEM_INST_RETIRED.STLB_HIT_STORES sample.
> Therefore, the MEM_INST_RETIRED.STLB_HIT_STORES:p count and retire_latency value
> are all 0.
>
> ./perf stat -M tma_dtlb_store -a -- sleep 1
>
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.000 MB - ]
>
>  Performance counter stats for 'system wide':
>
>        181,047,168      cpu_core/TOPDOWN.SLOTS/          #      0.6 %  tma_dtlb_store
>          3,195,608      cpu_core/topdown-retiring/
>         40,156,649      cpu_core/topdown-mem-bound/
>          3,550,925      cpu_core/topdown-bad-spec/
>        117,571,818      cpu_core/topdown-fe-bound/
>         57,118,087      cpu_core/topdown-be-bound/
>             69,179      cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/
>              4,582      cpu_core/MEM_INST_RETIRED.STLB_HIT_STORES/
>         30,183,104      cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/
>         30,556,790      cpu_core/CPU_CLK_UNHALTED.THREAD/
>            168,486      cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/
>               0.00 MEM_INST_RETIRED.STLB_HIT_STORES:p       0        0
>
>        1.003105924 seconds time elapsed
>
> v1:
> TPEBS is one of the features provided by the next generation of Intel PMU.
> Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set Extensions
> Programming Reference" [1] for more details about this feature.
>
> This set of patches supports TPEBS in counting mode. The code works in the
> following way: it forks a perf record process from perf stat when retire_latency
> of one or more events are used in a metric formula. Perf stat would send a
> SIGTERM signal to perf record before it needs the retire latency value for
> metric calculation. Perf stat will then process sample data to extract the
> retire latency data for metric calculations. Currently, the code uses the
> arithmetic average of retire latency values.
>
> [1] https://www.intel.com/content/www/us/en/content-details/812218/intel-architecture-instruction-set-extensions-programming-reference.html?wapkw=future%20features
>
>
>
>
> Ian Rogers (1):
>   perf parse-events: Add a retirement latency modifier
>
> Weilin Wang (7):
>   perf data: Allow to use given fd in data->file.fd
>   perf stat: Fork and launch perf record when perf stat needs to get
>     retire latency value for a metric.
>   perf stat: Plugin retire_lat value from sampled data to evsel
>   perf vendor events intel: Add MTL metric json files
>   perf stat: Add command line option for enabling tpebs recording
>   perf Document: Add TPEBS to Documents
>   perf test: Add test for Intel TPEBS counting mode

Thanks for your persistence!

Reviewed-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung

>
>  tools/perf/Documentation/perf-list.txt        |    1 +
>  tools/perf/Documentation/perf-stat.txt        |    8 +
>  tools/perf/Documentation/topdown.txt          |   30 +
>  tools/perf/arch/x86/util/evlist.c             |    6 +
>  tools/perf/builtin-stat.c                     |    8 +
>  .../arch/x86/meteorlake/metricgroups.json     |  142 +
>  .../arch/x86/meteorlake/mtl-metrics.json      | 2535 +++++++++++++++++
>  .../perf/tests/shell/test_stat_intel_tpebs.sh |   19 +
>  tools/perf/util/Build                         |    1 +
>  tools/perf/util/data.c                        |    7 +-
>  tools/perf/util/evlist.c                      |    2 +
>  tools/perf/util/evsel.c                       |   81 +-
>  tools/perf/util/evsel.h                       |    6 +
>  tools/perf/util/intel-tpebs.c                 |  431 +++
>  tools/perf/util/intel-tpebs.h                 |   35 +
>  tools/perf/util/parse-events.c                |    2 +
>  tools/perf/util/parse-events.h                |    1 +
>  tools/perf/util/parse-events.l                |    3 +-
>  18 files changed, 3314 insertions(+), 4 deletions(-)
>  create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json
>  create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json
>  create mode 100755 tools/perf/tests/shell/test_stat_intel_tpebs.sh
>  create mode 100644 tools/perf/util/intel-tpebs.c
>  create mode 100644 tools/perf/util/intel-tpebs.h
>
> --
> 2.43.0
>
Re: [RFC PATCH v18 0/8] TPEBS counting mode support
Posted by Ian Rogers 1 year, 4 months ago
On Mon, Jul 22, 2024 at 10:38 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello Weilin,
>
> On Fri, Jul 19, 2024 at 11:21 PM <weilin.wang@intel.com> wrote:
> >
> > From: Weilin Wang <weilin.wang@intel.com>
> >
> > Change in v18:
> >  - Update to exit 2 in TPEBS shell test when not on Intel platform.
> >  - Several updates to use EVLIST_CTL_CMD_ENABLE_TAG, EVLIST_CTL_CMD_ACK_TAG, and
> >  etc.
> >
> > Changes in v17:
> >  - Add a poll on control fifo ack_fd to ensure program returns successfully when
> >  perf record failed for any reason.
> >  - Add a check in the tpebs test to only run on Intel platforms.
> >
> > Changes in v16:
> >  - Update tpebs bash test code and variable name.
> >  - Add a check to ensure only use "-C" option when cpumap is not "-1" when
> >  forking "perf record".
> >
> > Changes in v15:
> >  - Revert changes added for python import test failure in v14 because the code
> >  works correctly with the new python build code.
> >  - Update the command line option to --record-tpebs.
> >
> > Changes in v14:
> >  - Fix the python import test failure. We cannot support PYTHON_PERF because it
> >  will trigger a chain of dependency issues if we add intel-tpebs.c to it. So,
> >  only enable tpebs functions in evsel and evlist when PYTHON_PERF is not
> >  defined.
> >  - Fix shellcheck warning for the tpebs test.
> >
> > Changes in v13:
> >  - Add document for the command line option and fix build error in non-x86_64.
> >  - Update example with non-zero retire_latency value when tpebs recording is
> >  enabled.
> >  - Add tpebs_stop() back to tpebs_set_evsel() because in hybrid platform, when
> >  the forked perf record is not killed, the reader thread does not get any
> >  sampled data from the PIPE. As a result, tpebs_set_evesel() will always return
> >  zero on retire_latency values. This does not happen on my test GNR machine.
> >  Since -I is not supported yet, I think we should add tpebs_stop() to ensure
> >  correctness for now. More investigation is required here when we work on
> >  supporting -I mode.
> >  - Rebase code on top of perf-tools-next.
> >
> > Changes in v12:
> >  - Update MTL metric JSON file to include E-Core TMA3.6 changes.
> >  - Update comments and code for better code quality. Keep tpebs_start() and
> >  tpebs_delete() at evsel level for now and add comments on these functions with
> >  more details about potential future changes. Remove tpebs_stop() call from
> >  tpebs_set_evsel(). Simplify the tpebs_start() and tpebs_stop()/tpebs_delete()
> >  interfaces. Also fixed the bugs on not freed "new" pointer and the incorrect
> >  variable value assignment to val instead of counter->val.
> >
> > Changes in v11:
> >  - Make retire_latency evsels not opened for counting. This works correctly now
> >  with the code Namhyung suggested that adjusting group read size when
> >  retire_latency evsels included in the group.
> >  - Update retire_latency value assignment using rint() to reduce precision loss
> >  while keeping code generic.
> >  - Fix the build error caused by not used variable in the test.
> >
> > Other changes in v10:
> >  - Change perf record fork from perf stat to evsel. All the major operations
> >  like tpebs start, stop, read_evsel should directly work through evsel.
> >  - Make intel-tpebs x86_64 only. This change is cross-compiled to arm64.
> >  - Put tpebs code to intel-tepbs and simplify intel-tpebs APIs to minimum number
> > of functions and variables. Update funtion name and variable names to use
> > consistent prefix. Also improve error handling.
> >  - Integrate code patch from Ian for the :R parser.
> >  - Update MTL metrics to TMA 4.8.
> >
> > V9: https://lore.kernel.org/all/20240521173952.3397644-1-weilin.wang@intel.com/
> >
> > Changes in v9:
> >  - Update the retire_latency result print and metric calculation method. Plugin
> > the value to evsel so that no special code is required.
> >  - Update --control:fifo to use pipe instead of named pipe.
> >  - Add test for TPEBS counting mode.
> >  - Update Document with more details.
> >
> > Changes in v8:
> >  - In this revision, the code is updated to base on Ian's patch on R modifier
> > parser https://lore.kernel.org/lkml/20240428053616.1125891-3-irogers@google.com/
> > After this change, there is no special code required for R modifier in
> > metricgroup.c and metricgroup.h files.
> >
> > Caveat of this change:
> >   Ideally, we will need to add special handling to skip counting events with R
> > modifier in evsel. Currently, this is not implemented so the event with :R will
> > be both counted and sampled. Usually, in a metric formula that uses retire_latency,
> > it would already require to count the event. As a result, we will endup count the
> > same event twice. This should be able to be handled properly when we finalize our
> > design on evsel R modifier support.
> >
> >  - Move TPEBS specific code out from main perf stat code to separate files in
> > util/intel-tpebs.c and util/intel-tpebs.h. [Namhyung]
> >  - Use --control:fifo to ack perf stat from forked perf record instead of sleep(2) [Namhyung]
> >  - Add introductions about TPEBS and R modifier in Documents. [Namhyung]
> >
> >
> > Changes in v7:
> >  - Update code and comments for better code quality [Namhyung]
> >  - Add a separate commit for perf data [Namhyung]
> >  - Update retire latency print function to improve alignment [Namhyung]
> >
> > Changes in v6:
> >  - Update code and add comments for better code quality [Namhyung]
> >  - Remove the added fd var and directly pass the opened fd to data.file.fd [Namhyung]
> >  - Add kill() to stop perf record when perf stat exists early [Namhyung]
> >  - Add command opt check to ensure only start perf record when -a/-C given [Namhyung]
> >  - Squash commits [Namhyung]
> >
> > Changes in v5:
> >  - Update code and add comments for better code quality [Ian]
> >
> > Changes in v4:
> >  - Remove uncessary debug print and update code and comments for better
> > readability and quality [Namhyung]
> >  - Update mtl metric json file with consistent TmaL1 and TopdownL1 metricgroup
> >
> > Changes in v3:
> >  - Remove ':' when event name has '@' [Ian]
> >  - Use 'R' as the modifier instead of "retire_latency" [Ian]
> >
> > Changes in v2:
> >  - Add MTL metric file
> >  - Add more descriptions and example to the patch [Arnaldo]
> >
> > Here is an example of running perf stat to collect a metric that uses
> > retire_latency value of event MEM_INST_RETIRED.STLB_HIT_STORES on a MTL system.
> >
> > In this simple example, there is no MEM_INST_RETIRED.STLB_HIT_STORES sample.
> > Therefore, the MEM_INST_RETIRED.STLB_HIT_STORES:p count and retire_latency value
> > are all 0.
> >
> > ./perf stat -M tma_dtlb_store -a -- sleep 1
> >
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.000 MB - ]
> >
> >  Performance counter stats for 'system wide':
> >
> >        181,047,168      cpu_core/TOPDOWN.SLOTS/          #      0.6 %  tma_dtlb_store
> >          3,195,608      cpu_core/topdown-retiring/
> >         40,156,649      cpu_core/topdown-mem-bound/
> >          3,550,925      cpu_core/topdown-bad-spec/
> >        117,571,818      cpu_core/topdown-fe-bound/
> >         57,118,087      cpu_core/topdown-be-bound/
> >             69,179      cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/
> >              4,582      cpu_core/MEM_INST_RETIRED.STLB_HIT_STORES/
> >         30,183,104      cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/
> >         30,556,790      cpu_core/CPU_CLK_UNHALTED.THREAD/
> >            168,486      cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/
> >               0.00 MEM_INST_RETIRED.STLB_HIT_STORES:p       0        0
> >
> >        1.003105924 seconds time elapsed
> >
> > v1:
> > TPEBS is one of the features provided by the next generation of Intel PMU.
> > Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set Extensions
> > Programming Reference" [1] for more details about this feature.
> >
> > This set of patches supports TPEBS in counting mode. The code works in the
> > following way: it forks a perf record process from perf stat when retire_latency
> > of one or more events are used in a metric formula. Perf stat would send a
> > SIGTERM signal to perf record before it needs the retire latency value for
> > metric calculation. Perf stat will then process sample data to extract the
> > retire latency data for metric calculations. Currently, the code uses the
> > arithmetic average of retire latency values.
> >
> > [1] https://www.intel.com/content/www/us/en/content-details/812218/intel-architecture-instruction-set-extensions-programming-reference.html?wapkw=future%20features
> >
> >
> >
> >
> > Ian Rogers (1):
> >   perf parse-events: Add a retirement latency modifier
> >
> > Weilin Wang (7):
> >   perf data: Allow to use given fd in data->file.fd
> >   perf stat: Fork and launch perf record when perf stat needs to get
> >     retire latency value for a metric.
> >   perf stat: Plugin retire_lat value from sampled data to evsel
> >   perf vendor events intel: Add MTL metric json files
> >   perf stat: Add command line option for enabling tpebs recording
> >   perf Document: Add TPEBS to Documents
> >   perf test: Add test for Intel TPEBS counting mode
>
> Thanks for your persistence!
>
> Reviewed-by: Namhyung Kim <namhyung@kernel.org>

Ping.

Thanks,
Ian
Re: [RFC PATCH v18 0/8] TPEBS counting mode support
Posted by Arnaldo Carvalho de Melo 1 year, 4 months ago
On Mon, Aug 05, 2024 at 08:10:12AM -0700, Ian Rogers wrote:
> On Mon, Jul 22, 2024 at 10:38 AM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hello Weilin,
> >
> > On Fri, Jul 19, 2024 at 11:21 PM <weilin.wang@intel.com> wrote:
> > >
> > > From: Weilin Wang <weilin.wang@intel.com>
> > >
> > > Change in v18:
> > >  - Update to exit 2 in TPEBS shell test when not on Intel platform.
> > >  - Several updates to use EVLIST_CTL_CMD_ENABLE_TAG, EVLIST_CTL_CMD_ACK_TAG, and
> > >  etc.
> > >
> > > Changes in v17:
> > >  - Add a poll on control fifo ack_fd to ensure program returns successfully when
> > >  perf record failed for any reason.
> > >  - Add a check in the tpebs test to only run on Intel platforms.
> > >
> > > Changes in v16:
> > >  - Update tpebs bash test code and variable name.
> > >  - Add a check to ensure only use "-C" option when cpumap is not "-1" when
> > >  forking "perf record".
> > >
> > > Changes in v15:
> > >  - Revert changes added for python import test failure in v14 because the code
> > >  works correctly with the new python build code.
> > >  - Update the command line option to --record-tpebs.
> > >
> > > Changes in v14:
> > >  - Fix the python import test failure. We cannot support PYTHON_PERF because it
> > >  will trigger a chain of dependency issues if we add intel-tpebs.c to it. So,
> > >  only enable tpebs functions in evsel and evlist when PYTHON_PERF is not
> > >  defined.
> > >  - Fix shellcheck warning for the tpebs test.
> > >
> > > Changes in v13:
> > >  - Add document for the command line option and fix build error in non-x86_64.
> > >  - Update example with non-zero retire_latency value when tpebs recording is
> > >  enabled.
> > >  - Add tpebs_stop() back to tpebs_set_evsel() because in hybrid platform, when
> > >  the forked perf record is not killed, the reader thread does not get any
> > >  sampled data from the PIPE. As a result, tpebs_set_evesel() will always return
> > >  zero on retire_latency values. This does not happen on my test GNR machine.
> > >  Since -I is not supported yet, I think we should add tpebs_stop() to ensure
> > >  correctness for now. More investigation is required here when we work on
> > >  supporting -I mode.
> > >  - Rebase code on top of perf-tools-next.
> > >
> > > Changes in v12:
> > >  - Update MTL metric JSON file to include E-Core TMA3.6 changes.
> > >  - Update comments and code for better code quality. Keep tpebs_start() and
> > >  tpebs_delete() at evsel level for now and add comments on these functions with
> > >  more details about potential future changes. Remove tpebs_stop() call from
> > >  tpebs_set_evsel(). Simplify the tpebs_start() and tpebs_stop()/tpebs_delete()
> > >  interfaces. Also fixed the bugs on not freed "new" pointer and the incorrect
> > >  variable value assignment to val instead of counter->val.
> > >
> > > Changes in v11:
> > >  - Make retire_latency evsels not opened for counting. This works correctly now
> > >  with the code Namhyung suggested that adjusting group read size when
> > >  retire_latency evsels included in the group.
> > >  - Update retire_latency value assignment using rint() to reduce precision loss
> > >  while keeping code generic.
> > >  - Fix the build error caused by not used variable in the test.
> > >
> > > Other changes in v10:
> > >  - Change perf record fork from perf stat to evsel. All the major operations
> > >  like tpebs start, stop, read_evsel should directly work through evsel.
> > >  - Make intel-tpebs x86_64 only. This change is cross-compiled to arm64.
> > >  - Put tpebs code to intel-tepbs and simplify intel-tpebs APIs to minimum number
> > > of functions and variables. Update funtion name and variable names to use
> > > consistent prefix. Also improve error handling.
> > >  - Integrate code patch from Ian for the :R parser.
> > >  - Update MTL metrics to TMA 4.8.
> > >
> > > V9: https://lore.kernel.org/all/20240521173952.3397644-1-weilin.wang@intel.com/
> > >
> > > Changes in v9:
> > >  - Update the retire_latency result print and metric calculation method. Plugin
> > > the value to evsel so that no special code is required.
> > >  - Update --control:fifo to use pipe instead of named pipe.
> > >  - Add test for TPEBS counting mode.
> > >  - Update Document with more details.
> > >
> > > Changes in v8:
> > >  - In this revision, the code is updated to base on Ian's patch on R modifier
> > > parser https://lore.kernel.org/lkml/20240428053616.1125891-3-irogers@google.com/
> > > After this change, there is no special code required for R modifier in
> > > metricgroup.c and metricgroup.h files.
> > >
> > > Caveat of this change:
> > >   Ideally, we will need to add special handling to skip counting events with R
> > > modifier in evsel. Currently, this is not implemented so the event with :R will
> > > be both counted and sampled. Usually, in a metric formula that uses retire_latency,
> > > it would already require to count the event. As a result, we will endup count the
> > > same event twice. This should be able to be handled properly when we finalize our
> > > design on evsel R modifier support.
> > >
> > >  - Move TPEBS specific code out from main perf stat code to separate files in
> > > util/intel-tpebs.c and util/intel-tpebs.h. [Namhyung]
> > >  - Use --control:fifo to ack perf stat from forked perf record instead of sleep(2) [Namhyung]
> > >  - Add introductions about TPEBS and R modifier in Documents. [Namhyung]
> > >
> > >
> > > Changes in v7:
> > >  - Update code and comments for better code quality [Namhyung]
> > >  - Add a separate commit for perf data [Namhyung]
> > >  - Update retire latency print function to improve alignment [Namhyung]
> > >
> > > Changes in v6:
> > >  - Update code and add comments for better code quality [Namhyung]
> > >  - Remove the added fd var and directly pass the opened fd to data.file.fd [Namhyung]
> > >  - Add kill() to stop perf record when perf stat exists early [Namhyung]
> > >  - Add command opt check to ensure only start perf record when -a/-C given [Namhyung]
> > >  - Squash commits [Namhyung]
> > >
> > > Changes in v5:
> > >  - Update code and add comments for better code quality [Ian]
> > >
> > > Changes in v4:
> > >  - Remove uncessary debug print and update code and comments for better
> > > readability and quality [Namhyung]
> > >  - Update mtl metric json file with consistent TmaL1 and TopdownL1 metricgroup
> > >
> > > Changes in v3:
> > >  - Remove ':' when event name has '@' [Ian]
> > >  - Use 'R' as the modifier instead of "retire_latency" [Ian]
> > >
> > > Changes in v2:
> > >  - Add MTL metric file
> > >  - Add more descriptions and example to the patch [Arnaldo]
> > >
> > > Here is an example of running perf stat to collect a metric that uses
> > > retire_latency value of event MEM_INST_RETIRED.STLB_HIT_STORES on a MTL system.
> > >
> > > In this simple example, there is no MEM_INST_RETIRED.STLB_HIT_STORES sample.
> > > Therefore, the MEM_INST_RETIRED.STLB_HIT_STORES:p count and retire_latency value
> > > are all 0.
> > >
> > > ./perf stat -M tma_dtlb_store -a -- sleep 1
> > >
> > > [ perf record: Woken up 1 times to write data ]
> > > [ perf record: Captured and wrote 0.000 MB - ]
> > >
> > >  Performance counter stats for 'system wide':
> > >
> > >        181,047,168      cpu_core/TOPDOWN.SLOTS/          #      0.6 %  tma_dtlb_store
> > >          3,195,608      cpu_core/topdown-retiring/
> > >         40,156,649      cpu_core/topdown-mem-bound/
> > >          3,550,925      cpu_core/topdown-bad-spec/
> > >        117,571,818      cpu_core/topdown-fe-bound/
> > >         57,118,087      cpu_core/topdown-be-bound/
> > >             69,179      cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/
> > >              4,582      cpu_core/MEM_INST_RETIRED.STLB_HIT_STORES/
> > >         30,183,104      cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/
> > >         30,556,790      cpu_core/CPU_CLK_UNHALTED.THREAD/
> > >            168,486      cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/
> > >               0.00 MEM_INST_RETIRED.STLB_HIT_STORES:p       0        0
> > >
> > >        1.003105924 seconds time elapsed
> > >
> > > v1:
> > > TPEBS is one of the features provided by the next generation of Intel PMU.
> > > Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set Extensions
> > > Programming Reference" [1] for more details about this feature.
> > >
> > > This set of patches supports TPEBS in counting mode. The code works in the
> > > following way: it forks a perf record process from perf stat when retire_latency
> > > of one or more events are used in a metric formula. Perf stat would send a
> > > SIGTERM signal to perf record before it needs the retire latency value for
> > > metric calculation. Perf stat will then process sample data to extract the
> > > retire latency data for metric calculations. Currently, the code uses the
> > > arithmetic average of retire latency values.
> > >
> > > [1] https://www.intel.com/content/www/us/en/content-details/812218/intel-architecture-instruction-set-extensions-programming-reference.html?wapkw=future%20features
> > >
> > >
> > >
> > >
> > > Ian Rogers (1):
> > >   perf parse-events: Add a retirement latency modifier
> > >
> > > Weilin Wang (7):
> > >   perf data: Allow to use given fd in data->file.fd
> > >   perf stat: Fork and launch perf record when perf stat needs to get
> > >     retire latency value for a metric.
> > >   perf stat: Plugin retire_lat value from sampled data to evsel
> > >   perf vendor events intel: Add MTL metric json files
> > >   perf stat: Add command line option for enabling tpebs recording
> > >   perf Document: Add TPEBS to Documents
> > >   perf test: Add test for Intel TPEBS counting mode
> >
> > Thanks for your persistence!
> >
> > Reviewed-by: Namhyung Kim <namhyung@kernel.org>
> 
> Ping.

I guess Namhyung's reviewed-by should suffice, but since you're pinging
and I saw previous comments about this serie, would it be possible to
get your Reviewed-by as well?

Thanks,

- Arnaldo
Re: [RFC PATCH v18 0/8] TPEBS counting mode support
Posted by Ian Rogers 1 year, 4 months ago
On Mon, Aug 5, 2024 at 12:34 PM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Mon, Aug 05, 2024 at 08:10:12AM -0700, Ian Rogers wrote:
> > On Mon, Jul 22, 2024 at 10:38 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > Hello Weilin,
> > >
> > > On Fri, Jul 19, 2024 at 11:21 PM <weilin.wang@intel.com> wrote:
> > > >
> > > > From: Weilin Wang <weilin.wang@intel.com>
> > > >
> > > > Change in v18:
> > > >  - Update to exit 2 in TPEBS shell test when not on Intel platform.
> > > >  - Several updates to use EVLIST_CTL_CMD_ENABLE_TAG, EVLIST_CTL_CMD_ACK_TAG, and
> > > >  etc.
> > > >
> > > > Changes in v17:
> > > >  - Add a poll on control fifo ack_fd to ensure program returns successfully when
> > > >  perf record failed for any reason.
> > > >  - Add a check in the tpebs test to only run on Intel platforms.
> > > >
> > > > Changes in v16:
> > > >  - Update tpebs bash test code and variable name.
> > > >  - Add a check to ensure only use "-C" option when cpumap is not "-1" when
> > > >  forking "perf record".
> > > >
> > > > Changes in v15:
> > > >  - Revert changes added for python import test failure in v14 because the code
> > > >  works correctly with the new python build code.
> > > >  - Update the command line option to --record-tpebs.
> > > >
> > > > Changes in v14:
> > > >  - Fix the python import test failure. We cannot support PYTHON_PERF because it
> > > >  will trigger a chain of dependency issues if we add intel-tpebs.c to it. So,
> > > >  only enable tpebs functions in evsel and evlist when PYTHON_PERF is not
> > > >  defined.
> > > >  - Fix shellcheck warning for the tpebs test.
> > > >
> > > > Changes in v13:
> > > >  - Add document for the command line option and fix build error in non-x86_64.
> > > >  - Update example with non-zero retire_latency value when tpebs recording is
> > > >  enabled.
> > > >  - Add tpebs_stop() back to tpebs_set_evsel() because in hybrid platform, when
> > > >  the forked perf record is not killed, the reader thread does not get any
> > > >  sampled data from the PIPE. As a result, tpebs_set_evesel() will always return
> > > >  zero on retire_latency values. This does not happen on my test GNR machine.
> > > >  Since -I is not supported yet, I think we should add tpebs_stop() to ensure
> > > >  correctness for now. More investigation is required here when we work on
> > > >  supporting -I mode.
> > > >  - Rebase code on top of perf-tools-next.
> > > >
> > > > Changes in v12:
> > > >  - Update MTL metric JSON file to include E-Core TMA3.6 changes.
> > > >  - Update comments and code for better code quality. Keep tpebs_start() and
> > > >  tpebs_delete() at evsel level for now and add comments on these functions with
> > > >  more details about potential future changes. Remove tpebs_stop() call from
> > > >  tpebs_set_evsel(). Simplify the tpebs_start() and tpebs_stop()/tpebs_delete()
> > > >  interfaces. Also fixed the bugs on not freed "new" pointer and the incorrect
> > > >  variable value assignment to val instead of counter->val.
> > > >
> > > > Changes in v11:
> > > >  - Make retire_latency evsels not opened for counting. This works correctly now
> > > >  with the code Namhyung suggested that adjusting group read size when
> > > >  retire_latency evsels included in the group.
> > > >  - Update retire_latency value assignment using rint() to reduce precision loss
> > > >  while keeping code generic.
> > > >  - Fix the build error caused by not used variable in the test.
> > > >
> > > > Other changes in v10:
> > > >  - Change perf record fork from perf stat to evsel. All the major operations
> > > >  like tpebs start, stop, read_evsel should directly work through evsel.
> > > >  - Make intel-tpebs x86_64 only. This change is cross-compiled to arm64.
> > > >  - Put tpebs code to intel-tepbs and simplify intel-tpebs APIs to minimum number
> > > > of functions and variables. Update funtion name and variable names to use
> > > > consistent prefix. Also improve error handling.
> > > >  - Integrate code patch from Ian for the :R parser.
> > > >  - Update MTL metrics to TMA 4.8.
> > > >
> > > > V9: https://lore.kernel.org/all/20240521173952.3397644-1-weilin.wang@intel.com/
> > > >
> > > > Changes in v9:
> > > >  - Update the retire_latency result print and metric calculation method. Plugin
> > > > the value to evsel so that no special code is required.
> > > >  - Update --control:fifo to use pipe instead of named pipe.
> > > >  - Add test for TPEBS counting mode.
> > > >  - Update Document with more details.
> > > >
> > > > Changes in v8:
> > > >  - In this revision, the code is updated to base on Ian's patch on R modifier
> > > > parser https://lore.kernel.org/lkml/20240428053616.1125891-3-irogers@google.com/
> > > > After this change, there is no special code required for R modifier in
> > > > metricgroup.c and metricgroup.h files.
> > > >
> > > > Caveat of this change:
> > > >   Ideally, we will need to add special handling to skip counting events with R
> > > > modifier in evsel. Currently, this is not implemented so the event with :R will
> > > > be both counted and sampled. Usually, in a metric formula that uses retire_latency,
> > > > it would already require to count the event. As a result, we will endup count the
> > > > same event twice. This should be able to be handled properly when we finalize our
> > > > design on evsel R modifier support.
> > > >
> > > >  - Move TPEBS specific code out from main perf stat code to separate files in
> > > > util/intel-tpebs.c and util/intel-tpebs.h. [Namhyung]
> > > >  - Use --control:fifo to ack perf stat from forked perf record instead of sleep(2) [Namhyung]
> > > >  - Add introductions about TPEBS and R modifier in Documents. [Namhyung]
> > > >
> > > >
> > > > Changes in v7:
> > > >  - Update code and comments for better code quality [Namhyung]
> > > >  - Add a separate commit for perf data [Namhyung]
> > > >  - Update retire latency print function to improve alignment [Namhyung]
> > > >
> > > > Changes in v6:
> > > >  - Update code and add comments for better code quality [Namhyung]
> > > >  - Remove the added fd var and directly pass the opened fd to data.file.fd [Namhyung]
> > > >  - Add kill() to stop perf record when perf stat exists early [Namhyung]
> > > >  - Add command opt check to ensure only start perf record when -a/-C given [Namhyung]
> > > >  - Squash commits [Namhyung]
> > > >
> > > > Changes in v5:
> > > >  - Update code and add comments for better code quality [Ian]
> > > >
> > > > Changes in v4:
> > > >  - Remove uncessary debug print and update code and comments for better
> > > > readability and quality [Namhyung]
> > > >  - Update mtl metric json file with consistent TmaL1 and TopdownL1 metricgroup
> > > >
> > > > Changes in v3:
> > > >  - Remove ':' when event name has '@' [Ian]
> > > >  - Use 'R' as the modifier instead of "retire_latency" [Ian]
> > > >
> > > > Changes in v2:
> > > >  - Add MTL metric file
> > > >  - Add more descriptions and example to the patch [Arnaldo]
> > > >
> > > > Here is an example of running perf stat to collect a metric that uses
> > > > retire_latency value of event MEM_INST_RETIRED.STLB_HIT_STORES on a MTL system.
> > > >
> > > > In this simple example, there is no MEM_INST_RETIRED.STLB_HIT_STORES sample.
> > > > Therefore, the MEM_INST_RETIRED.STLB_HIT_STORES:p count and retire_latency value
> > > > are all 0.
> > > >
> > > > ./perf stat -M tma_dtlb_store -a -- sleep 1
> > > >
> > > > [ perf record: Woken up 1 times to write data ]
> > > > [ perf record: Captured and wrote 0.000 MB - ]
> > > >
> > > >  Performance counter stats for 'system wide':
> > > >
> > > >        181,047,168      cpu_core/TOPDOWN.SLOTS/          #      0.6 %  tma_dtlb_store
> > > >          3,195,608      cpu_core/topdown-retiring/
> > > >         40,156,649      cpu_core/topdown-mem-bound/
> > > >          3,550,925      cpu_core/topdown-bad-spec/
> > > >        117,571,818      cpu_core/topdown-fe-bound/
> > > >         57,118,087      cpu_core/topdown-be-bound/
> > > >             69,179      cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/
> > > >              4,582      cpu_core/MEM_INST_RETIRED.STLB_HIT_STORES/
> > > >         30,183,104      cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/
> > > >         30,556,790      cpu_core/CPU_CLK_UNHALTED.THREAD/
> > > >            168,486      cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/
> > > >               0.00 MEM_INST_RETIRED.STLB_HIT_STORES:p       0        0
> > > >
> > > >        1.003105924 seconds time elapsed
> > > >
> > > > v1:
> > > > TPEBS is one of the features provided by the next generation of Intel PMU.
> > > > Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set Extensions
> > > > Programming Reference" [1] for more details about this feature.
> > > >
> > > > This set of patches supports TPEBS in counting mode. The code works in the
> > > > following way: it forks a perf record process from perf stat when retire_latency
> > > > of one or more events are used in a metric formula. Perf stat would send a
> > > > SIGTERM signal to perf record before it needs the retire latency value for
> > > > metric calculation. Perf stat will then process sample data to extract the
> > > > retire latency data for metric calculations. Currently, the code uses the
> > > > arithmetic average of retire latency values.
> > > >
> > > > [1] https://www.intel.com/content/www/us/en/content-details/812218/intel-architecture-instruction-set-extensions-programming-reference.html?wapkw=future%20features
> > > >
> > > >
> > > >
> > > >
> > > > Ian Rogers (1):
> > > >   perf parse-events: Add a retirement latency modifier
> > > >
> > > > Weilin Wang (7):
> > > >   perf data: Allow to use given fd in data->file.fd
> > > >   perf stat: Fork and launch perf record when perf stat needs to get
> > > >     retire latency value for a metric.
> > > >   perf stat: Plugin retire_lat value from sampled data to evsel
> > > >   perf vendor events intel: Add MTL metric json files
> > > >   perf stat: Add command line option for enabling tpebs recording
> > > >   perf Document: Add TPEBS to Documents
> > > >   perf test: Add test for Intel TPEBS counting mode
> > >
> > > Thanks for your persistence!
> > >
> > > Reviewed-by: Namhyung Kim <namhyung@kernel.org>
> >
> > Ping.
>
> I guess Namhyung's reviewed-by should suffice, but since you're pinging
> and I saw previous comments about this serie, would it be possible to
> get your Reviewed-by as well?

I'm happy with most of the changes and they can have my reviewed-by.
I'm not happy with:
perf stat: Fork and launch 'perf record' when 'perf stat' needs to get
retire latency value for a metric.
so please leave it off there.

Context: originally the patches aimed to abstract everything inside of
the evsel, now builtin-stat is manually starting and ending the
sub-process and has to be aware of TPEBS. There are ugly architectural
dependencies. My reviewed-by was correct on the earlier version, as
you could make a tpebs evsel just like any... other, open, close, read
it, etc. This would be useful say in a python script. This version to
my eyes is worse as all TPEBS dependencies are made explicit and the
caller needs to be aware of TPEBS. I understand Weilin was asked to
make some of the changes, and there was a desire for efficiency, but I
think things were better in earlier versions.

Thanks,
Ian
Re: [RFC PATCH v18 0/8] TPEBS counting mode support
Posted by Arnaldo Carvalho de Melo 1 year, 4 months ago
On Mon, Aug 05, 2024 at 04:33:44PM -0700, Ian Rogers wrote:
> On Mon, Aug 5, 2024 at 12:34 PM Arnaldo Carvalho de Melo
> <acme@kernel.org> wrote:
> >
> > On Mon, Aug 05, 2024 at 08:10:12AM -0700, Ian Rogers wrote:
> > > On Mon, Jul 22, 2024 at 10:38 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > Hello Weilin,
> > > >
> > > > On Fri, Jul 19, 2024 at 11:21 PM <weilin.wang@intel.com> wrote:
> > > > >
> > > > > From: Weilin Wang <weilin.wang@intel.com>
> > > > >
> > > > > Change in v18:
> > > > >  - Update to exit 2 in TPEBS shell test when not on Intel platform.
> > > > >  - Several updates to use EVLIST_CTL_CMD_ENABLE_TAG, EVLIST_CTL_CMD_ACK_TAG, and
> > > > >  etc.
> > > > >
> > > > > Changes in v17:
> > > > >  - Add a poll on control fifo ack_fd to ensure program returns successfully when
> > > > >  perf record failed for any reason.
> > > > >  - Add a check in the tpebs test to only run on Intel platforms.
> > > > >
> > > > > Changes in v16:
> > > > >  - Update tpebs bash test code and variable name.
> > > > >  - Add a check to ensure only use "-C" option when cpumap is not "-1" when
> > > > >  forking "perf record".
> > > > >
> > > > > Changes in v15:
> > > > >  - Revert changes added for python import test failure in v14 because the code
> > > > >  works correctly with the new python build code.
> > > > >  - Update the command line option to --record-tpebs.
> > > > >
> > > > > Changes in v14:
> > > > >  - Fix the python import test failure. We cannot support PYTHON_PERF because it
> > > > >  will trigger a chain of dependency issues if we add intel-tpebs.c to it. So,
> > > > >  only enable tpebs functions in evsel and evlist when PYTHON_PERF is not
> > > > >  defined.
> > > > >  - Fix shellcheck warning for the tpebs test.
> > > > >
> > > > > Changes in v13:
> > > > >  - Add document for the command line option and fix build error in non-x86_64.
> > > > >  - Update example with non-zero retire_latency value when tpebs recording is
> > > > >  enabled.
> > > > >  - Add tpebs_stop() back to tpebs_set_evsel() because in hybrid platform, when
> > > > >  the forked perf record is not killed, the reader thread does not get any
> > > > >  sampled data from the PIPE. As a result, tpebs_set_evesel() will always return
> > > > >  zero on retire_latency values. This does not happen on my test GNR machine.
> > > > >  Since -I is not supported yet, I think we should add tpebs_stop() to ensure
> > > > >  correctness for now. More investigation is required here when we work on
> > > > >  supporting -I mode.
> > > > >  - Rebase code on top of perf-tools-next.
> > > > >
> > > > > Changes in v12:
> > > > >  - Update MTL metric JSON file to include E-Core TMA3.6 changes.
> > > > >  - Update comments and code for better code quality. Keep tpebs_start() and
> > > > >  tpebs_delete() at evsel level for now and add comments on these functions with
> > > > >  more details about potential future changes. Remove tpebs_stop() call from
> > > > >  tpebs_set_evsel(). Simplify the tpebs_start() and tpebs_stop()/tpebs_delete()
> > > > >  interfaces. Also fixed the bugs on not freed "new" pointer and the incorrect
> > > > >  variable value assignment to val instead of counter->val.
> > > > >
> > > > > Changes in v11:
> > > > >  - Make retire_latency evsels not opened for counting. This works correctly now
> > > > >  with the code Namhyung suggested that adjusting group read size when
> > > > >  retire_latency evsels included in the group.
> > > > >  - Update retire_latency value assignment using rint() to reduce precision loss
> > > > >  while keeping code generic.
> > > > >  - Fix the build error caused by not used variable in the test.
> > > > >
> > > > > Other changes in v10:
> > > > >  - Change perf record fork from perf stat to evsel. All the major operations
> > > > >  like tpebs start, stop, read_evsel should directly work through evsel.
> > > > >  - Make intel-tpebs x86_64 only. This change is cross-compiled to arm64.
> > > > >  - Put tpebs code to intel-tepbs and simplify intel-tpebs APIs to minimum number
> > > > > of functions and variables. Update funtion name and variable names to use
> > > > > consistent prefix. Also improve error handling.
> > > > >  - Integrate code patch from Ian for the :R parser.
> > > > >  - Update MTL metrics to TMA 4.8.
> > > > >
> > > > > V9: https://lore.kernel.org/all/20240521173952.3397644-1-weilin.wang@intel.com/
> > > > >
> > > > > Changes in v9:
> > > > >  - Update the retire_latency result print and metric calculation method. Plugin
> > > > > the value to evsel so that no special code is required.
> > > > >  - Update --control:fifo to use pipe instead of named pipe.
> > > > >  - Add test for TPEBS counting mode.
> > > > >  - Update Document with more details.
> > > > >
> > > > > Changes in v8:
> > > > >  - In this revision, the code is updated to base on Ian's patch on R modifier
> > > > > parser https://lore.kernel.org/lkml/20240428053616.1125891-3-irogers@google.com/
> > > > > After this change, there is no special code required for R modifier in
> > > > > metricgroup.c and metricgroup.h files.
> > > > >
> > > > > Caveat of this change:
> > > > >   Ideally, we will need to add special handling to skip counting events with R
> > > > > modifier in evsel. Currently, this is not implemented so the event with :R will
> > > > > be both counted and sampled. Usually, in a metric formula that uses retire_latency,
> > > > > it would already require to count the event. As a result, we will endup count the
> > > > > same event twice. This should be able to be handled properly when we finalize our
> > > > > design on evsel R modifier support.
> > > > >
> > > > >  - Move TPEBS specific code out from main perf stat code to separate files in
> > > > > util/intel-tpebs.c and util/intel-tpebs.h. [Namhyung]
> > > > >  - Use --control:fifo to ack perf stat from forked perf record instead of sleep(2) [Namhyung]
> > > > >  - Add introductions about TPEBS and R modifier in Documents. [Namhyung]
> > > > >
> > > > >
> > > > > Changes in v7:
> > > > >  - Update code and comments for better code quality [Namhyung]
> > > > >  - Add a separate commit for perf data [Namhyung]
> > > > >  - Update retire latency print function to improve alignment [Namhyung]
> > > > >
> > > > > Changes in v6:
> > > > >  - Update code and add comments for better code quality [Namhyung]
> > > > >  - Remove the added fd var and directly pass the opened fd to data.file.fd [Namhyung]
> > > > >  - Add kill() to stop perf record when perf stat exists early [Namhyung]
> > > > >  - Add command opt check to ensure only start perf record when -a/-C given [Namhyung]
> > > > >  - Squash commits [Namhyung]
> > > > >
> > > > > Changes in v5:
> > > > >  - Update code and add comments for better code quality [Ian]
> > > > >
> > > > > Changes in v4:
> > > > >  - Remove uncessary debug print and update code and comments for better
> > > > > readability and quality [Namhyung]
> > > > >  - Update mtl metric json file with consistent TmaL1 and TopdownL1 metricgroup
> > > > >
> > > > > Changes in v3:
> > > > >  - Remove ':' when event name has '@' [Ian]
> > > > >  - Use 'R' as the modifier instead of "retire_latency" [Ian]
> > > > >
> > > > > Changes in v2:
> > > > >  - Add MTL metric file
> > > > >  - Add more descriptions and example to the patch [Arnaldo]
> > > > >
> > > > > Here is an example of running perf stat to collect a metric that uses
> > > > > retire_latency value of event MEM_INST_RETIRED.STLB_HIT_STORES on a MTL system.
> > > > >
> > > > > In this simple example, there is no MEM_INST_RETIRED.STLB_HIT_STORES sample.
> > > > > Therefore, the MEM_INST_RETIRED.STLB_HIT_STORES:p count and retire_latency value
> > > > > are all 0.
> > > > >
> > > > > ./perf stat -M tma_dtlb_store -a -- sleep 1
> > > > >
> > > > > [ perf record: Woken up 1 times to write data ]
> > > > > [ perf record: Captured and wrote 0.000 MB - ]
> > > > >
> > > > >  Performance counter stats for 'system wide':
> > > > >
> > > > >        181,047,168      cpu_core/TOPDOWN.SLOTS/          #      0.6 %  tma_dtlb_store
> > > > >          3,195,608      cpu_core/topdown-retiring/
> > > > >         40,156,649      cpu_core/topdown-mem-bound/
> > > > >          3,550,925      cpu_core/topdown-bad-spec/
> > > > >        117,571,818      cpu_core/topdown-fe-bound/
> > > > >         57,118,087      cpu_core/topdown-be-bound/
> > > > >             69,179      cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/
> > > > >              4,582      cpu_core/MEM_INST_RETIRED.STLB_HIT_STORES/
> > > > >         30,183,104      cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/
> > > > >         30,556,790      cpu_core/CPU_CLK_UNHALTED.THREAD/
> > > > >            168,486      cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/
> > > > >               0.00 MEM_INST_RETIRED.STLB_HIT_STORES:p       0        0
> > > > >
> > > > >        1.003105924 seconds time elapsed
> > > > >
> > > > > v1:
> > > > > TPEBS is one of the features provided by the next generation of Intel PMU.
> > > > > Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set Extensions
> > > > > Programming Reference" [1] for more details about this feature.
> > > > >
> > > > > This set of patches supports TPEBS in counting mode. The code works in the
> > > > > following way: it forks a perf record process from perf stat when retire_latency
> > > > > of one or more events are used in a metric formula. Perf stat would send a
> > > > > SIGTERM signal to perf record before it needs the retire latency value for
> > > > > metric calculation. Perf stat will then process sample data to extract the
> > > > > retire latency data for metric calculations. Currently, the code uses the
> > > > > arithmetic average of retire latency values.
> > > > >
> > > > > [1] https://www.intel.com/content/www/us/en/content-details/812218/intel-architecture-instruction-set-extensions-programming-reference.html?wapkw=future%20features
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Ian Rogers (1):
> > > > >   perf parse-events: Add a retirement latency modifier
> > > > >
> > > > > Weilin Wang (7):
> > > > >   perf data: Allow to use given fd in data->file.fd
> > > > >   perf stat: Fork and launch perf record when perf stat needs to get
> > > > >     retire latency value for a metric.
> > > > >   perf stat: Plugin retire_lat value from sampled data to evsel
> > > > >   perf vendor events intel: Add MTL metric json files
> > > > >   perf stat: Add command line option for enabling tpebs recording
> > > > >   perf Document: Add TPEBS to Documents
> > > > >   perf test: Add test for Intel TPEBS counting mode
> > > >
> > > > Thanks for your persistence!
> > > >
> > > > Reviewed-by: Namhyung Kim <namhyung@kernel.org>
> > >
> > > Ping.
> >
> > I guess Namhyung's reviewed-by should suffice, but since you're pinging
> > and I saw previous comments about this serie, would it be possible to
> > get your Reviewed-by as well?
> 
> I'm happy with most of the changes and they can have my reviewed-by.

Thanks.

> I'm not happy with:
> perf stat: Fork and launch 'perf record' when 'perf stat' needs to get
> retire latency value for a metric.

I don't like it either, but I couldn't find the time to engage in this
discussion to try to offer an alternative patch and since Namhyung had
provided his reviewed-by and you pinged me, I felt compelled to move
forward on this.

> so please leave it off there.

So, what are you proposing, that we process up to some patch?

For now I'm removing the whole series from perf-tools-next.

- Arnaldo
 
> Context: originally the patches aimed to abstract everything inside of
> the evsel, now builtin-stat is manually starting and ending the
> sub-process and has to be aware of TPEBS. There are ugly architectural
> dependencies. My reviewed-by was correct on the earlier version, as
> you could make a tpebs evsel just like any... other, open, close, read
> it, etc. This would be useful say in a python script. This version to
> my eyes is worse as all TPEBS dependencies are made explicit and the
> caller needs to be aware of TPEBS. I understand Weilin was asked to
> make some of the changes, and there was a desire for efficiency, but I
> think things were better in earlier versions.
> 
> Thanks,
> Ian
Re: [RFC PATCH v18 0/8] TPEBS counting mode support
Posted by Ian Rogers 1 year, 4 months ago
On Tue, Aug 6, 2024 at 6:32 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Mon, Aug 05, 2024 at 04:33:44PM -0700, Ian Rogers wrote:
> > On Mon, Aug 5, 2024 at 12:34 PM Arnaldo Carvalho de Melo
> > <acme@kernel.org> wrote:
> > >
> > > On Mon, Aug 05, 2024 at 08:10:12AM -0700, Ian Rogers wrote:
> > > > On Mon, Jul 22, 2024 at 10:38 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > > > >
> > > > > Hello Weilin,
> > > > >
> > > > > On Fri, Jul 19, 2024 at 11:21 PM <weilin.wang@intel.com> wrote:
> > > > > >
> > > > > > From: Weilin Wang <weilin.wang@intel.com>
> > > > > >
> > > > > > Change in v18:
> > > > > >  - Update to exit 2 in TPEBS shell test when not on Intel platform.
> > > > > >  - Several updates to use EVLIST_CTL_CMD_ENABLE_TAG, EVLIST_CTL_CMD_ACK_TAG, and
> > > > > >  etc.
> > > > > >
> > > > > > Changes in v17:
> > > > > >  - Add a poll on control fifo ack_fd to ensure program returns successfully when
> > > > > >  perf record failed for any reason.
> > > > > >  - Add a check in the tpebs test to only run on Intel platforms.
> > > > > >
> > > > > > Changes in v16:
> > > > > >  - Update tpebs bash test code and variable name.
> > > > > >  - Add a check to ensure only use "-C" option when cpumap is not "-1" when
> > > > > >  forking "perf record".
> > > > > >
> > > > > > Changes in v15:
> > > > > >  - Revert changes added for python import test failure in v14 because the code
> > > > > >  works correctly with the new python build code.
> > > > > >  - Update the command line option to --record-tpebs.
> > > > > >
> > > > > > Changes in v14:
> > > > > >  - Fix the python import test failure. We cannot support PYTHON_PERF because it
> > > > > >  will trigger a chain of dependency issues if we add intel-tpebs.c to it. So,
> > > > > >  only enable tpebs functions in evsel and evlist when PYTHON_PERF is not
> > > > > >  defined.
> > > > > >  - Fix shellcheck warning for the tpebs test.
> > > > > >
> > > > > > Changes in v13:
> > > > > >  - Add document for the command line option and fix build error in non-x86_64.
> > > > > >  - Update example with non-zero retire_latency value when tpebs recording is
> > > > > >  enabled.
> > > > > >  - Add tpebs_stop() back to tpebs_set_evsel() because in hybrid platform, when
> > > > > >  the forked perf record is not killed, the reader thread does not get any
> > > > > >  sampled data from the PIPE. As a result, tpebs_set_evesel() will always return
> > > > > >  zero on retire_latency values. This does not happen on my test GNR machine.
> > > > > >  Since -I is not supported yet, I think we should add tpebs_stop() to ensure
> > > > > >  correctness for now. More investigation is required here when we work on
> > > > > >  supporting -I mode.
> > > > > >  - Rebase code on top of perf-tools-next.
> > > > > >
> > > > > > Changes in v12:
> > > > > >  - Update MTL metric JSON file to include E-Core TMA3.6 changes.
> > > > > >  - Update comments and code for better code quality. Keep tpebs_start() and
> > > > > >  tpebs_delete() at evsel level for now and add comments on these functions with
> > > > > >  more details about potential future changes. Remove tpebs_stop() call from
> > > > > >  tpebs_set_evsel(). Simplify the tpebs_start() and tpebs_stop()/tpebs_delete()
> > > > > >  interfaces. Also fixed the bugs on not freed "new" pointer and the incorrect
> > > > > >  variable value assignment to val instead of counter->val.
> > > > > >
> > > > > > Changes in v11:
> > > > > >  - Make retire_latency evsels not opened for counting. This works correctly now
> > > > > >  with the code Namhyung suggested that adjusting group read size when
> > > > > >  retire_latency evsels included in the group.
> > > > > >  - Update retire_latency value assignment using rint() to reduce precision loss
> > > > > >  while keeping code generic.
> > > > > >  - Fix the build error caused by not used variable in the test.
> > > > > >
> > > > > > Other changes in v10:
> > > > > >  - Change perf record fork from perf stat to evsel. All the major operations
> > > > > >  like tpebs start, stop, read_evsel should directly work through evsel.
> > > > > >  - Make intel-tpebs x86_64 only. This change is cross-compiled to arm64.
> > > > > >  - Put tpebs code to intel-tepbs and simplify intel-tpebs APIs to minimum number
> > > > > > of functions and variables. Update funtion name and variable names to use
> > > > > > consistent prefix. Also improve error handling.
> > > > > >  - Integrate code patch from Ian for the :R parser.
> > > > > >  - Update MTL metrics to TMA 4.8.
> > > > > >
> > > > > > V9: https://lore.kernel.org/all/20240521173952.3397644-1-weilin.wang@intel.com/
> > > > > >
> > > > > > Changes in v9:
> > > > > >  - Update the retire_latency result print and metric calculation method. Plugin
> > > > > > the value to evsel so that no special code is required.
> > > > > >  - Update --control:fifo to use pipe instead of named pipe.
> > > > > >  - Add test for TPEBS counting mode.
> > > > > >  - Update Document with more details.
> > > > > >
> > > > > > Changes in v8:
> > > > > >  - In this revision, the code is updated to base on Ian's patch on R modifier
> > > > > > parser https://lore.kernel.org/lkml/20240428053616.1125891-3-irogers@google.com/
> > > > > > After this change, there is no special code required for R modifier in
> > > > > > metricgroup.c and metricgroup.h files.
> > > > > >
> > > > > > Caveat of this change:
> > > > > >   Ideally, we will need to add special handling to skip counting events with R
> > > > > > modifier in evsel. Currently, this is not implemented so the event with :R will
> > > > > > be both counted and sampled. Usually, in a metric formula that uses retire_latency,
> > > > > > it would already require to count the event. As a result, we will endup count the
> > > > > > same event twice. This should be able to be handled properly when we finalize our
> > > > > > design on evsel R modifier support.
> > > > > >
> > > > > >  - Move TPEBS specific code out from main perf stat code to separate files in
> > > > > > util/intel-tpebs.c and util/intel-tpebs.h. [Namhyung]
> > > > > >  - Use --control:fifo to ack perf stat from forked perf record instead of sleep(2) [Namhyung]
> > > > > >  - Add introductions about TPEBS and R modifier in Documents. [Namhyung]
> > > > > >
> > > > > >
> > > > > > Changes in v7:
> > > > > >  - Update code and comments for better code quality [Namhyung]
> > > > > >  - Add a separate commit for perf data [Namhyung]
> > > > > >  - Update retire latency print function to improve alignment [Namhyung]
> > > > > >
> > > > > > Changes in v6:
> > > > > >  - Update code and add comments for better code quality [Namhyung]
> > > > > >  - Remove the added fd var and directly pass the opened fd to data.file.fd [Namhyung]
> > > > > >  - Add kill() to stop perf record when perf stat exists early [Namhyung]
> > > > > >  - Add command opt check to ensure only start perf record when -a/-C given [Namhyung]
> > > > > >  - Squash commits [Namhyung]
> > > > > >
> > > > > > Changes in v5:
> > > > > >  - Update code and add comments for better code quality [Ian]
> > > > > >
> > > > > > Changes in v4:
> > > > > >  - Remove uncessary debug print and update code and comments for better
> > > > > > readability and quality [Namhyung]
> > > > > >  - Update mtl metric json file with consistent TmaL1 and TopdownL1 metricgroup
> > > > > >
> > > > > > Changes in v3:
> > > > > >  - Remove ':' when event name has '@' [Ian]
> > > > > >  - Use 'R' as the modifier instead of "retire_latency" [Ian]
> > > > > >
> > > > > > Changes in v2:
> > > > > >  - Add MTL metric file
> > > > > >  - Add more descriptions and example to the patch [Arnaldo]
> > > > > >
> > > > > > Here is an example of running perf stat to collect a metric that uses
> > > > > > retire_latency value of event MEM_INST_RETIRED.STLB_HIT_STORES on a MTL system.
> > > > > >
> > > > > > In this simple example, there is no MEM_INST_RETIRED.STLB_HIT_STORES sample.
> > > > > > Therefore, the MEM_INST_RETIRED.STLB_HIT_STORES:p count and retire_latency value
> > > > > > are all 0.
> > > > > >
> > > > > > ./perf stat -M tma_dtlb_store -a -- sleep 1
> > > > > >
> > > > > > [ perf record: Woken up 1 times to write data ]
> > > > > > [ perf record: Captured and wrote 0.000 MB - ]
> > > > > >
> > > > > >  Performance counter stats for 'system wide':
> > > > > >
> > > > > >        181,047,168      cpu_core/TOPDOWN.SLOTS/          #      0.6 %  tma_dtlb_store
> > > > > >          3,195,608      cpu_core/topdown-retiring/
> > > > > >         40,156,649      cpu_core/topdown-mem-bound/
> > > > > >          3,550,925      cpu_core/topdown-bad-spec/
> > > > > >        117,571,818      cpu_core/topdown-fe-bound/
> > > > > >         57,118,087      cpu_core/topdown-be-bound/
> > > > > >             69,179      cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/
> > > > > >              4,582      cpu_core/MEM_INST_RETIRED.STLB_HIT_STORES/
> > > > > >         30,183,104      cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/
> > > > > >         30,556,790      cpu_core/CPU_CLK_UNHALTED.THREAD/
> > > > > >            168,486      cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/
> > > > > >               0.00 MEM_INST_RETIRED.STLB_HIT_STORES:p       0        0
> > > > > >
> > > > > >        1.003105924 seconds time elapsed
> > > > > >
> > > > > > v1:
> > > > > > TPEBS is one of the features provided by the next generation of Intel PMU.
> > > > > > Please refer to Section 8.4.1 of "Intel® Architecture Instruction Set Extensions
> > > > > > Programming Reference" [1] for more details about this feature.
> > > > > >
> > > > > > This set of patches supports TPEBS in counting mode. The code works in the
> > > > > > following way: it forks a perf record process from perf stat when retire_latency
> > > > > > of one or more events are used in a metric formula. Perf stat would send a
> > > > > > SIGTERM signal to perf record before it needs the retire latency value for
> > > > > > metric calculation. Perf stat will then process sample data to extract the
> > > > > > retire latency data for metric calculations. Currently, the code uses the
> > > > > > arithmetic average of retire latency values.
> > > > > >
> > > > > > [1] https://www.intel.com/content/www/us/en/content-details/812218/intel-architecture-instruction-set-extensions-programming-reference.html?wapkw=future%20features
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Ian Rogers (1):
> > > > > >   perf parse-events: Add a retirement latency modifier
> > > > > >
> > > > > > Weilin Wang (7):
> > > > > >   perf data: Allow to use given fd in data->file.fd
> > > > > >   perf stat: Fork and launch perf record when perf stat needs to get
> > > > > >     retire latency value for a metric.
> > > > > >   perf stat: Plugin retire_lat value from sampled data to evsel
> > > > > >   perf vendor events intel: Add MTL metric json files
> > > > > >   perf stat: Add command line option for enabling tpebs recording
> > > > > >   perf Document: Add TPEBS to Documents
> > > > > >   perf test: Add test for Intel TPEBS counting mode
> > > > >
> > > > > Thanks for your persistence!
> > > > >
> > > > > Reviewed-by: Namhyung Kim <namhyung@kernel.org>
> > > >
> > > > Ping.
> > >
> > > I guess Namhyung's reviewed-by should suffice, but since you're pinging
> > > and I saw previous comments about this serie, would it be possible to
> > > get your Reviewed-by as well?
> >
> > I'm happy with most of the changes and they can have my reviewed-by.
>
> Thanks.
>
> > I'm not happy with:
> > perf stat: Fork and launch 'perf record' when 'perf stat' needs to get
> > retire latency value for a metric.
>
> I don't like it either, but I couldn't find the time to engage in this
> discussion to try to offer an alternative patch and since Namhyung had
> provided his reviewed-by and you pinged me, I felt compelled to move
> forward on this.

I think we should be moving forward on this. I think the series had my
reviewed-by 9 months ago and this work is gating the bring up of Intel
models like Granite Rapids, it also adds metrics for Intel Meteorlake
which is Intel's laptop CPU of the day. After all fixes for the ARM
breakages, this is the 2nd biggest nuts thing we don't have landed.

> > so please leave it off there.
>
> So, what are you proposing, that we process up to some patch?

I was proposing just not having my tag on that patch. Perfect isn't
the enemy of good.

I don't want to be picky on other people's style, especially when it
brings big value as here. A low value series blocked on a style issue
due to 1 reviewer is:
https://lore.kernel.org/lkml/CAP-5=fW8YMivp1JW8vRKh=OZtGVLPQ4v88z_2ae+cAVnr7RXoQ@mail.gmail.com/
I'm happy to give style feedback but generally try to avoid it
blocking merging. In this case I even wrote the patches doing what I'd
suggested.

> For now I'm removing the whole series from perf-tools-next.

Ugh.

Thanks,
Ian

> - Arnaldo
>
> > Context: originally the patches aimed to abstract everything inside of
> > the evsel, now builtin-stat is manually starting and ending the
> > sub-process and has to be aware of TPEBS. There are ugly architectural
> > dependencies. My reviewed-by was correct on the earlier version, as
> > you could make a tpebs evsel just like any... other, open, close, read
> > it, etc. This would be useful say in a python script. This version to
> > my eyes is worse as all TPEBS dependencies are made explicit and the
> > caller needs to be aware of TPEBS. I understand Weilin was asked to
> > make some of the changes, and there was a desire for efficiency, but I
> > think things were better in earlier versions.
> >
> > Thanks,
> > Ian
Re: [RFC PATCH v18 0/8] TPEBS counting mode support
Posted by Ian Rogers 1 year, 4 months ago
On Tue, Aug 6, 2024 at 7:35 AM Ian Rogers <irogers@google.com> wrote:
> On Tue, Aug 6, 2024 at 6:32 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > For now I'm removing the whole series from perf-tools-next.
>
> Ugh.

Hi, what's the plan here? The topdown (aka TMA) metrics on recent
Intel CPUs (client and server) are dependent on timed PEBS. I think we
should land this series and move forward from there.

Thanks,
Ian
Re: [RFC PATCH v18 0/8] TPEBS counting mode support
Posted by Arnaldo Carvalho de Melo 1 year, 4 months ago
On Mon, Aug 12, 2024 at 08:38:06AM -0700, Ian Rogers wrote:
> On Tue, Aug 6, 2024 at 7:35 AM Ian Rogers <irogers@google.com> wrote:
> > On Tue, Aug 6, 2024 at 6:32 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > > For now I'm removing the whole series from perf-tools-next.
> >
> > Ugh.
> 
> Hi, what's the plan here? The topdown (aka TMA) metrics on recent
> Intel CPUs (client and server) are dependent on timed PEBS. I think we
> should land this series and move forward from there.

I'm taking this as an Acked-by, merged the 3rd and 4th patches to keep
bisectability and constified the process_sample_event() perf_tool
pointer, test building with containers now.

Will push to tmp.perf-tools-next while that runs.

- Arnaldo