[PATCH v4 0/8] perf report: Add latency and parallelism profiling

Dmitry Vyukov posted 8 patches 1 year ago
There is a newer version of this series
.../callchain-overhead-calculation.txt        |  5 +-
.../cpu-and-latency-overheads.txt             | 85 ++++++++++++++++++
tools/perf/Documentation/perf-report.txt      | 49 ++++++----
tools/perf/Documentation/tips.txt             |  3 +
tools/perf/builtin-record.c                   | 20 +++++
tools/perf/builtin-report.c                   | 39 ++++++++
tools/perf/ui/browsers/hists.c                | 27 +++---
tools/perf/ui/hist.c                          | 64 +++++++++----
tools/perf/util/addr_location.c               |  1 +
tools/perf/util/addr_location.h               |  7 +-
tools/perf/util/event.c                       | 11 +++
tools/perf/util/events_stats.h                |  2 +
tools/perf/util/hist.c                        | 90 +++++++++++++++----
tools/perf/util/hist.h                        | 32 +++++--
tools/perf/util/machine.c                     |  7 ++
tools/perf/util/machine.h                     |  6 ++
tools/perf/util/sample.h                      |  2 +-
tools/perf/util/session.c                     | 12 +++
tools/perf/util/session.h                     |  1 +
tools/perf/util/sort.c                        | 69 ++++++++++++--
tools/perf/util/sort.h                        |  3 +-
tools/perf/util/symbol.c                      | 34 +++++++
tools/perf/util/symbol_conf.h                 |  8 +-
23 files changed, 502 insertions(+), 75 deletions(-)
create mode 100644 tools/perf/Documentation/cpu-and-latency-overheads.txt
[PATCH v4 0/8] perf report: Add latency and parallelism profiling
Posted by Dmitry Vyukov 1 year ago
There are two notions of time: wall-clock time and CPU time.
For a single-threaded program, or a program running on a single-core
machine, these notions are the same. However, for a multi-threaded/
multi-process program running on a multi-core machine, these notions are
significantly different. Each second of wall-clock time we have
number-of-cores seconds of CPU time.

Currently perf only allows to profile CPU time. Perf (and all other
existing profilers to the be best of my knowledge) does not allow to
profile wall-clock time.

Optimizing CPU overhead is useful to improve 'throughput', while
optimizing wall-clock overhead is useful to improve 'latency'.
These profiles are complementary and are not interchangeable.
Examples of where latency profile is needed:
 - optimzing build latency
 - optimizing server request latency
 - optimizing ML training/inference latency
 - optimizing running time of any command line program

CPU profile is useless for these use cases at best (if a user understands
the difference), or misleading at worst (if a user tries to use a wrong
profile for a job).

This series add latency and parallelization profiling.
See the added documentation and flags descriptions for details.

Brief outline of the implementation:
 - add context switch collection during record
 - calculate number of threads running on CPUs (parallelism level)
   during report
 - divide each sample weight by the parallelism level
This effectively models that we were taking 1 sample per unit of
wall-clock time.

We still default to the CPU profile, so it's up to users to learn
about the second profiling mode and use it when appropriate.

Changes in v4:
 - added "Shrink struct hist_entry size" commit
 - rebased to perf-tools-next HEAD

Changes in v3:
 - rebase and split into patches
 - rename 'wallclock' to 'latency' everywhere
 - don't enable latency profiling by default,
   instead add record/report --latency flag

Dmitry Vyukov (8):
  perf report: Add machine parallelism
  perf report: Add parallelism sort key
  perf report: Switch filtered from u8 to u16
  perf report: Add parallelism filter
  perf report: Add latency output field
  perf report: Add --latency flag
  perf report: Add latency and parallelism profiling documentation
  perf hist: Shrink struct hist_entry size

 .../callchain-overhead-calculation.txt        |  5 +-
 .../cpu-and-latency-overheads.txt             | 85 ++++++++++++++++++
 tools/perf/Documentation/perf-report.txt      | 49 ++++++----
 tools/perf/Documentation/tips.txt             |  3 +
 tools/perf/builtin-record.c                   | 20 +++++
 tools/perf/builtin-report.c                   | 39 ++++++++
 tools/perf/ui/browsers/hists.c                | 27 +++---
 tools/perf/ui/hist.c                          | 64 +++++++++----
 tools/perf/util/addr_location.c               |  1 +
 tools/perf/util/addr_location.h               |  7 +-
 tools/perf/util/event.c                       | 11 +++
 tools/perf/util/events_stats.h                |  2 +
 tools/perf/util/hist.c                        | 90 +++++++++++++++----
 tools/perf/util/hist.h                        | 32 +++++--
 tools/perf/util/machine.c                     |  7 ++
 tools/perf/util/machine.h                     |  6 ++
 tools/perf/util/sample.h                      |  2 +-
 tools/perf/util/session.c                     | 12 +++
 tools/perf/util/session.h                     |  1 +
 tools/perf/util/sort.c                        | 69 ++++++++++++--
 tools/perf/util/sort.h                        |  3 +-
 tools/perf/util/symbol.c                      | 34 +++++++
 tools/perf/util/symbol_conf.h                 |  8 +-
 23 files changed, 502 insertions(+), 75 deletions(-)
 create mode 100644 tools/perf/Documentation/cpu-and-latency-overheads.txt


base-commit: 8ce0d2da14d3fb62844dd0e95982c194326b1a5f
-- 
2.48.1.362.g079036d154-goog
Re: [PATCH v4 0/8] perf report: Add latency and parallelism profiling
Posted by Namhyung Kim 1 year ago
Hello,

On Mon, Feb 03, 2025 at 03:30:35PM +0100, Dmitry Vyukov wrote:
> There are two notions of time: wall-clock time and CPU time.
> For a single-threaded program, or a program running on a single-core
> machine, these notions are the same. However, for a multi-threaded/
> multi-process program running on a multi-core machine, these notions are
> significantly different. Each second of wall-clock time we have
> number-of-cores seconds of CPU time.
> 
> Currently perf only allows to profile CPU time. Perf (and all other
> existing profilers to the be best of my knowledge) does not allow to
> profile wall-clock time.
> 
> Optimizing CPU overhead is useful to improve 'throughput', while
> optimizing wall-clock overhead is useful to improve 'latency'.
> These profiles are complementary and are not interchangeable.
> Examples of where latency profile is needed:
>  - optimzing build latency
>  - optimizing server request latency
>  - optimizing ML training/inference latency
>  - optimizing running time of any command line program
> 
> CPU profile is useless for these use cases at best (if a user understands
> the difference), or misleading at worst (if a user tries to use a wrong
> profile for a job).
> 
> This series add latency and parallelization profiling.
> See the added documentation and flags descriptions for details.
> 
> Brief outline of the implementation:
>  - add context switch collection during record
>  - calculate number of threads running on CPUs (parallelism level)
>    during report
>  - divide each sample weight by the parallelism level
> This effectively models that we were taking 1 sample per unit of
> wall-clock time.
> 
> We still default to the CPU profile, so it's up to users to learn
> about the second profiling mode and use it when appropriate.
> 
> Changes in v4:
>  - added "Shrink struct hist_entry size" commit
>  - rebased to perf-tools-next HEAD
> 
> Changes in v3:
>  - rebase and split into patches
>  - rename 'wallclock' to 'latency' everywhere
>  - don't enable latency profiling by default,
>    instead add record/report --latency flag
> 
> Dmitry Vyukov (8):
>   perf report: Add machine parallelism
>   perf report: Add parallelism sort key
>   perf report: Switch filtered from u8 to u16
>   perf report: Add parallelism filter
>   perf report: Add latency output field
>   perf report: Add --latency flag
>   perf report: Add latency and parallelism profiling documentation
>   perf hist: Shrink struct hist_entry size

I found that stdio output is broken that latency is too high.  TUI is fine.

  $ ./perf report --latency -s dso --stdio
  ...
  #
  #  Latency  Overhead  Shared Object       
  # ........  ........  ....................
  #
     57669.15%    57.67%  [kernel.kallsyms]   
     42330.85%    42.33%  ld-linux-x86-64.so.2

Thanks,
Namhyung

> 
>  .../callchain-overhead-calculation.txt        |  5 +-
>  .../cpu-and-latency-overheads.txt             | 85 ++++++++++++++++++
>  tools/perf/Documentation/perf-report.txt      | 49 ++++++----
>  tools/perf/Documentation/tips.txt             |  3 +
>  tools/perf/builtin-record.c                   | 20 +++++
>  tools/perf/builtin-report.c                   | 39 ++++++++
>  tools/perf/ui/browsers/hists.c                | 27 +++---
>  tools/perf/ui/hist.c                          | 64 +++++++++----
>  tools/perf/util/addr_location.c               |  1 +
>  tools/perf/util/addr_location.h               |  7 +-
>  tools/perf/util/event.c                       | 11 +++
>  tools/perf/util/events_stats.h                |  2 +
>  tools/perf/util/hist.c                        | 90 +++++++++++++++----
>  tools/perf/util/hist.h                        | 32 +++++--
>  tools/perf/util/machine.c                     |  7 ++
>  tools/perf/util/machine.h                     |  6 ++
>  tools/perf/util/sample.h                      |  2 +-
>  tools/perf/util/session.c                     | 12 +++
>  tools/perf/util/session.h                     |  1 +
>  tools/perf/util/sort.c                        | 69 ++++++++++++--
>  tools/perf/util/sort.h                        |  3 +-
>  tools/perf/util/symbol.c                      | 34 +++++++
>  tools/perf/util/symbol_conf.h                 |  8 +-
>  23 files changed, 502 insertions(+), 75 deletions(-)
>  create mode 100644 tools/perf/Documentation/cpu-and-latency-overheads.txt
> 
> 
> base-commit: 8ce0d2da14d3fb62844dd0e95982c194326b1a5f
> -- 
> 2.48.1.362.g079036d154-goog
>
Re: [PATCH v4 0/8] perf report: Add latency and parallelism profiling
Posted by Dmitry Vyukov 1 year ago
On Tue, 4 Feb 2025 at 05:21, Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> On Mon, Feb 03, 2025 at 03:30:35PM +0100, Dmitry Vyukov wrote:
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. Each second of wall-clock time we have
> > number-of-cores seconds of CPU time.
> >
> > Currently perf only allows to profile CPU time. Perf (and all other
> > existing profilers to the be best of my knowledge) does not allow to
> > profile wall-clock time.
> >
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where latency profile is needed:
> >  - optimzing build latency
> >  - optimizing server request latency
> >  - optimizing ML training/inference latency
> >  - optimizing running time of any command line program
> >
> > CPU profile is useless for these use cases at best (if a user understands
> > the difference), or misleading at worst (if a user tries to use a wrong
> > profile for a job).
> >
> > This series add latency and parallelization profiling.
> > See the added documentation and flags descriptions for details.
> >
> > Brief outline of the implementation:
> >  - add context switch collection during record
> >  - calculate number of threads running on CPUs (parallelism level)
> >    during report
> >  - divide each sample weight by the parallelism level
> > This effectively models that we were taking 1 sample per unit of
> > wall-clock time.
> >
> > We still default to the CPU profile, so it's up to users to learn
> > about the second profiling mode and use it when appropriate.
> >
> > Changes in v4:
> >  - added "Shrink struct hist_entry size" commit
> >  - rebased to perf-tools-next HEAD
> >
> > Changes in v3:
> >  - rebase and split into patches
> >  - rename 'wallclock' to 'latency' everywhere
> >  - don't enable latency profiling by default,
> >    instead add record/report --latency flag
> >
> > Dmitry Vyukov (8):
> >   perf report: Add machine parallelism
> >   perf report: Add parallelism sort key
> >   perf report: Switch filtered from u8 to u16
> >   perf report: Add parallelism filter
> >   perf report: Add latency output field
> >   perf report: Add --latency flag
> >   perf report: Add latency and parallelism profiling documentation
> >   perf hist: Shrink struct hist_entry size
>
> I found that stdio output is broken that latency is too high.  TUI is fine.
>
>   $ ./perf report --latency -s dso --stdio
>   ...
>   #
>   #  Latency  Overhead  Shared Object
>   # ........  ........  ....................
>   #
>      57669.15%    57.67%  [kernel.kallsyms]
>      42330.85%    42.33%  ld-linux-x86-64.so.2

Fixed in v5.

> Thanks,
> Namhyung
>
> >
> >  .../callchain-overhead-calculation.txt        |  5 +-
> >  .../cpu-and-latency-overheads.txt             | 85 ++++++++++++++++++
> >  tools/perf/Documentation/perf-report.txt      | 49 ++++++----
> >  tools/perf/Documentation/tips.txt             |  3 +
> >  tools/perf/builtin-record.c                   | 20 +++++
> >  tools/perf/builtin-report.c                   | 39 ++++++++
> >  tools/perf/ui/browsers/hists.c                | 27 +++---
> >  tools/perf/ui/hist.c                          | 64 +++++++++----
> >  tools/perf/util/addr_location.c               |  1 +
> >  tools/perf/util/addr_location.h               |  7 +-
> >  tools/perf/util/event.c                       | 11 +++
> >  tools/perf/util/events_stats.h                |  2 +
> >  tools/perf/util/hist.c                        | 90 +++++++++++++++----
> >  tools/perf/util/hist.h                        | 32 +++++--
> >  tools/perf/util/machine.c                     |  7 ++
> >  tools/perf/util/machine.h                     |  6 ++
> >  tools/perf/util/sample.h                      |  2 +-
> >  tools/perf/util/session.c                     | 12 +++
> >  tools/perf/util/session.h                     |  1 +
> >  tools/perf/util/sort.c                        | 69 ++++++++++++--
> >  tools/perf/util/sort.h                        |  3 +-
> >  tools/perf/util/symbol.c                      | 34 +++++++
> >  tools/perf/util/symbol_conf.h                 |  8 +-
> >  23 files changed, 502 insertions(+), 75 deletions(-)
> >  create mode 100644 tools/perf/Documentation/cpu-and-latency-overheads.txt
> >
> >
> > base-commit: 8ce0d2da14d3fb62844dd0e95982c194326b1a5f
> > --
> > 2.48.1.362.g079036d154-goog
> >