[PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Dmitry Vyukov 1 year ago
There are two notions of time: wall-clock time and CPU time.
For a single-threaded program, or a program running on a single-core
machine, these notions are the same. However, for a multi-threaded/
multi-process program running on a multi-core machine, these notions are
significantly different. Each second of wall-clock time we have
number-of-cores seconds of CPU time.
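For example, a program that keeps 8 cores busy for 10 seconds of
wall-clock time consumes roughly 80 seconds of CPU time.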

Currently perf only allows profiling of CPU time. Perf (and, to the best
of my knowledge, all other existing profilers) does not allow profiling
of wall-clock time.

Optimizing CPU overhead is useful to improve 'throughput', while
optimizing wall-clock overhead is useful to improve 'latency'.
These profiles are complementary and are not interchangeable.
Examples of where a latency profile is needed:
 - optimizing build latency
 - optimizing server request latency
 - optimizing ML training/inference latency
 - optimizing running time of any command line program

A CPU profile is useless for these use cases at best (if a user understands
the difference), or misleading at worst (if a user tries to use the wrong
profile for the job).

This series adds latency and parallelism profiling.
See the added documentation and flags descriptions for details.

Brief outline of the implementation:
 - add context switch collection during record
 - calculate number of threads running on CPUs (parallelism level)
   during report
 - divide each sample weight by the parallelism level
This effectively models taking one sample per unit of wall-clock time.
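For example (a sketch of the model, not the exact patch code): a sample
taken while 4 threads of the profiled workload are running on CPUs
contributes 1/4 of its weight to the latency column, whereas a sample
taken during a single-threaded phase contributes its full weight, so
poorly parallelized phases dominate the latency profile even when their
CPU overhead is modest.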

We still default to the CPU profile, so it's up to users to learn
about the second profiling mode and use it when appropriate.
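For example, with this series applied, an opt-in latency profile is
collected and viewed roughly as follows (using the --latency flags this
series adds; the workload is only an illustration):

  perf record --latency -- make -j8
  perf report --latency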

Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Changes in v5:
 - fixed formatting of the latency field in --stdio mode
 - added a description of the --latency flag to the documentation

Changes in v4:
 - added "Shrink struct hist_entry size" commit
 - rebased to perf-tools-next HEAD

Changes in v3:
 - rebase and split into patches
 - rename 'wallclock' to 'latency' everywhere
 - don't enable latency profiling by default,
   instead add record/report --latency flag

Dmitry Vyukov (8):
  perf report: Add machine parallelism
  perf report: Add parallelism sort key
  perf report: Switch filtered from u8 to u16
  perf report: Add parallelism filter
  perf report: Add latency output field
  perf report: Add --latency flag
  perf report: Add latency and parallelism profiling documentation
  perf hist: Shrink struct hist_entry size

 .../callchain-overhead-calculation.txt        |   5 +-
 .../cpu-and-latency-overheads.txt             |  85 ++++++++++++++
 tools/perf/Documentation/perf-record.txt      |   4 +
 tools/perf/Documentation/perf-report.txt      |  54 ++++++---
 tools/perf/Documentation/tips.txt             |   3 +
 tools/perf/builtin-record.c                   |  20 ++++
 tools/perf/builtin-report.c                   |  39 +++++++
 tools/perf/ui/browsers/hists.c                |  27 +++--
 tools/perf/ui/hist.c                          | 104 ++++++++++++------
 tools/perf/util/addr_location.c               |   1 +
 tools/perf/util/addr_location.h               |   7 +-
 tools/perf/util/event.c                       |  11 ++
 tools/perf/util/events_stats.h                |   2 +
 tools/perf/util/hist.c                        |  90 ++++++++++++---
 tools/perf/util/hist.h                        |  32 +++++-
 tools/perf/util/machine.c                     |   7 ++
 tools/perf/util/machine.h                     |   6 +
 tools/perf/util/sample.h                      |   2 +-
 tools/perf/util/session.c                     |  12 ++
 tools/perf/util/session.h                     |   1 +
 tools/perf/util/sort.c                        |  69 +++++++++++-
 tools/perf/util/sort.h                        |   3 +-
 tools/perf/util/symbol.c                      |  34 ++++++
 tools/perf/util/symbol_conf.h                 |   8 +-
 24 files changed, 531 insertions(+), 95 deletions(-)
 create mode 100644 tools/perf/Documentation/cpu-and-latency-overheads.txt


base-commit: 8ce0d2da14d3fb62844dd0e95982c194326b1a5f
-- 
2.48.1.362.g079036d154-goog
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Andi Kleen 1 year ago
Dmitry Vyukov <dvyukov@google.com> writes:

> There are two notions of time: wall-clock time and CPU time.
> For a single-threaded program, or a program running on a single-core
> machine, these notions are the same. However, for a multi-threaded/
> multi-process program running on a multi-core machine, these notions are
> significantly different. Each second of wall-clock time we have
> number-of-cores seconds of CPU time.

I'm curious how this interacts with the time / --time-quantum sort key?

I assume it just works, but might be worth checking.

It was intended to address some of these issues too.

> Optimizing CPU overhead is useful to improve 'throughput', while
> optimizing wall-clock overhead is useful to improve 'latency'.
> These profiles are complementary and are not interchangeable.
> Examples of where latency profile is needed:
>  - optimzing build latency
>  - optimizing server request latency
>  - optimizing ML training/inference latency
>  - optimizing running time of any command line program
>
> CPU profile is useless for these use cases at best (if a user understands
> the difference), or misleading at worst (if a user tries to use a wrong
> profile for a job).

I would agree in the general case, but not if the time sort key
is chosen with a suitable quantum. You can see how the parallelism
changes over time then, which is often a good enough proxy.


> We still default to the CPU profile, so it's up to users to learn
> about the second profiling mode and use it when appropriate.

You should add it to tips.txt then.

>  .../callchain-overhead-calculation.txt        |   5 +-
>  .../cpu-and-latency-overheads.txt             |  85 ++++++++++++++
>  tools/perf/Documentation/perf-record.txt      |   4 +
>  tools/perf/Documentation/perf-report.txt      |  54 ++++++---
>  tools/perf/Documentation/tips.txt             |   3 +
>  tools/perf/builtin-record.c                   |  20 ++++
>  tools/perf/builtin-report.c                   |  39 +++++++
>  tools/perf/ui/browsers/hists.c                |  27 +++--
>  tools/perf/ui/hist.c                          | 104 ++++++++++++------
>  tools/perf/util/addr_location.c               |   1 +
>  tools/perf/util/addr_location.h               |   7 +-
>  tools/perf/util/event.c                       |  11 ++
>  tools/perf/util/events_stats.h                |   2 +
>  tools/perf/util/hist.c                        |  90 ++++++++++++---
>  tools/perf/util/hist.h                        |  32 +++++-
>  tools/perf/util/machine.c                     |   7 ++
>  tools/perf/util/machine.h                     |   6 +
>  tools/perf/util/sample.h                      |   2 +-
>  tools/perf/util/session.c                     |  12 ++
>  tools/perf/util/session.h                     |   1 +
>  tools/perf/util/sort.c                        |  69 +++++++++++-
>  tools/perf/util/sort.h                        |   3 +-
>  tools/perf/util/symbol.c                      |  34 ++++++
>  tools/perf/util/symbol_conf.h                 |   8 +-

We traditionally didn't do it, but in general test coverage
of perf report is too low, so I would recommend adding a simple
test case to the perf test scripts.

-Andi
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Dmitry Vyukov 1 year ago
On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
>
> Dmitry Vyukov <dvyukov@google.com> writes:
>
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. Each second of wall-clock time we have
> > number-of-cores seconds of CPU time.
>
> I'm curious how does this interact with the time / --time-quantum sort key?
>
> I assume it just works, but might be worth checking.

Yes, it seems to just work as one would assume. Things just combine as intended.

> It was intended to address some of these issues too.

What issue? Latency profiling? I wonder what approach you had in mind?

> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where latency profile is needed:
> >  - optimzing build latency
> >  - optimizing server request latency
> >  - optimizing ML training/inference latency
> >  - optimizing running time of any command line program
> >
> > CPU profile is useless for these use cases at best (if a user understands
> > the difference), or misleading at worst (if a user tries to use a wrong
> > profile for a job).
>
> I would agree in the general case, but not if the time sort key
> is chosen with a suitable quantum. You can see how the parallelism
> changes over time then which is often a good enough proxy.

That's an interesting feature, but I don't see how it helps with latency.

How do you infer parallelism for slices? It looks like it just gives
the same wrong CPU profile, but multiple times (for each slice).

Also (1) the user still needs to understand that the default profile is wrong,
(2) be proficient with perf features, (3) manually aggregate lots of
data (time slicing increases the amount of data in the profile X times),
(4) deal with inaccuracy caused by edge effects (e.g. the slice is 1s, but
the program phase changed mid-second).

But it does open some interesting capabilities in combination with a
latency profile, e.g. the following shows how parallelism was changing
over time.

For a profile of a perf build (make):

perf report -F time,latency,parallelism --time-quantum=1s

# Time           Latency  Parallelism
# ............  ........  ...........
#
  1795957.0000     1.42%            1
  1795957.0000     0.07%            2
  1795957.0000     0.01%            3
  1795957.0000     0.00%            4

  1795958.0000     4.82%            1
  1795958.0000     0.11%            2
  1795958.0000     0.00%            3
...
  1795964.0000     1.76%            2
  1795964.0000     0.58%            4
  1795964.0000     0.45%            1
  1795964.0000     0.23%           10
  1795964.0000     0.21%            6

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

Here it finally started running on more than 1 CPU.

> > We still default to the CPU profile, so it's up to users to learn
> > about the second profiling mode and use it when appropriate.
>
> You should add it to tips.txt then
>
> >  .../callchain-overhead-calculation.txt        |   5 +-
> >  .../cpu-and-latency-overheads.txt             |  85 ++++++++++++++
> >  tools/perf/Documentation/perf-record.txt      |   4 +
> >  tools/perf/Documentation/perf-report.txt      |  54 ++++++---
> >  tools/perf/Documentation/tips.txt             |   3 +
> >  tools/perf/builtin-record.c                   |  20 ++++
> >  tools/perf/builtin-report.c                   |  39 +++++++
> >  tools/perf/ui/browsers/hists.c                |  27 +++--
> >  tools/perf/ui/hist.c                          | 104 ++++++++++++------
> >  tools/perf/util/addr_location.c               |   1 +
> >  tools/perf/util/addr_location.h               |   7 +-
> >  tools/perf/util/event.c                       |  11 ++
> >  tools/perf/util/events_stats.h                |   2 +
> >  tools/perf/util/hist.c                        |  90 ++++++++++++---
> >  tools/perf/util/hist.h                        |  32 +++++-
> >  tools/perf/util/machine.c                     |   7 ++
> >  tools/perf/util/machine.h                     |   6 +
> >  tools/perf/util/sample.h                      |   2 +-
> >  tools/perf/util/session.c                     |  12 ++
> >  tools/perf/util/session.h                     |   1 +
> >  tools/perf/util/sort.c                        |  69 +++++++++++-
> >  tools/perf/util/sort.h                        |   3 +-
> >  tools/perf/util/symbol.c                      |  34 ++++++
> >  tools/perf/util/symbol_conf.h                 |   8 +-
>
> We traditionally didn't do it, but in general test coverage
> of perf report is too low, so I would recommend to add some simple
> test case in the perf test scripts.
>
> -Andi
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Andi Kleen 1 year ago
> > I assume it just works, but might be worth checking.
> 
> Yes, it seems to just work as one would assume. Things just combine as intended.

Great.

> 
> > It was intended to address some of these issues too.
> 
> What issue? Latency profiling? I wonder what approach you had in mind?

The problem with gaps in parallelism is usually how things change
over time. If you have e.g. idle periods, they tend to look different
in the profile. With the full aggregation you don't see it, but with
a time series it tends to stand out.

But yes, that approach usually only works for large gaps. I guess
yours is better for more fine-grained issues.

And I guess it might not be the most intuitive for less experienced
users.

This is BTW actually a case for using a perf data GUI like hotspot or
vtune, which can visualize this better and lets you zoom arbitrarily.
Standard perf only has timechart for it, but it's a bit clunky to use
and only shows the reschedules.
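For reference, the usual timechart flow is roughly:

  perf timechart record -- <workload>
  perf timechart   # writes output.svg by default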

> Also (1) user still needs to understand the default profile is wrong,
> (2) be proficient with perf features, (3) manually aggregate lots of
> data (time slicing increases amount of data in the profile X times),
> (4) deal with inaccuracy caused by edge effects (e.g. slice is 1s, but
> program phase changed mid-second).

If you're lucky and the problem is not long-tail, you can use a high
percentage cut-off (--percent-limit) to eliminate most of the data.

Then you just have "top-N functions over time", which tends to be quite
readable. One drawback of that approach is that it doesn't show
the "other", but perhaps we'll fix that one day.
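For example, something like:

  perf report --sort time,overhead,sym --percent-limit 5

drops entries below 5% overhead.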

But yes, the fact that perf has too many options and is not intuitive, and
that most people miss most of the features, is an inherent problem. I don't
have a good solution for that unfortunately, other than perhaps better
documentation.

> 
> But it does open some interesting capabilities in combination with a
> latency profile, e.g. the following shows how parallelism was changing
> over time.
> 
> for perf make profile:

Very nice! Looks useful.

Perhaps add that variant to tips.txt too.

> 
> perf report -F time,latency,parallelism --time-quantum=1s
> 
> # Time           Latency  Parallelism
> # ............  ........  ...........
> #
>   1795957.0000     1.42%            1
>   1795957.0000     0.07%            2
>   1795957.0000     0.01%            3
>   1795957.0000     0.00%            4
> 
>   1795958.0000     4.82%            1
>   1795958.0000     0.11%            2
>   1795958.0000     0.00%            3
> ...
>   1795964.0000     1.76%            2
>   1795964.0000     0.58%            4
>   1795964.0000     0.45%            1
>   1795964.0000     0.23%           10
>   1795964.0000     0.21%            6
> 
> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
> 
> Here it finally started running on more than 1 CPU.


-Andi
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Dmitry Vyukov 12 months ago
On Fri, 7 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
>
> > > I assume it just works, but might be worth checking.
> >
> > Yes, it seems to just work as one would assume. Things just combine as intended.
>
> Great.
>
> >
> > > It was intended to address some of these issues too.
> >
> > What issue? Latency profiling? I wonder what approach you had in mind?
>
> The problem with gaps in parallelism is usually how things change
> over time. If you have e.g. idle periods they tend to look different
> in the profile. with the full aggregation you don't see it, but with
> a time series it tends to stand out.
>
> But yes that approach usually only works for large gaps. I guess
> yours is better for more fine grained issues.
>
> And I guess it might not be the most intutive for less experienced
> users.
>
> This is BTW actually a case for using a perf data GUI like hotspot or
> vtune which can visualize this better and you can zoom arbitrarily.
> Standard perf has only timechart for it, but it's a bit clunky to use
> and only shows the reschedules.
>
> > Also (1) user still needs to understand the default profile is wrong,
> > (2) be proficient with perf features, (3) manually aggregate lots of
> > data (time slicing increases amount of data in the profile X times),
> > (4) deal with inaccuracy caused by edge effects (e.g. slice is 1s, but
> > program phase changed mid-second).
>
> If you're lucky and the problem is not long tail you can use a high
> percentage cut off (--percent-limit) to eliminate most of the data.
>
> Then you just have "topN functions over time" which tends to be quite
> readable. One drawback of that approach is that it doesn't show
> the "other", but perhaps we'll fix that one day.
>
> But yes that perf has too many options and is not intuitive and most
> people miss most of the features is an inherent problem. I don't have
> a good solution for that unfortunately, other than perhaps better
> documentation.

I don't think this is a solution :(

I provided lots of rationale in this patch series for making latency
profiling enabled by default for exactly this reason. If we just capture
context switches, then we can show both overhead and latency; even if
we sort by overhead by default, people would still see the latency
column and might start thinking/asking questions.
But this is not happening, so mostly only people on this thread will know about it :)


> > But it does open some interesting capabilities in combination with a
> > latency profile, e.g. the following shows how parallelism was changing
> > over time.
> >
> > for perf make profile:
>
> Very nice! Looks useful.
>
> Perhaps add that variant to tips.txt too.

That's a good idea.
I am waiting for other feedback to not resend the series just because of this.


> > perf report -F time,latency,parallelism --time-quantum=1s
> >
> > # Time           Latency  Parallelism
> > # ............  ........  ...........
> > #
> >   1795957.0000     1.42%            1
> >   1795957.0000     0.07%            2
> >   1795957.0000     0.01%            3
> >   1795957.0000     0.00%            4
> >
> >   1795958.0000     4.82%            1
> >   1795958.0000     0.11%            2
> >   1795958.0000     0.00%            3
> > ...
> >   1795964.0000     1.76%            2
> >   1795964.0000     0.58%            4
> >   1795964.0000     0.45%            1
> >   1795964.0000     0.23%           10
> >   1795964.0000     0.21%            6
> >
> > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
> >
> > Here it finally started running on more than 1 CPU.
>
>
> -Andi
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Dmitry Vyukov 11 months, 4 weeks ago
On Mon, 10 Feb 2025 at 08:17, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Fri, 7 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
> >
> > > > I assume it just works, but might be worth checking.
> > >
> > > Yes, it seems to just work as one would assume. Things just combine as intended.
> >
> > Great.
> >
> > >
> > > > It was intended to address some of these issues too.
> > >
> > > What issue? Latency profiling? I wonder what approach you had in mind?
> >
> > The problem with gaps in parallelism is usually how things change
> > over time. If you have e.g. idle periods they tend to look different
> > in the profile. with the full aggregation you don't see it, but with
> > a time series it tends to stand out.
> >
> > But yes that approach usually only works for large gaps. I guess
> > yours is better for more fine grained issues.
> >
> > And I guess it might not be the most intutive for less experienced
> > users.
> >
> > This is BTW actually a case for using a perf data GUI like hotspot or
> > vtune which can visualize this better and you can zoom arbitrarily.
> > Standard perf has only timechart for it, but it's a bit clunky to use
> > and only shows the reschedules.
> >
> > > Also (1) user still needs to understand the default profile is wrong,
> > > (2) be proficient with perf features, (3) manually aggregate lots of
> > > data (time slicing increases amount of data in the profile X times),
> > > (4) deal with inaccuracy caused by edge effects (e.g. slice is 1s, but
> > > program phase changed mid-second).
> >
> > If you're lucky and the problem is not long tail you can use a high
> > percentage cut off (--percent-limit) to eliminate most of the data.
> >
> > Then you just have "topN functions over time" which tends to be quite
> > readable. One drawback of that approach is that it doesn't show
> > the "other", but perhaps we'll fix that one day.
> >
> > But yes that perf has too many options and is not intuitive and most
> > people miss most of the features is an inherent problem. I don't have
> > a good solution for that unfortunately, other than perhaps better
> > documentation.
>
> I don't think this is a solution :(
>
> I provided lots of rationale for making this latency profiling enabled
> by default in this patch series for this reason. If we just capture
> context switches, then we can show both overhead and latency, even if
> we sort by overhead by default, people would still see the latency
> column and may start thinking/asking questions.
> But this is not happening, so mostly people on this thread will know about it :)
>
>
> > > But it does open some interesting capabilities in combination with a
> > > latency profile, e.g. the following shows how parallelism was changing
> > > over time.
> > >
> > > for perf make profile:
> >
> > Very nice! Looks useful.
> >
> > Perhaps add that variant to tips.txt too.

Now done in v7

> That's a good idea.
> I am waiting for other feedback to not resend the series just because of this.
>
>
> > > perf report -F time,latency,parallelism --time-quantum=1s
> > >
> > > # Time           Latency  Parallelism
> > > # ............  ........  ...........
> > > #
> > >   1795957.0000     1.42%            1
> > >   1795957.0000     0.07%            2
> > >   1795957.0000     0.01%            3
> > >   1795957.0000     0.00%            4
> > >
> > >   1795958.0000     4.82%            1
> > >   1795958.0000     0.11%            2
> > >   1795958.0000     0.00%            3
> > > ...
> > >   1795964.0000     1.76%            2
> > >   1795964.0000     0.58%            4
> > >   1795964.0000     0.45%            1
> > >   1795964.0000     0.23%           10
> > >   1795964.0000     0.21%            6
> > >
> > > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
> > >
> > > Here it finally started running on more than 1 CPU.
> >
> >
> > -Andi
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Andi Kleen 12 months ago
> > But yes that perf has too many options and is not intuitive and most
> > people miss most of the features is an inherent problem. I don't have
> > a good solution for that unfortunately, other than perhaps better
> > documentation.
> 
> I don't think this is a solution :(
> 
> I provided lots of rationale for making this latency profiling enabled
> by default in this patch series for this reason. If we just capture
> context switches, then we can show both overhead and latency, even if
> we sort by overhead by default, people would still see the latency
> column and may start thinking/asking questions.
> But this is not happening, so mostly people on this thread will know about it :)

Maybe something that could be done is to have some higher-level
configurations for perf report that are easier to understand.

This kind of already exists, e.g. perf mem report is just a
wrapper around perf report with some magic flags.

In this case you could have perf latency report (or maybe some other
syntax like perf report --mode latency).
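For illustration only (hypothetical syntax, not an existing command), such
a wrapper could expand to something along the lines of:

  perf report --latency -s parallelism,comm,sym

much as perf mem report presets flags for perf report.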

There are a few others that would make sense, like basic time series
or disabling children aggregation.

Another possibility would be to find a heuristic where perf report
can detect that a latency problem might be there (e.g. varying
usage of CPUs) and suggest it automatically.

-Andi
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Dmitry Vyukov 1 year ago
On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
>
> Dmitry Vyukov <dvyukov@google.com> writes:
>
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. Each second of wall-clock time we have
> > number-of-cores seconds of CPU time.
>
> I'm curious how does this interact with the time / --time-quantum sort key?
>
> I assume it just works, but might be worth checking.

I will check later. But if you have some concrete commands to try, that
will help. I have never used --time-quantum before.


> It was intended to address some of these issues too.
>
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where latency profile is needed:
> >  - optimzing build latency
> >  - optimizing server request latency
> >  - optimizing ML training/inference latency
> >  - optimizing running time of any command line program
> >
> > CPU profile is useless for these use cases at best (if a user understands
> > the difference), or misleading at worst (if a user tries to use a wrong
> > profile for a job).
>
> I would agree in the general case, but not if the time sort key
> is chosen with a suitable quantum. You can see how the parallelism
> changes over time then which is often a good enough proxy.

Never used it. I will look at what capabilities it provides.

> > We still default to the CPU profile, so it's up to users to learn
> > about the second profiling mode and use it when appropriate.
>
> You should add it to tips.txt then

It is done in the docs patch.

> >  .../callchain-overhead-calculation.txt        |   5 +-
> >  .../cpu-and-latency-overheads.txt             |  85 ++++++++++++++
> >  tools/perf/Documentation/perf-record.txt      |   4 +
> >  tools/perf/Documentation/perf-report.txt      |  54 ++++++---
> >  tools/perf/Documentation/tips.txt             |   3 +
> >  tools/perf/builtin-record.c                   |  20 ++++
> >  tools/perf/builtin-report.c                   |  39 +++++++
> >  tools/perf/ui/browsers/hists.c                |  27 +++--
> >  tools/perf/ui/hist.c                          | 104 ++++++++++++------
> >  tools/perf/util/addr_location.c               |   1 +
> >  tools/perf/util/addr_location.h               |   7 +-
> >  tools/perf/util/event.c                       |  11 ++
> >  tools/perf/util/events_stats.h                |   2 +
> >  tools/perf/util/hist.c                        |  90 ++++++++++++---
> >  tools/perf/util/hist.h                        |  32 +++++-
> >  tools/perf/util/machine.c                     |   7 ++
> >  tools/perf/util/machine.h                     |   6 +
> >  tools/perf/util/sample.h                      |   2 +-
> >  tools/perf/util/session.c                     |  12 ++
> >  tools/perf/util/session.h                     |   1 +
> >  tools/perf/util/sort.c                        |  69 +++++++++++-
> >  tools/perf/util/sort.h                        |   3 +-
> >  tools/perf/util/symbol.c                      |  34 ++++++
> >  tools/perf/util/symbol_conf.h                 |   8 +-
>
> We traditionally didn't do it, but in general test coverage
> of perf report is too low, so I would recommend to add some simple
> test case in the perf test scripts.

What part of this is testable within the current testing framework?
Also, how do I run the tests? I failed to figure it out.
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Andi Kleen 1 year ago
On Thu, Feb 06, 2025 at 07:41:00PM +0100, Dmitry Vyukov wrote:
> On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
> >
> > Dmitry Vyukov <dvyukov@google.com> writes:
> >
> > > There are two notions of time: wall-clock time and CPU time.
> > > For a single-threaded program, or a program running on a single-core
> > > machine, these notions are the same. However, for a multi-threaded/
> > > multi-process program running on a multi-core machine, these notions are
> > > significantly different. Each second of wall-clock time we have
> > > number-of-cores seconds of CPU time.
> >
> > I'm curious how does this interact with the time / --time-quantum sort key?
> >
> > I assume it just works, but might be worth checking.
> 
> I will check later. But if you have some concrete commands to try, it
> will help. I never used --time-quantum before.

perf report --sort time,overhead,sym 

It just slices perf.data into time slices so you get a time series
instead of a full aggregation.

--time-quantum is optional, but sets the slice length.
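For example, combining the two (values as used elsewhere in this thread):

  perf report --sort time,overhead,sym --time-quantum=1s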

> > >  tools/perf/util/symbol_conf.h                 |   8 +-
> >
> > We traditionally didn't do it, but in general test coverage
> > of perf report is too low, so I would recommend to add some simple
> > test case in the perf test scripts.
> 
> What of this is testable within the current testing framework?

You can write a shell script that runs it and does 
some basic sanity checking. tests/shell has a lot of examples.
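For instance, a minimal sketch in that style (hypothetical, not part of
this series; flag spellings as described in the cover letter):

  #!/bin/bash
  # Sanity check: latency report runs and prints the Latency column.
  set -e
  data=$(mktemp /tmp/perf.latency.XXXXX)
  trap 'rm -f "$data"' EXIT
  # Record a short parallel workload with the --latency opt-in.
  perf record -o "$data" --latency -- perf bench sched messaging -l 50
  # The report must succeed and show the latency field.
  perf report -i "$data" --latency --stdio | grep -q -i latency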

If you don't do that, someone will break it, as happened
to some of my features :/

> Also how do I run tests? I failed to figure it out.

Just "perf test"


-Andi
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Andi Kleen 1 year ago
> > Also how do I run tests? I failed to figure it out.
> 
> Just "perf test"

Or actually did you mean you can't run it in the build directory?

For that you may need my patch  
https://lore.kernel.org/linux-perf-users/20240813213651.1057362-2-ak@linux.intel.com/

(sadly it's still not applied)

-Andi
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Ian Rogers 1 year ago
On Thu, Feb 6, 2025 at 10:41 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
[snip]
> > We traditionally didn't do it, but in general test coverage
> > of perf report is too low, so I would recommend to add some simple
> > test case in the perf test scripts.
>
> What of this is testable within the current testing framework?
> Also how do I run tests? I failed to figure it out.

Often just having a test that ensures a command doesn't segfault is
progress :-) The shell tests Andi mentions are here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell?h=perf-tools-next
There's no explicit `perf report` test there but maybe the annotate,
diff or record tests could give you some ideas.

Thanks,
Ian
Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Namhyung Kim 1 year ago
On Thu, Feb 06, 2025 at 10:51:16AM -0800, Ian Rogers wrote:
> On Thu, Feb 6, 2025 at 10:41 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
> [snip]
> > > We traditionally didn't do it, but in general test coverage
> > > of perf report is too low, so I would recommend to add some simple
> > > test case in the perf test scripts.
> >
> > What of this is testable within the current testing framework?
> > Also how do I run tests? I failed to figure it out.
> 
> Often just having a test that ensure a command doesn't segfault is
> progress :-) The shell tests Andi mentions are here:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell?h=perf-tools-next
> There's no explicit `perf report` test there but maybe the annotate,
> diff or record tests could give you some ideas.

Well, we have. :)

  $ ./perf test -vv report
  116: perftool-testsuite_report:
  --- start ---
  test child forked, pid 109653
  -- [ PASS ] -- perf_report :: setup :: prepare the perf.data file
  ## [ PASS ] ## perf_report :: setup SUMMARY
  -- [ SKIP ] -- perf_report :: test_basic :: help message :: testcase skipped
  -- [ PASS ] -- perf_report :: test_basic :: basic execution
  -- [ PASS ] -- perf_report :: test_basic :: number of samples
  -- [ PASS ] -- perf_report :: test_basic :: header
  -- [ PASS ] -- perf_report :: test_basic :: header timestamp
  -- [ PASS ] -- perf_report :: test_basic :: show CPU utilization
  -- [ PASS ] -- perf_report :: test_basic :: pid
  -- [ PASS ] -- perf_report :: test_basic :: non-existing symbol
  -- [ PASS ] -- perf_report :: test_basic :: symbol filter
  ## [ PASS ] ## perf_report :: test_basic SUMMARY
  ---- end(0) ----
  116: perftool-testsuite_report                                       : Ok

Thanks,
Namhyung

Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
Posted by Dmitry Vyukov 1 year ago
On Fri, 7 Feb 2025 at 04:57, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Thu, Feb 06, 2025 at 10:51:16AM -0800, Ian Rogers wrote:
> > On Thu, Feb 6, 2025 at 10:41 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > > On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
> > [snip]
> > > > We traditionally didn't do it, but in general test coverage
> > > > of perf report is too low, so I would recommend to add some simple
> > > > test case in the perf test scripts.
> > >
> > > What of this is testable within the current testing framework?
> > > Also how do I run tests? I failed to figure it out.
> >
> > Often just having a test that ensure a command doesn't segfault is
> > progress :-) The shell tests Andi mentions are here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell?h=perf-tools-next
> > There's no explicit `perf report` test there but maybe the annotate,
> > diff or record tests could give you some ideas.
>
> Well, we have. :)
>
>   $ ./perf test -vv report
>   116: perftool-testsuite_report:
>   --- start ---
>   test child forked, pid 109653
>   -- [ PASS ] -- perf_report :: setup :: prepare the perf.data file
>   ## [ PASS ] ## perf_report :: setup SUMMARY
>   -- [ SKIP ] -- perf_report :: test_basic :: help message :: testcase skipped
>   -- [ PASS ] -- perf_report :: test_basic :: basic execution
>   -- [ PASS ] -- perf_report :: test_basic :: number of samples
>   -- [ PASS ] -- perf_report :: test_basic :: header
>   -- [ PASS ] -- perf_report :: test_basic :: header timestamp
>   -- [ PASS ] -- perf_report :: test_basic :: show CPU utilization
>   -- [ PASS ] -- perf_report :: test_basic :: pid
>   -- [ PASS ] -- perf_report :: test_basic :: non-existing symbol
>   -- [ PASS ] -- perf_report :: test_basic :: symbol filter
>   ## [ PASS ] ## perf_report :: test_basic SUMMARY
>   ---- end(0) ----
>   116: perftool-testsuite_report                                       : Ok
>
> Thanks,
> Namhyung


Thanks, Namhyung, Andi, Ian, this is useful.

Added tests in v6:
https://lore.kernel.org/linux-perf-users/eb8506dfa5998da3b891ba3d36f7ed4a7db4ca2b.1738928210.git.dvyukov@google.com/T/#u