[v5] perf report: Add latency and parallelism profiling

[PATCH v5 6/8] perf report: Add --latency flag

Posted by Dmitry Vyukov 1 year ago

Add record/report --latency flag that allows to capture and show
latency-centric profiles rather than the default CPU-consumption-centric
profiles. For latency profiles record captures context switch events,
and report shows Latency as the first column.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

---
Changes in v5:
 - added description of --latency flag in Documentation flags
---
 tools/perf/Documentation/perf-record.txt |  4 +++
 tools/perf/Documentation/perf-report.txt |  5 +++
 tools/perf/builtin-record.c              | 20 ++++++++++++
 tools/perf/builtin-report.c              | 32 +++++++++++++++---
 tools/perf/ui/hist.c                     | 41 +++++++++++++++++++-----
 tools/perf/util/hist.h                   |  1 +
 tools/perf/util/sort.c                   | 33 +++++++++++++++----
 tools/perf/util/sort.h                   |  2 +-
 tools/perf/util/symbol_conf.h            |  4 ++-
 9 files changed, 122 insertions(+), 20 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 80686d590de24..c7fc1ba265e27 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -227,6 +227,10 @@ OPTIONS
 	'--filter' exists, the new filter expression will be combined with
 	them by '&&'.
 
+--latency::
+	Enable data collection for latency profiling.
+	Use perf report --latency for latency-centric profile.
+
 -a::
 --all-cpus::
         System-wide collection from all CPUs (default if no target is specified).
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 87f8645194062..66794131aec48 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -68,6 +68,11 @@ OPTIONS
 --hide-unresolved::
         Only display entries resolved to a symbol.
 
+--latency::
+        Show latency-centric profile rather than the default
+        CPU-consumption-centric profile
+        (requires perf record --latency flag).
+
 -s::
 --sort=::
 	Sort histogram entries by given key(s) - multiple keys can be specified
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 5db1aedf48df9..e219639ac401b 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -161,6 +161,7 @@ struct record {
 	struct evlist		*sb_evlist;
 	pthread_t		thread_id;
 	int			realtime_prio;
+	bool			latency;
 	bool			switch_output_event_set;
 	bool			no_buildid;
 	bool			no_buildid_set;
@@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
 		     parse_events_option),
 	OPT_CALLBACK(0, "filter", &record.evlist, "filter",
 		     "event filter", parse_filter),
+	OPT_BOOLEAN(0, "latency", &record.latency,
+		    "Enable data collection for latency profiling.\n"
+		    "\t\t\t  Use perf report --latency for latency-centric profile."),
 	OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
 			   NULL, "don't record events from perf itself",
 			   exclude_perf),
@@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
 
 	}
 
+	if (record.latency) {
+		/*
+		 * There is no fundamental reason why latency profiling
+		 * can't work for system-wide mode, but exact semantics
+		 * and details are to be defined.
+		 * See the following thread for details:
+		 * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
+		 */
+		if (record.opts.target.system_wide) {
+			pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
+			err = -EINVAL;
+			goto out_opts;
+		}
+		record.opts.record_switch_events = true;
+	}
+
 	if (rec->buildid_mmap) {
 		if (!perf_can_record_build_id()) {
 			pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 2a19abdc869a1..69de6dbefecfa 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -112,6 +112,8 @@ struct report {
 	u64			nr_entries;
 	u64			queue_size;
 	u64			total_cycles;
+	u64			total_samples;
+	u64			singlethreaded_samples;
 	int			socket_filter;
 	DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 	struct branch_type_stat	brtype_stat;
@@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
 				     &rep->total_cycles, evsel);
 	}
 
+	rep->total_samples++;
+	if (al.parallelism == 1)
+		rep->singlethreaded_samples++;
+
 	ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
 	if (ret < 0)
 		pr_debug("problem adding hist entry, skipping event\n");
@@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
 		return ret;
 	}
 
+	/* Don't show Latency column for non-parallel profiles by default. */
+	if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
+			!symbol_conf.prefer_latency)
+		perf_hpp__cancel_latency();
+
 	evlist__check_mem_load_aux(session->evlist);
 
 	if (rep->stats_mode)
@@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
 		    "Disable raw trace ordering"),
 	OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
 		    "Do not display empty (or dummy) events in the output"),
+	OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
+		    "Show latency-centric profile rather than the default\n"
+		    "\t\t\t  CPU-consumption-centric profile\n"
+		    "\t\t\t  (requires perf record --latency flag)."),
 	OPT_END()
 	};
 	struct perf_data data = {
@@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
 		symbol_conf.annotate_data_sample = true;
 	}
 
+	symbol_conf.enable_latency = true;
 	if (report.disable_order || !perf_session__has_switch_events(session)) {
 		if (symbol_conf.parallelism_list_str ||
-				(sort_order && strstr(sort_order, "parallelism")) ||
-				(field_order && strstr(field_order, "parallelism"))) {
+			symbol_conf.prefer_latency ||
+			(sort_order && (strstr(sort_order, "latency") ||
+				strstr(sort_order, "parallelism"))) ||
+			(field_order && (strstr(field_order, "latency") ||
+				strstr(field_order, "parallelism")))) {
 			if (report.disable_order)
-				ui__error("Use of parallelism is incompatible with --disable-order.\n");
+				ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
 			else
-				ui__error("Use of parallelism requires --switch-events during record.\n");
+				ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
 			return -1;
 		}
+		/*
+		 * If user did not ask for anything related to
+		 * latency/parallelism explicitly, just don't show it.
+		 */
+		symbol_conf.enable_latency = false;
 	}
 
 	if (sort_order && strstr(sort_order, "ipc")) {
diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
index 6de6309595f9e..b453f8eb579cc 100644
--- a/tools/perf/ui/hist.c
+++ b/tools/perf/ui/hist.c
@@ -632,27 +632,36 @@ void perf_hpp__init(void)
 		return;
 
 	if (symbol_conf.cumulate_callchain) {
-		hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
+		/* Use idempotent addition to avoid more complex logic. */
+		if (symbol_conf.prefer_latency)
+			hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
+		hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
+		if (symbol_conf.enable_latency)
+			hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
 		perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
 	}
 
-	hpp_dimension__add_output(PERF_HPP__OVERHEAD);
+	if (symbol_conf.prefer_latency)
+		hpp_dimension__add_output(PERF_HPP__LATENCY, true);
+	hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
+	if (symbol_conf.enable_latency)
+		hpp_dimension__add_output(PERF_HPP__LATENCY, true);
 
 	if (symbol_conf.show_cpu_utilization) {
-		hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
-		hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
+		hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
+		hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
 
 		if (perf_guest) {
-			hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
-			hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
+			hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
+			hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
 		}
 	}
 
 	if (symbol_conf.show_nr_samples)
-		hpp_dimension__add_output(PERF_HPP__SAMPLES);
+		hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
 
 	if (symbol_conf.show_total_period)
-		hpp_dimension__add_output(PERF_HPP__PERIOD);
+		hpp_dimension__add_output(PERF_HPP__PERIOD, false);
 }
 
 void perf_hpp_list__column_register(struct perf_hpp_list *list,
@@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
 	}
 }
 
+void perf_hpp__cancel_latency(void)
+{
+	struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
+
+	if (is_strict_order(field_order) || is_strict_order(sort_order))
+		return;
+
+	lat = &perf_hpp__format[PERF_HPP__LATENCY];
+	acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
+
+	perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
+		if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
+			perf_hpp__column_unregister(fmt);
+	}
+}
+
 void perf_hpp__setup_output_field(struct perf_hpp_list *list)
 {
 	struct perf_hpp_fmt *fmt;
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 91159f16c60b2..29d4c7a3d1747 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -582,6 +582,7 @@ enum {
 
 void perf_hpp__init(void);
 void perf_hpp__cancel_cumulate(void);
+void perf_hpp__cancel_latency(void);
 void perf_hpp__setup_output_field(struct perf_hpp_list *list);
 void perf_hpp__reset_output_field(struct perf_hpp_list *list);
 void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index bc4c3acfe7552..2b6023de7a53a 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -2622,6 +2622,7 @@ struct hpp_dimension {
 	const char		*name;
 	struct perf_hpp_fmt	*fmt;
 	int			taken;
+	int			was_taken;
 };
 
 #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
@@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
 		return -1;
 
 	hd->taken = 1;
+	hd->was_taken = 1;
 	perf_hpp_list__register_sort_field(list, fmt);
 	return 0;
 }
@@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
 	return 0;
 }
 
-int hpp_dimension__add_output(unsigned col)
+int hpp_dimension__add_output(unsigned col, bool implicit)
 {
+	struct hpp_dimension *hd;
+
 	BUG_ON(col >= PERF_HPP__MAX_INDEX);
-	return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
+	hd = &hpp_sort_dimensions[col];
+	if (implicit && !hd->was_taken)
+		return 0;
+	return __hpp_dimension__add_output(&perf_hpp_list, hd);
 }
 
 int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
@@ -3809,10 +3816,24 @@ static char *setup_overhead(char *keys)
 	if (sort__mode == SORT_MODE__DIFF)
 		return keys;
 
-	keys = prefix_if_not_in("overhead", keys);
-
-	if (symbol_conf.cumulate_callchain)
-		keys = prefix_if_not_in("overhead_children", keys);
+	if (symbol_conf.prefer_latency) {
+		keys = prefix_if_not_in("overhead", keys);
+		keys = prefix_if_not_in("latency", keys);
+		if (symbol_conf.cumulate_callchain) {
+			keys = prefix_if_not_in("overhead_children", keys);
+			keys = prefix_if_not_in("latency_children", keys);
+		}
+	} else if (!keys || (!strstr(keys, "overhead") &&
+			!strstr(keys, "latency"))) {
+		if (symbol_conf.enable_latency)
+			keys = prefix_if_not_in("latency", keys);
+		keys = prefix_if_not_in("overhead", keys);
+		if (symbol_conf.cumulate_callchain) {
+			if (symbol_conf.enable_latency)
+				keys = prefix_if_not_in("latency_children", keys);
+			keys = prefix_if_not_in("overhead_children", keys);
+		}
+	}
 
 	return keys;
 }
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 11fb15f914093..180d36a2bea35 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -141,7 +141,7 @@ int report_parse_ignore_callees_opt(const struct option *opt, const char *arg, i
 
 bool is_strict_order(const char *order);
 
-int hpp_dimension__add_output(unsigned col);
+int hpp_dimension__add_output(unsigned col, bool implicit);
 void reset_dimensions(void);
 int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
 			struct evlist *evlist,
diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
index c5b2e56127e22..cd9aa82c7d5ad 100644
--- a/tools/perf/util/symbol_conf.h
+++ b/tools/perf/util/symbol_conf.h
@@ -49,7 +49,9 @@ struct symbol_conf {
 			keep_exited_threads,
 			annotate_data_member,
 			annotate_data_sample,
-			skip_empty;
+			skip_empty,
+			enable_latency,
+			prefer_latency;
 	const char	*vmlinux_name,
 			*kallsyms_name,
 			*source_prefix,
-- 
2.48.1.362.g079036d154-goog

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Namhyung Kim 1 year ago

On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> Add record/report --latency flag that allows to capture and show
> latency-centric profiles rather than the default CPU-consumption-centric
> profiles. For latency profiles record captures context switch events,
> and report shows Latency as the first column.
> 
> Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Ian Rogers <irogers@google.com>
> Cc: linux-perf-users@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> 
> ---
[SNIP]  
> +void perf_hpp__cancel_latency(void)
> +{
> +	struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> +
> +	if (is_strict_order(field_order) || is_strict_order(sort_order))
> +		return;

This also needs to be changed since you call setup_overhead even if you
have a strict sort_order.

For example, it's strange these two are different.

With default sort order:

  $ perf record --latency -- ls /

  $ perf report --stdio
  ...
  # Overhead  Command  Shared Object         Symbol                  
  # ........  .......  ....................  ........................
  #
      64.50%  ls       ld-linux-x86-64.so.2  [.] do_lookup_x
      33.41%  ls       [kernel.kallsyms]     [k] chacha_block_generic
       2.00%  perf-ex  [kernel.kallsyms]     [k] put_ctx
       0.09%  perf-ex  [kernel.kallsyms]     [k] native_write_msr
  
Same sort order, but explicitly:

  $ perf report --stdio -s comm,dso,sym
  ...
  # Overhead   Latency  Command  Shared Object         Symbol                  
  # ........  ........  .......  ....................  ........................
  #
      64.50%    64.50%  ls       ld-linux-x86-64.so.2  [.] do_lookup_x
      33.41%    33.41%  ls       [kernel.kallsyms]     [k] chacha_block_generic
       2.00%     2.00%  perf-ex  [kernel.kallsyms]     [k] put_ctx
       0.09%     0.09%  perf-ex  [kernel.kallsyms]     [k] native_write_msr

Maybe you want to cancel the latency field even if sort key is given
(unless it has 'latency').

Something like this?

---8<---
@@ -714,7 +715,9 @@ void perf_hpp__cancel_latency(void)
 {
        struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
 
-       if (is_strict_order(field_order) || is_strict_order(sort_order))
+       if (is_strict_order(field_order))
+               return;
+       if (is_strict_order(sort_order) && strstr(sort_order, "latency"))
                return;
 
        lat = &perf_hpp__format[PERF_HPP__LATENCY];
---8<---

Thanks,
Namhyung

> +
> +	lat = &perf_hpp__format[PERF_HPP__LATENCY];
> +	acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> +
> +	perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> +		if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> +			perf_hpp__column_unregister(fmt);
> +	}
> +}
> +
>  void perf_hpp__setup_output_field(struct perf_hpp_list *list)
>  {
>  	struct perf_hpp_fmt *fmt;

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Dmitry Vyukov 1 year ago

On Fri, 7 Feb 2025 at 04:53, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> > Add record/report --latency flag that allows to capture and show
> > latency-centric profiles rather than the default CPU-consumption-centric
> > profiles. For latency profiles record captures context switch events,
> > and report shows Latency as the first column.
> >
> > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > Cc: Namhyung Kim <namhyung@kernel.org>
> > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > Cc: Ian Rogers <irogers@google.com>
> > Cc: linux-perf-users@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> >
> > ---
> [SNIP]
> > +void perf_hpp__cancel_latency(void)
> > +{
> > +     struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> > +
> > +     if (is_strict_order(field_order) || is_strict_order(sort_order))
> > +             return;
>
> This also needs to be changed since you call setup_overhead even if you
> have a strict sort_order.
>
> For example, it's strange these two are different.
>
> With default sort order:
>
>   $ perf record --latency -- ls /
>
>   $ perf report --stdio
>   ...
>   # Overhead  Command  Shared Object         Symbol
>   # ........  .......  ....................  ........................
>   #
>       64.50%  ls       ld-linux-x86-64.so.2  [.] do_lookup_x
>       33.41%  ls       [kernel.kallsyms]     [k] chacha_block_generic
>        2.00%  perf-ex  [kernel.kallsyms]     [k] put_ctx
>        0.09%  perf-ex  [kernel.kallsyms]     [k] native_write_msr
>
> Same sort order, but explicitly:
>
>   $ perf report --stdio -s comm,dso,sym
>   ...
>   # Overhead   Latency  Command  Shared Object         Symbol
>   # ........  ........  .......  ....................  ........................
>   #
>       64.50%    64.50%  ls       ld-linux-x86-64.so.2  [.] do_lookup_x
>       33.41%    33.41%  ls       [kernel.kallsyms]     [k] chacha_block_generic
>        2.00%     2.00%  perf-ex  [kernel.kallsyms]     [k] put_ctx
>        0.09%     0.09%  perf-ex  [kernel.kallsyms]     [k] native_write_msr
>
> Maybe you want to cancel the latency field even if sort key is given
> (unless it has 'latency').
>
> Something like this?

Makes sense! Done in v6.

> ---8<---
> @@ -714,7 +715,9 @@ void perf_hpp__cancel_latency(void)
>  {
>         struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
>
> -       if (is_strict_order(field_order) || is_strict_order(sort_order))
> +       if (is_strict_order(field_order))
> +               return;
> +       if (is_strict_order(sort_order) && strstr(sort_order, "latency"))
>                 return;
>
>         lat = &perf_hpp__format[PERF_HPP__LATENCY];
> ---8<---
>
> Thanks,
> Namhyung
>
> > +
> > +     lat = &perf_hpp__format[PERF_HPP__LATENCY];
> > +     acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> > +
> > +     perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> > +             if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> > +                     perf_hpp__column_unregister(fmt);
> > +     }
> > +}
> > +
> >  void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> >  {
> >       struct perf_hpp_fmt *fmt;

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Namhyung Kim 1 year ago

On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> Add record/report --latency flag that allows to capture and show
> latency-centric profiles rather than the default CPU-consumption-centric
> profiles. For latency profiles record captures context switch events,
> and report shows Latency as the first column.
> 
> Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Ian Rogers <irogers@google.com>
> Cc: linux-perf-users@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> 
> ---
> Changes in v5:
>  - added description of --latency flag in Documentation flags
> ---
>  tools/perf/Documentation/perf-record.txt |  4 +++
>  tools/perf/Documentation/perf-report.txt |  5 +++
>  tools/perf/builtin-record.c              | 20 ++++++++++++
>  tools/perf/builtin-report.c              | 32 +++++++++++++++---
>  tools/perf/ui/hist.c                     | 41 +++++++++++++++++++-----
>  tools/perf/util/hist.h                   |  1 +
>  tools/perf/util/sort.c                   | 33 +++++++++++++++----
>  tools/perf/util/sort.h                   |  2 +-
>  tools/perf/util/symbol_conf.h            |  4 ++-
>  9 files changed, 122 insertions(+), 20 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index 80686d590de24..c7fc1ba265e27 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -227,6 +227,10 @@ OPTIONS
>  	'--filter' exists, the new filter expression will be combined with
>  	them by '&&'.
>  
> +--latency::
> +	Enable data collection for latency profiling.
> +	Use perf report --latency for latency-centric profile.
> +
>  -a::
>  --all-cpus::
>          System-wide collection from all CPUs (default if no target is specified).
> diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> index 87f8645194062..66794131aec48 100644
> --- a/tools/perf/Documentation/perf-report.txt
> +++ b/tools/perf/Documentation/perf-report.txt
> @@ -68,6 +68,11 @@ OPTIONS
>  --hide-unresolved::
>          Only display entries resolved to a symbol.
>  
> +--latency::
> +        Show latency-centric profile rather than the default
> +        CPU-consumption-centric profile
> +        (requires perf record --latency flag).
> +
>  -s::
>  --sort=::
>  	Sort histogram entries by given key(s) - multiple keys can be specified
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 5db1aedf48df9..e219639ac401b 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -161,6 +161,7 @@ struct record {
>  	struct evlist		*sb_evlist;
>  	pthread_t		thread_id;
>  	int			realtime_prio;
> +	bool			latency;
>  	bool			switch_output_event_set;
>  	bool			no_buildid;
>  	bool			no_buildid_set;
> @@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
>  		     parse_events_option),
>  	OPT_CALLBACK(0, "filter", &record.evlist, "filter",
>  		     "event filter", parse_filter),
> +	OPT_BOOLEAN(0, "latency", &record.latency,
> +		    "Enable data collection for latency profiling.\n"
> +		    "\t\t\t  Use perf report --latency for latency-centric profile."),
>  	OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
>  			   NULL, "don't record events from perf itself",
>  			   exclude_perf),
> @@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
>  
>  	}
>  
> +	if (record.latency) {
> +		/*
> +		 * There is no fundamental reason why latency profiling
> +		 * can't work for system-wide mode, but exact semantics
> +		 * and details are to be defined.
> +		 * See the following thread for details:
> +		 * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
> +		 */
> +		if (record.opts.target.system_wide) {
> +			pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
> +			err = -EINVAL;
> +			goto out_opts;
> +		}
> +		record.opts.record_switch_events = true;
> +	}
> +
>  	if (rec->buildid_mmap) {
>  		if (!perf_can_record_build_id()) {
>  			pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index 2a19abdc869a1..69de6dbefecfa 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -112,6 +112,8 @@ struct report {
>  	u64			nr_entries;
>  	u64			queue_size;
>  	u64			total_cycles;
> +	u64			total_samples;
> +	u64			singlethreaded_samples;
>  	int			socket_filter;
>  	DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
>  	struct branch_type_stat	brtype_stat;
> @@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
>  				     &rep->total_cycles, evsel);
>  	}
>  
> +	rep->total_samples++;
> +	if (al.parallelism == 1)
> +		rep->singlethreaded_samples++;
> +
>  	ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
>  	if (ret < 0)
>  		pr_debug("problem adding hist entry, skipping event\n");
> @@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
>  		return ret;
>  	}
>  
> +	/* Don't show Latency column for non-parallel profiles by default. */
> +	if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
> +			!symbol_conf.prefer_latency)
> +		perf_hpp__cancel_latency();
> +
>  	evlist__check_mem_load_aux(session->evlist);
>  
>  	if (rep->stats_mode)
> @@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
>  		    "Disable raw trace ordering"),
>  	OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
>  		    "Do not display empty (or dummy) events in the output"),
> +	OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
> +		    "Show latency-centric profile rather than the default\n"
> +		    "\t\t\t  CPU-consumption-centric profile\n"
> +		    "\t\t\t  (requires perf record --latency flag)."),
>  	OPT_END()
>  	};
>  	struct perf_data data = {
> @@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
>  		symbol_conf.annotate_data_sample = true;
>  	}
>  
> +	symbol_conf.enable_latency = true;
>  	if (report.disable_order || !perf_session__has_switch_events(session)) {
>  		if (symbol_conf.parallelism_list_str ||
> -				(sort_order && strstr(sort_order, "parallelism")) ||
> -				(field_order && strstr(field_order, "parallelism"))) {
> +			symbol_conf.prefer_latency ||
> +			(sort_order && (strstr(sort_order, "latency") ||
> +				strstr(sort_order, "parallelism"))) ||
> +			(field_order && (strstr(field_order, "latency") ||
> +				strstr(field_order, "parallelism")))) {
>  			if (report.disable_order)
> -				ui__error("Use of parallelism is incompatible with --disable-order.\n");
> +				ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
>  			else
> -				ui__error("Use of parallelism requires --switch-events during record.\n");
> +				ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
>  			return -1;
>  		}
> +		/*
> +		 * If user did not ask for anything related to
> +		 * latency/parallelism explicitly, just don't show it.
> +		 */
> +		symbol_conf.enable_latency = false;
>  	}
>  
>  	if (sort_order && strstr(sort_order, "ipc")) {
> diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
> index 6de6309595f9e..b453f8eb579cc 100644
> --- a/tools/perf/ui/hist.c
> +++ b/tools/perf/ui/hist.c
> @@ -632,27 +632,36 @@ void perf_hpp__init(void)
>  		return;
>  
>  	if (symbol_conf.cumulate_callchain) {
> -		hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
> +		/* Use idempotent addition to avoid more complex logic. */
> +		if (symbol_conf.prefer_latency)
> +			hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> +		hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
> +		if (symbol_conf.enable_latency)
> +			hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
>  		perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
>  	}
>  
> -	hpp_dimension__add_output(PERF_HPP__OVERHEAD);
> +	if (symbol_conf.prefer_latency)
> +		hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> +	hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
> +	if (symbol_conf.enable_latency)
> +		hpp_dimension__add_output(PERF_HPP__LATENCY, true);
>  
>  	if (symbol_conf.show_cpu_utilization) {
> -		hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
> -		hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
> +		hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
> +		hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
>  
>  		if (perf_guest) {
> -			hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
> -			hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
> +			hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
> +			hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
>  		}
>  	}
>  
>  	if (symbol_conf.show_nr_samples)
> -		hpp_dimension__add_output(PERF_HPP__SAMPLES);
> +		hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
>  
>  	if (symbol_conf.show_total_period)
> -		hpp_dimension__add_output(PERF_HPP__PERIOD);
> +		hpp_dimension__add_output(PERF_HPP__PERIOD, false);
>  }
>  
>  void perf_hpp_list__column_register(struct perf_hpp_list *list,
> @@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
>  	}
>  }
>  
> +void perf_hpp__cancel_latency(void)
> +{
> +	struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> +
> +	if (is_strict_order(field_order) || is_strict_order(sort_order))
> +		return;
> +
> +	lat = &perf_hpp__format[PERF_HPP__LATENCY];
> +	acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> +
> +	perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> +		if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> +			perf_hpp__column_unregister(fmt);
> +	}
> +}
> +
>  void perf_hpp__setup_output_field(struct perf_hpp_list *list)
>  {
>  	struct perf_hpp_fmt *fmt;
> diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> index 91159f16c60b2..29d4c7a3d1747 100644
> --- a/tools/perf/util/hist.h
> +++ b/tools/perf/util/hist.h
> @@ -582,6 +582,7 @@ enum {
>  
>  void perf_hpp__init(void);
>  void perf_hpp__cancel_cumulate(void);
> +void perf_hpp__cancel_latency(void);
>  void perf_hpp__setup_output_field(struct perf_hpp_list *list);
>  void perf_hpp__reset_output_field(struct perf_hpp_list *list);
>  void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index bc4c3acfe7552..2b6023de7a53a 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -2622,6 +2622,7 @@ struct hpp_dimension {
>  	const char		*name;
>  	struct perf_hpp_fmt	*fmt;
>  	int			taken;
> +	int			was_taken;
>  };
>  
>  #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
> @@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
>  		return -1;
>  
>  	hd->taken = 1;
> +	hd->was_taken = 1;
>  	perf_hpp_list__register_sort_field(list, fmt);
>  	return 0;
>  }
> @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
>  	return 0;
>  }
>  
> -int hpp_dimension__add_output(unsigned col)
> +int hpp_dimension__add_output(unsigned col, bool implicit)
>  {
> +	struct hpp_dimension *hd;
> +
>  	BUG_ON(col >= PERF_HPP__MAX_INDEX);
> -	return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> +	hd = &hpp_sort_dimensions[col];
> +	if (implicit && !hd->was_taken)
> +		return 0;

I don't think you need these implicit and was_taken things.
Just removing from the sort list when it's unregistered seems to work.

---8<---
@@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
 static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
 {
        list_del_init(&format->list);
+       list_del_init(&format->sort_list);
        fmt_free(format);
 }
 
---8<---

Thanks,
Namhyung


> +	return __hpp_dimension__add_output(&perf_hpp_list, hd);
>  }
>  
>  int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
> @@ -3809,10 +3816,24 @@ static char *setup_overhead(char *keys)
>  	if (sort__mode == SORT_MODE__DIFF)
>  		return keys;
>  
> -	keys = prefix_if_not_in("overhead", keys);
> -
> -	if (symbol_conf.cumulate_callchain)
> -		keys = prefix_if_not_in("overhead_children", keys);
> +	if (symbol_conf.prefer_latency) {
> +		keys = prefix_if_not_in("overhead", keys);
> +		keys = prefix_if_not_in("latency", keys);
> +		if (symbol_conf.cumulate_callchain) {
> +			keys = prefix_if_not_in("overhead_children", keys);
> +			keys = prefix_if_not_in("latency_children", keys);
> +		}
> +	} else if (!keys || (!strstr(keys, "overhead") &&
> +			!strstr(keys, "latency"))) {
> +		if (symbol_conf.enable_latency)
> +			keys = prefix_if_not_in("latency", keys);
> +		keys = prefix_if_not_in("overhead", keys);
> +		if (symbol_conf.cumulate_callchain) {
> +			if (symbol_conf.enable_latency)
> +				keys = prefix_if_not_in("latency_children", keys);
> +			keys = prefix_if_not_in("overhead_children", keys);
> +		}
> +	}
>  
>  	return keys;
>  }
> diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> index 11fb15f914093..180d36a2bea35 100644
> --- a/tools/perf/util/sort.h
> +++ b/tools/perf/util/sort.h
> @@ -141,7 +141,7 @@ int report_parse_ignore_callees_opt(const struct option *opt, const char *arg, i
>  
>  bool is_strict_order(const char *order);
>  
> -int hpp_dimension__add_output(unsigned col);
> +int hpp_dimension__add_output(unsigned col, bool implicit);
>  void reset_dimensions(void);
>  int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
>  			struct evlist *evlist,
> diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
> index c5b2e56127e22..cd9aa82c7d5ad 100644
> --- a/tools/perf/util/symbol_conf.h
> +++ b/tools/perf/util/symbol_conf.h
> @@ -49,7 +49,9 @@ struct symbol_conf {
>  			keep_exited_threads,
>  			annotate_data_member,
>  			annotate_data_sample,
> -			skip_empty;
> +			skip_empty,
> +			enable_latency,
> +			prefer_latency;
>  	const char	*vmlinux_name,
>  			*kallsyms_name,
>  			*source_prefix,
> -- 
> 2.48.1.362.g079036d154-goog
>

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Dmitry Vyukov 1 year ago

On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> > Add record/report --latency flag that allows to capture and show
> > latency-centric profiles rather than the default CPU-consumption-centric
> > profiles. For latency profiles record captures context switch events,
> > and report shows Latency as the first column.
> >
> > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > Cc: Namhyung Kim <namhyung@kernel.org>
> > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > Cc: Ian Rogers <irogers@google.com>
> > Cc: linux-perf-users@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> >
> > ---
> > Changes in v5:
> >  - added description of --latency flag in Documentation flags
> > ---
> >  tools/perf/Documentation/perf-record.txt |  4 +++
> >  tools/perf/Documentation/perf-report.txt |  5 +++
> >  tools/perf/builtin-record.c              | 20 ++++++++++++
> >  tools/perf/builtin-report.c              | 32 +++++++++++++++---
> >  tools/perf/ui/hist.c                     | 41 +++++++++++++++++++-----
> >  tools/perf/util/hist.h                   |  1 +
> >  tools/perf/util/sort.c                   | 33 +++++++++++++++----
> >  tools/perf/util/sort.h                   |  2 +-
> >  tools/perf/util/symbol_conf.h            |  4 ++-
> >  9 files changed, 122 insertions(+), 20 deletions(-)
> >
> > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > index 80686d590de24..c7fc1ba265e27 100644
> > --- a/tools/perf/Documentation/perf-record.txt
> > +++ b/tools/perf/Documentation/perf-record.txt
> > @@ -227,6 +227,10 @@ OPTIONS
> >       '--filter' exists, the new filter expression will be combined with
> >       them by '&&'.
> >
> > +--latency::
> > +     Enable data collection for latency profiling.
> > +     Use perf report --latency for latency-centric profile.
> > +
> >  -a::
> >  --all-cpus::
> >          System-wide collection from all CPUs (default if no target is specified).
> > diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> > index 87f8645194062..66794131aec48 100644
> > --- a/tools/perf/Documentation/perf-report.txt
> > +++ b/tools/perf/Documentation/perf-report.txt
> > @@ -68,6 +68,11 @@ OPTIONS
> >  --hide-unresolved::
> >          Only display entries resolved to a symbol.
> >
> > +--latency::
> > +        Show latency-centric profile rather than the default
> > +        CPU-consumption-centric profile
> > +        (requires perf record --latency flag).
> > +
> >  -s::
> >  --sort=::
> >       Sort histogram entries by given key(s) - multiple keys can be specified
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index 5db1aedf48df9..e219639ac401b 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -161,6 +161,7 @@ struct record {
> >       struct evlist           *sb_evlist;
> >       pthread_t               thread_id;
> >       int                     realtime_prio;
> > +     bool                    latency;
> >       bool                    switch_output_event_set;
> >       bool                    no_buildid;
> >       bool                    no_buildid_set;
> > @@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
> >                    parse_events_option),
> >       OPT_CALLBACK(0, "filter", &record.evlist, "filter",
> >                    "event filter", parse_filter),
> > +     OPT_BOOLEAN(0, "latency", &record.latency,
> > +                 "Enable data collection for latency profiling.\n"
> > +                 "\t\t\t  Use perf report --latency for latency-centric profile."),
> >       OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
> >                          NULL, "don't record events from perf itself",
> >                          exclude_perf),
> > @@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
> >
> >       }
> >
> > +     if (record.latency) {
> > +             /*
> > +              * There is no fundamental reason why latency profiling
> > +              * can't work for system-wide mode, but exact semantics
> > +              * and details are to be defined.
> > +              * See the following thread for details:
> > +              * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
> > +              */
> > +             if (record.opts.target.system_wide) {
> > +                     pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
> > +                     err = -EINVAL;
> > +                     goto out_opts;
> > +             }
> > +             record.opts.record_switch_events = true;
> > +     }
> > +
> >       if (rec->buildid_mmap) {
> >               if (!perf_can_record_build_id()) {
> >                       pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
> > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > index 2a19abdc869a1..69de6dbefecfa 100644
> > --- a/tools/perf/builtin-report.c
> > +++ b/tools/perf/builtin-report.c
> > @@ -112,6 +112,8 @@ struct report {
> >       u64                     nr_entries;
> >       u64                     queue_size;
> >       u64                     total_cycles;
> > +     u64                     total_samples;
> > +     u64                     singlethreaded_samples;
> >       int                     socket_filter;
> >       DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
> >       struct branch_type_stat brtype_stat;
> > @@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
> >                                    &rep->total_cycles, evsel);
> >       }
> >
> > +     rep->total_samples++;
> > +     if (al.parallelism == 1)
> > +             rep->singlethreaded_samples++;
> > +
> >       ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
> >       if (ret < 0)
> >               pr_debug("problem adding hist entry, skipping event\n");
> > @@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
> >               return ret;
> >       }
> >
> > +     /* Don't show Latency column for non-parallel profiles by default. */
> > +     if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
> > +                     !symbol_conf.prefer_latency)
> > +             perf_hpp__cancel_latency();
> > +
> >       evlist__check_mem_load_aux(session->evlist);
> >
> >       if (rep->stats_mode)
> > @@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
> >                   "Disable raw trace ordering"),
> >       OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
> >                   "Do not display empty (or dummy) events in the output"),
> > +     OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
> > +                 "Show latency-centric profile rather than the default\n"
> > +                 "\t\t\t  CPU-consumption-centric profile\n"
> > +                 "\t\t\t  (requires perf record --latency flag)."),
> >       OPT_END()
> >       };
> >       struct perf_data data = {
> > @@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
> >               symbol_conf.annotate_data_sample = true;
> >       }
> >
> > +     symbol_conf.enable_latency = true;
> >       if (report.disable_order || !perf_session__has_switch_events(session)) {
> >               if (symbol_conf.parallelism_list_str ||
> > -                             (sort_order && strstr(sort_order, "parallelism")) ||
> > -                             (field_order && strstr(field_order, "parallelism"))) {
> > +                     symbol_conf.prefer_latency ||
> > +                     (sort_order && (strstr(sort_order, "latency") ||
> > +                             strstr(sort_order, "parallelism"))) ||
> > +                     (field_order && (strstr(field_order, "latency") ||
> > +                             strstr(field_order, "parallelism")))) {
> >                       if (report.disable_order)
> > -                             ui__error("Use of parallelism is incompatible with --disable-order.\n");
> > +                             ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
> >                       else
> > -                             ui__error("Use of parallelism requires --switch-events during record.\n");
> > +                             ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
> >                       return -1;
> >               }
> > +             /*
> > +              * If user did not ask for anything related to
> > +              * latency/parallelism explicitly, just don't show it.
> > +              */
> > +             symbol_conf.enable_latency = false;
> >       }
> >
> >       if (sort_order && strstr(sort_order, "ipc")) {
> > diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
> > index 6de6309595f9e..b453f8eb579cc 100644
> > --- a/tools/perf/ui/hist.c
> > +++ b/tools/perf/ui/hist.c
> > @@ -632,27 +632,36 @@ void perf_hpp__init(void)
> >               return;
> >
> >       if (symbol_conf.cumulate_callchain) {
> > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
> > +             /* Use idempotent addition to avoid more complex logic. */
> > +             if (symbol_conf.prefer_latency)
> > +                     hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
> > +             if (symbol_conf.enable_latency)
> > +                     hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> >               perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
> >       }
> >
> > -     hpp_dimension__add_output(PERF_HPP__OVERHEAD);
> > +     if (symbol_conf.prefer_latency)
> > +             hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > +     hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
> > +     if (symbol_conf.enable_latency)
> > +             hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> >
> >       if (symbol_conf.show_cpu_utilization) {
> > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
> > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
> > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
> > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
> >
> >               if (perf_guest) {
> > -                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
> > -                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
> > +                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
> > +                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
> >               }
> >       }
> >
> >       if (symbol_conf.show_nr_samples)
> > -             hpp_dimension__add_output(PERF_HPP__SAMPLES);
> > +             hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
> >
> >       if (symbol_conf.show_total_period)
> > -             hpp_dimension__add_output(PERF_HPP__PERIOD);
> > +             hpp_dimension__add_output(PERF_HPP__PERIOD, false);
> >  }
> >
> >  void perf_hpp_list__column_register(struct perf_hpp_list *list,
> > @@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
> >       }
> >  }
> >
> > +void perf_hpp__cancel_latency(void)
> > +{
> > +     struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> > +
> > +     if (is_strict_order(field_order) || is_strict_order(sort_order))
> > +             return;
> > +
> > +     lat = &perf_hpp__format[PERF_HPP__LATENCY];
> > +     acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> > +
> > +     perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> > +             if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> > +                     perf_hpp__column_unregister(fmt);
> > +     }
> > +}
> > +
> >  void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> >  {
> >       struct perf_hpp_fmt *fmt;
> > diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> > index 91159f16c60b2..29d4c7a3d1747 100644
> > --- a/tools/perf/util/hist.h
> > +++ b/tools/perf/util/hist.h
> > @@ -582,6 +582,7 @@ enum {
> >
> >  void perf_hpp__init(void);
> >  void perf_hpp__cancel_cumulate(void);
> > +void perf_hpp__cancel_latency(void);
> >  void perf_hpp__setup_output_field(struct perf_hpp_list *list);
> >  void perf_hpp__reset_output_field(struct perf_hpp_list *list);
> >  void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
> > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > index bc4c3acfe7552..2b6023de7a53a 100644
> > --- a/tools/perf/util/sort.c
> > +++ b/tools/perf/util/sort.c
> > @@ -2622,6 +2622,7 @@ struct hpp_dimension {
> >       const char              *name;
> >       struct perf_hpp_fmt     *fmt;
> >       int                     taken;
> > +     int                     was_taken;
> >  };
> >
> >  #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
> > @@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
> >               return -1;
> >
> >       hd->taken = 1;
> > +     hd->was_taken = 1;
> >       perf_hpp_list__register_sort_field(list, fmt);
> >       return 0;
> >  }
> > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> >       return 0;
> >  }
> >
> > -int hpp_dimension__add_output(unsigned col)
> > +int hpp_dimension__add_output(unsigned col, bool implicit)
> >  {
> > +     struct hpp_dimension *hd;
> > +
> >       BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > -     return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > +     hd = &hpp_sort_dimensions[col];
> > +     if (implicit && !hd->was_taken)
> > +             return 0;
>
> I don't think you need these implicit and was_taken things.
> Just removing from the sort list when it's unregistered seems to work.
>
> ---8<---
> @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
>  static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
>  {
>         list_del_init(&format->list);
> +       list_del_init(&format->sort_list);
>         fmt_free(format);
>  }
>
> ---8<---

It merely suppresses the warning, but does not work the same way. See
this for details:
https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/

> Thanks,
> Namhyung
>
>
> > +     return __hpp_dimension__add_output(&perf_hpp_list, hd);
> >  }
> >
> >  int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
> > @@ -3809,10 +3816,24 @@ static char *setup_overhead(char *keys)
> >       if (sort__mode == SORT_MODE__DIFF)
> >               return keys;
> >
> > -     keys = prefix_if_not_in("overhead", keys);
> > -
> > -     if (symbol_conf.cumulate_callchain)
> > -             keys = prefix_if_not_in("overhead_children", keys);
> > +     if (symbol_conf.prefer_latency) {
> > +             keys = prefix_if_not_in("overhead", keys);
> > +             keys = prefix_if_not_in("latency", keys);
> > +             if (symbol_conf.cumulate_callchain) {
> > +                     keys = prefix_if_not_in("overhead_children", keys);
> > +                     keys = prefix_if_not_in("latency_children", keys);
> > +             }
> > +     } else if (!keys || (!strstr(keys, "overhead") &&
> > +                     !strstr(keys, "latency"))) {
> > +             if (symbol_conf.enable_latency)
> > +                     keys = prefix_if_not_in("latency", keys);
> > +             keys = prefix_if_not_in("overhead", keys);
> > +             if (symbol_conf.cumulate_callchain) {
> > +                     if (symbol_conf.enable_latency)
> > +                             keys = prefix_if_not_in("latency_children", keys);
> > +                     keys = prefix_if_not_in("overhead_children", keys);
> > +             }
> > +     }
> >
> >       return keys;
> >  }
> > diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> > index 11fb15f914093..180d36a2bea35 100644
> > --- a/tools/perf/util/sort.h
> > +++ b/tools/perf/util/sort.h
> > @@ -141,7 +141,7 @@ int report_parse_ignore_callees_opt(const struct option *opt, const char *arg, i
> >
> >  bool is_strict_order(const char *order);
> >
> > -int hpp_dimension__add_output(unsigned col);
> > +int hpp_dimension__add_output(unsigned col, bool implicit);
> >  void reset_dimensions(void);
> >  int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
> >                       struct evlist *evlist,
> > diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
> > index c5b2e56127e22..cd9aa82c7d5ad 100644
> > --- a/tools/perf/util/symbol_conf.h
> > +++ b/tools/perf/util/symbol_conf.h
> > @@ -49,7 +49,9 @@ struct symbol_conf {
> >                       keep_exited_threads,
> >                       annotate_data_member,
> >                       annotate_data_sample,
> > -                     skip_empty;
> > +                     skip_empty,
> > +                     enable_latency,
> > +                     prefer_latency;
> >       const char      *vmlinux_name,
> >                       *kallsyms_name,
> >                       *source_prefix,
> > --
> > 2.48.1.362.g079036d154-goog
> >

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Namhyung Kim 12 months ago

On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> > > Add record/report --latency flag that allows to capture and show
> > > latency-centric profiles rather than the default CPU-consumption-centric
> > > profiles. For latency profiles record captures context switch events,
> > > and report shows Latency as the first column.
> > >
> > > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > > Cc: Namhyung Kim <namhyung@kernel.org>
> > > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > > Cc: Ian Rogers <irogers@google.com>
> > > Cc: linux-perf-users@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > >
> > > ---
> > > Changes in v5:
> > >  - added description of --latency flag in Documentation flags
> > > ---
> > >  tools/perf/Documentation/perf-record.txt |  4 +++
> > >  tools/perf/Documentation/perf-report.txt |  5 +++
> > >  tools/perf/builtin-record.c              | 20 ++++++++++++
> > >  tools/perf/builtin-report.c              | 32 +++++++++++++++---
> > >  tools/perf/ui/hist.c                     | 41 +++++++++++++++++++-----
> > >  tools/perf/util/hist.h                   |  1 +
> > >  tools/perf/util/sort.c                   | 33 +++++++++++++++----
> > >  tools/perf/util/sort.h                   |  2 +-
> > >  tools/perf/util/symbol_conf.h            |  4 ++-
> > >  9 files changed, 122 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > > index 80686d590de24..c7fc1ba265e27 100644
> > > --- a/tools/perf/Documentation/perf-record.txt
> > > +++ b/tools/perf/Documentation/perf-record.txt
> > > @@ -227,6 +227,10 @@ OPTIONS
> > >       '--filter' exists, the new filter expression will be combined with
> > >       them by '&&'.
> > >
> > > +--latency::
> > > +     Enable data collection for latency profiling.
> > > +     Use perf report --latency for latency-centric profile.
> > > +
> > >  -a::
> > >  --all-cpus::
> > >          System-wide collection from all CPUs (default if no target is specified).
> > > diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> > > index 87f8645194062..66794131aec48 100644
> > > --- a/tools/perf/Documentation/perf-report.txt
> > > +++ b/tools/perf/Documentation/perf-report.txt
> > > @@ -68,6 +68,11 @@ OPTIONS
> > >  --hide-unresolved::
> > >          Only display entries resolved to a symbol.
> > >
> > > +--latency::
> > > +        Show latency-centric profile rather than the default
> > > +        CPU-consumption-centric profile
> > > +        (requires perf record --latency flag).
> > > +
> > >  -s::
> > >  --sort=::
> > >       Sort histogram entries by given key(s) - multiple keys can be specified
> > > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > > index 5db1aedf48df9..e219639ac401b 100644
> > > --- a/tools/perf/builtin-record.c
> > > +++ b/tools/perf/builtin-record.c
> > > @@ -161,6 +161,7 @@ struct record {
> > >       struct evlist           *sb_evlist;
> > >       pthread_t               thread_id;
> > >       int                     realtime_prio;
> > > +     bool                    latency;
> > >       bool                    switch_output_event_set;
> > >       bool                    no_buildid;
> > >       bool                    no_buildid_set;
> > > @@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
> > >                    parse_events_option),
> > >       OPT_CALLBACK(0, "filter", &record.evlist, "filter",
> > >                    "event filter", parse_filter),
> > > +     OPT_BOOLEAN(0, "latency", &record.latency,
> > > +                 "Enable data collection for latency profiling.\n"
> > > +                 "\t\t\t  Use perf report --latency for latency-centric profile."),
> > >       OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
> > >                          NULL, "don't record events from perf itself",
> > >                          exclude_perf),
> > > @@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
> > >
> > >       }
> > >
> > > +     if (record.latency) {
> > > +             /*
> > > +              * There is no fundamental reason why latency profiling
> > > +              * can't work for system-wide mode, but exact semantics
> > > +              * and details are to be defined.
> > > +              * See the following thread for details:
> > > +              * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
> > > +              */
> > > +             if (record.opts.target.system_wide) {
> > > +                     pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
> > > +                     err = -EINVAL;
> > > +                     goto out_opts;
> > > +             }
> > > +             record.opts.record_switch_events = true;
> > > +     }
> > > +
> > >       if (rec->buildid_mmap) {
> > >               if (!perf_can_record_build_id()) {
> > >                       pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
> > > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > > index 2a19abdc869a1..69de6dbefecfa 100644
> > > --- a/tools/perf/builtin-report.c
> > > +++ b/tools/perf/builtin-report.c
> > > @@ -112,6 +112,8 @@ struct report {
> > >       u64                     nr_entries;
> > >       u64                     queue_size;
> > >       u64                     total_cycles;
> > > +     u64                     total_samples;
> > > +     u64                     singlethreaded_samples;
> > >       int                     socket_filter;
> > >       DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
> > >       struct branch_type_stat brtype_stat;
> > > @@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
> > >                                    &rep->total_cycles, evsel);
> > >       }
> > >
> > > +     rep->total_samples++;
> > > +     if (al.parallelism == 1)
> > > +             rep->singlethreaded_samples++;
> > > +
> > >       ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
> > >       if (ret < 0)
> > >               pr_debug("problem adding hist entry, skipping event\n");
> > > @@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
> > >               return ret;
> > >       }
> > >
> > > +     /* Don't show Latency column for non-parallel profiles by default. */
> > > +     if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
> > > +                     !symbol_conf.prefer_latency)
> > > +             perf_hpp__cancel_latency();
> > > +
> > >       evlist__check_mem_load_aux(session->evlist);
> > >
> > >       if (rep->stats_mode)
> > > @@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
> > >                   "Disable raw trace ordering"),
> > >       OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
> > >                   "Do not display empty (or dummy) events in the output"),
> > > +     OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
> > > +                 "Show latency-centric profile rather than the default\n"
> > > +                 "\t\t\t  CPU-consumption-centric profile\n"
> > > +                 "\t\t\t  (requires perf record --latency flag)."),
> > >       OPT_END()
> > >       };
> > >       struct perf_data data = {
> > > @@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
> > >               symbol_conf.annotate_data_sample = true;
> > >       }
> > >
> > > +     symbol_conf.enable_latency = true;
> > >       if (report.disable_order || !perf_session__has_switch_events(session)) {
> > >               if (symbol_conf.parallelism_list_str ||
> > > -                             (sort_order && strstr(sort_order, "parallelism")) ||
> > > -                             (field_order && strstr(field_order, "parallelism"))) {
> > > +                     symbol_conf.prefer_latency ||
> > > +                     (sort_order && (strstr(sort_order, "latency") ||
> > > +                             strstr(sort_order, "parallelism"))) ||
> > > +                     (field_order && (strstr(field_order, "latency") ||
> > > +                             strstr(field_order, "parallelism")))) {
> > >                       if (report.disable_order)
> > > -                             ui__error("Use of parallelism is incompatible with --disable-order.\n");
> > > +                             ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
> > >                       else
> > > -                             ui__error("Use of parallelism requires --switch-events during record.\n");
> > > +                             ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
> > >                       return -1;
> > >               }
> > > +             /*
> > > +              * If user did not ask for anything related to
> > > +              * latency/parallelism explicitly, just don't show it.
> > > +              */
> > > +             symbol_conf.enable_latency = false;
> > >       }
> > >
> > >       if (sort_order && strstr(sort_order, "ipc")) {
> > > diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
> > > index 6de6309595f9e..b453f8eb579cc 100644
> > > --- a/tools/perf/ui/hist.c
> > > +++ b/tools/perf/ui/hist.c
> > > @@ -632,27 +632,36 @@ void perf_hpp__init(void)
> > >               return;
> > >
> > >       if (symbol_conf.cumulate_callchain) {
> > > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
> > > +             /* Use idempotent addition to avoid more complex logic. */
> > > +             if (symbol_conf.prefer_latency)
> > > +                     hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
> > > +             if (symbol_conf.enable_latency)
> > > +                     hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > >               perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
> > >       }
> > >
> > > -     hpp_dimension__add_output(PERF_HPP__OVERHEAD);
> > > +     if (symbol_conf.prefer_latency)
> > > +             hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > > +     hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
> > > +     if (symbol_conf.enable_latency)
> > > +             hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > >
> > >       if (symbol_conf.show_cpu_utilization) {
> > > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
> > > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
> > > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
> > > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
> > >
> > >               if (perf_guest) {
> > > -                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
> > > -                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
> > > +                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
> > > +                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
> > >               }
> > >       }
> > >
> > >       if (symbol_conf.show_nr_samples)
> > > -             hpp_dimension__add_output(PERF_HPP__SAMPLES);
> > > +             hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
> > >
> > >       if (symbol_conf.show_total_period)
> > > -             hpp_dimension__add_output(PERF_HPP__PERIOD);
> > > +             hpp_dimension__add_output(PERF_HPP__PERIOD, false);
> > >  }
> > >
> > >  void perf_hpp_list__column_register(struct perf_hpp_list *list,
> > > @@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
> > >       }
> > >  }
> > >
> > > +void perf_hpp__cancel_latency(void)
> > > +{
> > > +     struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> > > +
> > > +     if (is_strict_order(field_order) || is_strict_order(sort_order))
> > > +             return;
> > > +
> > > +     lat = &perf_hpp__format[PERF_HPP__LATENCY];
> > > +     acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> > > +
> > > +     perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> > > +             if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> > > +                     perf_hpp__column_unregister(fmt);
> > > +     }
> > > +}
> > > +
> > >  void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> > >  {
> > >       struct perf_hpp_fmt *fmt;
> > > diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> > > index 91159f16c60b2..29d4c7a3d1747 100644
> > > --- a/tools/perf/util/hist.h
> > > +++ b/tools/perf/util/hist.h
> > > @@ -582,6 +582,7 @@ enum {
> > >
> > >  void perf_hpp__init(void);
> > >  void perf_hpp__cancel_cumulate(void);
> > > +void perf_hpp__cancel_latency(void);
> > >  void perf_hpp__setup_output_field(struct perf_hpp_list *list);
> > >  void perf_hpp__reset_output_field(struct perf_hpp_list *list);
> > >  void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
> > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > index bc4c3acfe7552..2b6023de7a53a 100644
> > > --- a/tools/perf/util/sort.c
> > > +++ b/tools/perf/util/sort.c
> > > @@ -2622,6 +2622,7 @@ struct hpp_dimension {
> > >       const char              *name;
> > >       struct perf_hpp_fmt     *fmt;
> > >       int                     taken;
> > > +     int                     was_taken;
> > >  };
> > >
> > >  #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
> > > @@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
> > >               return -1;
> > >
> > >       hd->taken = 1;
> > > +     hd->was_taken = 1;
> > >       perf_hpp_list__register_sort_field(list, fmt);
> > >       return 0;
> > >  }
> > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > >       return 0;
> > >  }
> > >
> > > -int hpp_dimension__add_output(unsigned col)
> > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > >  {
> > > +     struct hpp_dimension *hd;
> > > +
> > >       BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > -     return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > +     hd = &hpp_sort_dimensions[col];
> > > +     if (implicit && !hd->was_taken)
> > > +             return 0;
> >
> > I don't think you need these implicit and was_taken things.
> > Just removing from the sort list when it's unregistered seems to work.
> >
> > ---8<---
> > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> >  static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> >  {
> >         list_del_init(&format->list);
> > +       list_del_init(&format->sort_list);
> >         fmt_free(format);
> >  }
> >
> > ---8<---
> 
> It merely suppresses the warning, but does not work the same way. See
> this for details:
> https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/

But I think it's better to pass --latency option rather than adding it
to -s option.  If you really want to have specific output fields, then
please use -F latency,sym instead.

Also I've realized that it should add one sort key in setup_overhead()
to support hierarchy mode properly.  Something like this?

Thanks,
Namhyung


---8<---
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
 		return keys;
 
 	if (symbol_conf.prefer_latency) {
-		keys = prefix_if_not_in("overhead", keys);
-		keys = prefix_if_not_in("latency", keys);
-		if (symbol_conf.cumulate_callchain) {
-			keys = prefix_if_not_in("overhead_children", keys);
+		if (symbol_conf.cumulate_callchain)
 			keys = prefix_if_not_in("latency_children", keys);
-		}
-	} else if (!keys || (!strstr(keys, "overhead") &&
-			!strstr(keys, "latency"))) {
-		if (symbol_conf.enable_latency)
+		else
 			keys = prefix_if_not_in("latency", keys);
-		keys = prefix_if_not_in("overhead", keys);
-		if (symbol_conf.cumulate_callchain) {
-			if (symbol_conf.enable_latency)
-				keys = prefix_if_not_in("latency_children", keys);
+	} else {
+		if (symbol_conf.cumulate_callchain)
 			keys = prefix_if_not_in("overhead_children", keys);
-		}
+		else
+			keys = prefix_if_not_in("overhead", keys);
 	}
 
 	return keys;
---8<---

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Dmitry Vyukov 12 months ago

On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> > > > Add record/report --latency flag that allows to capture and show
> > > > latency-centric profiles rather than the default CPU-consumption-centric
> > > > profiles. For latency profiles record captures context switch events,
> > > > and report shows Latency as the first column.
> > > >
> > > > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > > > Cc: Namhyung Kim <namhyung@kernel.org>
> > > > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > > > Cc: Ian Rogers <irogers@google.com>
> > > > Cc: linux-perf-users@vger.kernel.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > >
> > > > ---
> > > > Changes in v5:
> > > >  - added description of --latency flag in Documentation flags
> > > > ---
> > > >  tools/perf/Documentation/perf-record.txt |  4 +++
> > > >  tools/perf/Documentation/perf-report.txt |  5 +++
> > > >  tools/perf/builtin-record.c              | 20 ++++++++++++
> > > >  tools/perf/builtin-report.c              | 32 +++++++++++++++---
> > > >  tools/perf/ui/hist.c                     | 41 +++++++++++++++++++-----
> > > >  tools/perf/util/hist.h                   |  1 +
> > > >  tools/perf/util/sort.c                   | 33 +++++++++++++++----
> > > >  tools/perf/util/sort.h                   |  2 +-
> > > >  tools/perf/util/symbol_conf.h            |  4 ++-
> > > >  9 files changed, 122 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > > > index 80686d590de24..c7fc1ba265e27 100644
> > > > --- a/tools/perf/Documentation/perf-record.txt
> > > > +++ b/tools/perf/Documentation/perf-record.txt
> > > > @@ -227,6 +227,10 @@ OPTIONS
> > > >       '--filter' exists, the new filter expression will be combined with
> > > >       them by '&&'.
> > > >
> > > > +--latency::
> > > > +     Enable data collection for latency profiling.
> > > > +     Use perf report --latency for latency-centric profile.
> > > > +
> > > >  -a::
> > > >  --all-cpus::
> > > >          System-wide collection from all CPUs (default if no target is specified).
> > > > diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> > > > index 87f8645194062..66794131aec48 100644
> > > > --- a/tools/perf/Documentation/perf-report.txt
> > > > +++ b/tools/perf/Documentation/perf-report.txt
> > > > @@ -68,6 +68,11 @@ OPTIONS
> > > >  --hide-unresolved::
> > > >          Only display entries resolved to a symbol.
> > > >
> > > > +--latency::
> > > > +        Show latency-centric profile rather than the default
> > > > +        CPU-consumption-centric profile
> > > > +        (requires perf record --latency flag).
> > > > +
> > > >  -s::
> > > >  --sort=::
> > > >       Sort histogram entries by given key(s) - multiple keys can be specified
> > > > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > > > index 5db1aedf48df9..e219639ac401b 100644
> > > > --- a/tools/perf/builtin-record.c
> > > > +++ b/tools/perf/builtin-record.c
> > > > @@ -161,6 +161,7 @@ struct record {
> > > >       struct evlist           *sb_evlist;
> > > >       pthread_t               thread_id;
> > > >       int                     realtime_prio;
> > > > +     bool                    latency;
> > > >       bool                    switch_output_event_set;
> > > >       bool                    no_buildid;
> > > >       bool                    no_buildid_set;
> > > > @@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
> > > >                    parse_events_option),
> > > >       OPT_CALLBACK(0, "filter", &record.evlist, "filter",
> > > >                    "event filter", parse_filter),
> > > > +     OPT_BOOLEAN(0, "latency", &record.latency,
> > > > +                 "Enable data collection for latency profiling.\n"
> > > > +                 "\t\t\t  Use perf report --latency for latency-centric profile."),
> > > >       OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
> > > >                          NULL, "don't record events from perf itself",
> > > >                          exclude_perf),
> > > > @@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
> > > >
> > > >       }
> > > >
> > > > +     if (record.latency) {
> > > > +             /*
> > > > +              * There is no fundamental reason why latency profiling
> > > > +              * can't work for system-wide mode, but exact semantics
> > > > +              * and details are to be defined.
> > > > +              * See the following thread for details:
> > > > +              * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
> > > > +              */
> > > > +             if (record.opts.target.system_wide) {
> > > > +                     pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
> > > > +                     err = -EINVAL;
> > > > +                     goto out_opts;
> > > > +             }
> > > > +             record.opts.record_switch_events = true;
> > > > +     }
> > > > +
> > > >       if (rec->buildid_mmap) {
> > > >               if (!perf_can_record_build_id()) {
> > > >                       pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
> > > > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > > > index 2a19abdc869a1..69de6dbefecfa 100644
> > > > --- a/tools/perf/builtin-report.c
> > > > +++ b/tools/perf/builtin-report.c
> > > > @@ -112,6 +112,8 @@ struct report {
> > > >       u64                     nr_entries;
> > > >       u64                     queue_size;
> > > >       u64                     total_cycles;
> > > > +     u64                     total_samples;
> > > > +     u64                     singlethreaded_samples;
> > > >       int                     socket_filter;
> > > >       DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
> > > >       struct branch_type_stat brtype_stat;
> > > > @@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
> > > >                                    &rep->total_cycles, evsel);
> > > >       }
> > > >
> > > > +     rep->total_samples++;
> > > > +     if (al.parallelism == 1)
> > > > +             rep->singlethreaded_samples++;
> > > > +
> > > >       ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
> > > >       if (ret < 0)
> > > >               pr_debug("problem adding hist entry, skipping event\n");
> > > > @@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
> > > >               return ret;
> > > >       }
> > > >
> > > > +     /* Don't show Latency column for non-parallel profiles by default. */
> > > > +     if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
> > > > +                     !symbol_conf.prefer_latency)
> > > > +             perf_hpp__cancel_latency();
> > > > +
> > > >       evlist__check_mem_load_aux(session->evlist);
> > > >
> > > >       if (rep->stats_mode)
> > > > @@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
> > > >                   "Disable raw trace ordering"),
> > > >       OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
> > > >                   "Do not display empty (or dummy) events in the output"),
> > > > +     OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
> > > > +                 "Show latency-centric profile rather than the default\n"
> > > > +                 "\t\t\t  CPU-consumption-centric profile\n"
> > > > +                 "\t\t\t  (requires perf record --latency flag)."),
> > > >       OPT_END()
> > > >       };
> > > >       struct perf_data data = {
> > > > @@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
> > > >               symbol_conf.annotate_data_sample = true;
> > > >       }
> > > >
> > > > +     symbol_conf.enable_latency = true;
> > > >       if (report.disable_order || !perf_session__has_switch_events(session)) {
> > > >               if (symbol_conf.parallelism_list_str ||
> > > > -                             (sort_order && strstr(sort_order, "parallelism")) ||
> > > > -                             (field_order && strstr(field_order, "parallelism"))) {
> > > > +                     symbol_conf.prefer_latency ||
> > > > +                     (sort_order && (strstr(sort_order, "latency") ||
> > > > +                             strstr(sort_order, "parallelism"))) ||
> > > > +                     (field_order && (strstr(field_order, "latency") ||
> > > > +                             strstr(field_order, "parallelism")))) {
> > > >                       if (report.disable_order)
> > > > -                             ui__error("Use of parallelism is incompatible with --disable-order.\n");
> > > > +                             ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
> > > >                       else
> > > > -                             ui__error("Use of parallelism requires --switch-events during record.\n");
> > > > +                             ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
> > > >                       return -1;
> > > >               }
> > > > +             /*
> > > > +              * If user did not ask for anything related to
> > > > +              * latency/parallelism explicitly, just don't show it.
> > > > +              */
> > > > +             symbol_conf.enable_latency = false;
> > > >       }
> > > >
> > > >       if (sort_order && strstr(sort_order, "ipc")) {
> > > > diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
> > > > index 6de6309595f9e..b453f8eb579cc 100644
> > > > --- a/tools/perf/ui/hist.c
> > > > +++ b/tools/perf/ui/hist.c
> > > > @@ -632,27 +632,36 @@ void perf_hpp__init(void)
> > > >               return;
> > > >
> > > >       if (symbol_conf.cumulate_callchain) {
> > > > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
> > > > +             /* Use idempotent addition to avoid more complex logic. */
> > > > +             if (symbol_conf.prefer_latency)
> > > > +                     hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
> > > > +             if (symbol_conf.enable_latency)
> > > > +                     hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > >               perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
> > > >       }
> > > >
> > > > -     hpp_dimension__add_output(PERF_HPP__OVERHEAD);
> > > > +     if (symbol_conf.prefer_latency)
> > > > +             hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > > > +     hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
> > > > +     if (symbol_conf.enable_latency)
> > > > +             hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > > >
> > > >       if (symbol_conf.show_cpu_utilization) {
> > > > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
> > > > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
> > > > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
> > > > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
> > > >
> > > >               if (perf_guest) {
> > > > -                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
> > > > -                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
> > > > +                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
> > > > +                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
> > > >               }
> > > >       }
> > > >
> > > >       if (symbol_conf.show_nr_samples)
> > > > -             hpp_dimension__add_output(PERF_HPP__SAMPLES);
> > > > +             hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
> > > >
> > > >       if (symbol_conf.show_total_period)
> > > > -             hpp_dimension__add_output(PERF_HPP__PERIOD);
> > > > +             hpp_dimension__add_output(PERF_HPP__PERIOD, false);
> > > >  }
> > > >
> > > >  void perf_hpp_list__column_register(struct perf_hpp_list *list,
> > > > @@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
> > > >       }
> > > >  }
> > > >
> > > > +void perf_hpp__cancel_latency(void)
> > > > +{
> > > > +     struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> > > > +
> > > > +     if (is_strict_order(field_order) || is_strict_order(sort_order))
> > > > +             return;
> > > > +
> > > > +     lat = &perf_hpp__format[PERF_HPP__LATENCY];
> > > > +     acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> > > > +
> > > > +     perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> > > > +             if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> > > > +                     perf_hpp__column_unregister(fmt);
> > > > +     }
> > > > +}
> > > > +
> > > >  void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> > > >  {
> > > >       struct perf_hpp_fmt *fmt;
> > > > diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> > > > index 91159f16c60b2..29d4c7a3d1747 100644
> > > > --- a/tools/perf/util/hist.h
> > > > +++ b/tools/perf/util/hist.h
> > > > @@ -582,6 +582,7 @@ enum {
> > > >
> > > >  void perf_hpp__init(void);
> > > >  void perf_hpp__cancel_cumulate(void);
> > > > +void perf_hpp__cancel_latency(void);
> > > >  void perf_hpp__setup_output_field(struct perf_hpp_list *list);
> > > >  void perf_hpp__reset_output_field(struct perf_hpp_list *list);
> > > >  void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
> > > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > > index bc4c3acfe7552..2b6023de7a53a 100644
> > > > --- a/tools/perf/util/sort.c
> > > > +++ b/tools/perf/util/sort.c
> > > > @@ -2622,6 +2622,7 @@ struct hpp_dimension {
> > > >       const char              *name;
> > > >       struct perf_hpp_fmt     *fmt;
> > > >       int                     taken;
> > > > +     int                     was_taken;
> > > >  };
> > > >
> > > >  #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
> > > > @@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
> > > >               return -1;
> > > >
> > > >       hd->taken = 1;
> > > > +     hd->was_taken = 1;
> > > >       perf_hpp_list__register_sort_field(list, fmt);
> > > >       return 0;
> > > >  }
> > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > >       return 0;
> > > >  }
> > > >
> > > > -int hpp_dimension__add_output(unsigned col)
> > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > >  {
> > > > +     struct hpp_dimension *hd;
> > > > +
> > > >       BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > -     return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > +     hd = &hpp_sort_dimensions[col];
> > > > +     if (implicit && !hd->was_taken)
> > > > +             return 0;
> > >
> > > I don't think you need these implicit and was_taken things.
> > > Just removing from the sort list when it's unregistered seems to work.
> > >
> > > ---8<---
> > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > >  static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > >  {
> > >         list_del_init(&format->list);
> > > +       list_del_init(&format->sort_list);
> > >         fmt_free(format);
> > >  }
> > >
> > > ---8<---
> >
> > It merely suppresses the warning, but does not work the same way. See
> > this for details:
> > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
>
> But I think it's better to pass --latency option rather than adding it
> to -s option.  If you really want to have specific output fields, then
> please use -F latency,sym instead.

-F does not work in all cases, e.g. the command from tips.txt with -F
instead of --sort:

perf report --hierarchy -F latency,parallelism,comm,symbol
Error: --hierarchy and --fields options cannot be used together

I also don't understand how you see it working. Please elaborate.

--latency is no more than a handy present for the default case so that
a user does not need to learn other flags, field names, etc. It's
just:

perf report --latency

Then there are gazillions of custom permutations of various flags with
sort/field orders. It's impossible to second guess what a user means
in all of these cases and simplify with a magic --latency flag. For
example, possible commands may be different just by field order:
"latency", or "latency,overhead", or "overhead,latency", all of the
are "latency" in some sense, but which one of these 3 a user wants in
a particular case is impossible to infer.

Also, I don't see lots of value in using --latency flag instead of
latency sort field. It's just a bit longer command and nothing more.
Am I missing something? If a user is proficient with perf and a custom
specific configuration, I would let users specify what they want
rather than try to second guess their intentions.



> Also I've realized that it should add one sort key in setup_overhead()
> to support hierarchy mode properly.  Something like this?
>
> Thanks,
> Namhyung
>
>
> ---8<---
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
>                 return keys;
>
>         if (symbol_conf.prefer_latency) {
> -               keys = prefix_if_not_in("overhead", keys);
> -               keys = prefix_if_not_in("latency", keys);
> -               if (symbol_conf.cumulate_callchain) {
> -                       keys = prefix_if_not_in("overhead_children", keys);
> +               if (symbol_conf.cumulate_callchain)
>                         keys = prefix_if_not_in("latency_children", keys);
> -               }
> -       } else if (!keys || (!strstr(keys, "overhead") &&
> -                       !strstr(keys, "latency"))) {
> -               if (symbol_conf.enable_latency)
> +               else
>                         keys = prefix_if_not_in("latency", keys);
> -               keys = prefix_if_not_in("overhead", keys);
> -               if (symbol_conf.cumulate_callchain) {
> -                       if (symbol_conf.enable_latency)
> -                               keys = prefix_if_not_in("latency_children", keys);
> +       } else {
> +               if (symbol_conf.cumulate_callchain)
>                         keys = prefix_if_not_in("overhead_children", keys);
> -               }
> +               else
> +                       keys = prefix_if_not_in("overhead", keys);
>         }
>
>         return keys;
> ---8<---
>

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Dmitry Vyukov 12 months ago

On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> > > > Add record/report --latency flag that allows to capture and show
> > > > latency-centric profiles rather than the default CPU-consumption-centric
> > > > profiles. For latency profiles record captures context switch events,
> > > > and report shows Latency as the first column.
> > > >
> > > > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > > > Cc: Namhyung Kim <namhyung@kernel.org>
> > > > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > > > Cc: Ian Rogers <irogers@google.com>
> > > > Cc: linux-perf-users@vger.kernel.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > >
> > > > ---
> > > > Changes in v5:
> > > >  - added description of --latency flag in Documentation flags
> > > > ---
> > > >  tools/perf/Documentation/perf-record.txt |  4 +++
> > > >  tools/perf/Documentation/perf-report.txt |  5 +++
> > > >  tools/perf/builtin-record.c              | 20 ++++++++++++
> > > >  tools/perf/builtin-report.c              | 32 +++++++++++++++---
> > > >  tools/perf/ui/hist.c                     | 41 +++++++++++++++++++-----
> > > >  tools/perf/util/hist.h                   |  1 +
> > > >  tools/perf/util/sort.c                   | 33 +++++++++++++++----
> > > >  tools/perf/util/sort.h                   |  2 +-
> > > >  tools/perf/util/symbol_conf.h            |  4 ++-
> > > >  9 files changed, 122 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > > > index 80686d590de24..c7fc1ba265e27 100644
> > > > --- a/tools/perf/Documentation/perf-record.txt
> > > > +++ b/tools/perf/Documentation/perf-record.txt
> > > > @@ -227,6 +227,10 @@ OPTIONS
> > > >       '--filter' exists, the new filter expression will be combined with
> > > >       them by '&&'.
> > > >
> > > > +--latency::
> > > > +     Enable data collection for latency profiling.
> > > > +     Use perf report --latency for latency-centric profile.
> > > > +
> > > >  -a::
> > > >  --all-cpus::
> > > >          System-wide collection from all CPUs (default if no target is specified).
> > > > diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> > > > index 87f8645194062..66794131aec48 100644
> > > > --- a/tools/perf/Documentation/perf-report.txt
> > > > +++ b/tools/perf/Documentation/perf-report.txt
> > > > @@ -68,6 +68,11 @@ OPTIONS
> > > >  --hide-unresolved::
> > > >          Only display entries resolved to a symbol.
> > > >
> > > > +--latency::
> > > > +        Show latency-centric profile rather than the default
> > > > +        CPU-consumption-centric profile
> > > > +        (requires perf record --latency flag).
> > > > +
> > > >  -s::
> > > >  --sort=::
> > > >       Sort histogram entries by given key(s) - multiple keys can be specified
> > > > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > > > index 5db1aedf48df9..e219639ac401b 100644
> > > > --- a/tools/perf/builtin-record.c
> > > > +++ b/tools/perf/builtin-record.c
> > > > @@ -161,6 +161,7 @@ struct record {
> > > >       struct evlist           *sb_evlist;
> > > >       pthread_t               thread_id;
> > > >       int                     realtime_prio;
> > > > +     bool                    latency;
> > > >       bool                    switch_output_event_set;
> > > >       bool                    no_buildid;
> > > >       bool                    no_buildid_set;
> > > > @@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
> > > >                    parse_events_option),
> > > >       OPT_CALLBACK(0, "filter", &record.evlist, "filter",
> > > >                    "event filter", parse_filter),
> > > > +     OPT_BOOLEAN(0, "latency", &record.latency,
> > > > +                 "Enable data collection for latency profiling.\n"
> > > > +                 "\t\t\t  Use perf report --latency for latency-centric profile."),
> > > >       OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
> > > >                          NULL, "don't record events from perf itself",
> > > >                          exclude_perf),
> > > > @@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
> > > >
> > > >       }
> > > >
> > > > +     if (record.latency) {
> > > > +             /*
> > > > +              * There is no fundamental reason why latency profiling
> > > > +              * can't work for system-wide mode, but exact semantics
> > > > +              * and details are to be defined.
> > > > +              * See the following thread for details:
> > > > +              * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
> > > > +              */
> > > > +             if (record.opts.target.system_wide) {
> > > > +                     pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
> > > > +                     err = -EINVAL;
> > > > +                     goto out_opts;
> > > > +             }
> > > > +             record.opts.record_switch_events = true;
> > > > +     }
> > > > +
> > > >       if (rec->buildid_mmap) {
> > > >               if (!perf_can_record_build_id()) {
> > > >                       pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
> > > > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > > > index 2a19abdc869a1..69de6dbefecfa 100644
> > > > --- a/tools/perf/builtin-report.c
> > > > +++ b/tools/perf/builtin-report.c
> > > > @@ -112,6 +112,8 @@ struct report {
> > > >       u64                     nr_entries;
> > > >       u64                     queue_size;
> > > >       u64                     total_cycles;
> > > > +     u64                     total_samples;
> > > > +     u64                     singlethreaded_samples;
> > > >       int                     socket_filter;
> > > >       DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
> > > >       struct branch_type_stat brtype_stat;
> > > > @@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
> > > >                                    &rep->total_cycles, evsel);
> > > >       }
> > > >
> > > > +     rep->total_samples++;
> > > > +     if (al.parallelism == 1)
> > > > +             rep->singlethreaded_samples++;
> > > > +
> > > >       ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
> > > >       if (ret < 0)
> > > >               pr_debug("problem adding hist entry, skipping event\n");
> > > > @@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
> > > >               return ret;
> > > >       }
> > > >
> > > > +     /* Don't show Latency column for non-parallel profiles by default. */
> > > > +     if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
> > > > +                     !symbol_conf.prefer_latency)
> > > > +             perf_hpp__cancel_latency();
> > > > +
> > > >       evlist__check_mem_load_aux(session->evlist);
> > > >
> > > >       if (rep->stats_mode)
> > > > @@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
> > > >                   "Disable raw trace ordering"),
> > > >       OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
> > > >                   "Do not display empty (or dummy) events in the output"),
> > > > +     OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
> > > > +                 "Show latency-centric profile rather than the default\n"
> > > > +                 "\t\t\t  CPU-consumption-centric profile\n"
> > > > +                 "\t\t\t  (requires perf record --latency flag)."),
> > > >       OPT_END()
> > > >       };
> > > >       struct perf_data data = {
> > > > @@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
> > > >               symbol_conf.annotate_data_sample = true;
> > > >       }
> > > >
> > > > +     symbol_conf.enable_latency = true;
> > > >       if (report.disable_order || !perf_session__has_switch_events(session)) {
> > > >               if (symbol_conf.parallelism_list_str ||
> > > > -                             (sort_order && strstr(sort_order, "parallelism")) ||
> > > > -                             (field_order && strstr(field_order, "parallelism"))) {
> > > > +                     symbol_conf.prefer_latency ||
> > > > +                     (sort_order && (strstr(sort_order, "latency") ||
> > > > +                             strstr(sort_order, "parallelism"))) ||
> > > > +                     (field_order && (strstr(field_order, "latency") ||
> > > > +                             strstr(field_order, "parallelism")))) {
> > > >                       if (report.disable_order)
> > > > -                             ui__error("Use of parallelism is incompatible with --disable-order.\n");
> > > > +                             ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
> > > >                       else
> > > > -                             ui__error("Use of parallelism requires --switch-events during record.\n");
> > > > +                             ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
> > > >                       return -1;
> > > >               }
> > > > +             /*
> > > > +              * If user did not ask for anything related to
> > > > +              * latency/parallelism explicitly, just don't show it.
> > > > +              */
> > > > +             symbol_conf.enable_latency = false;
> > > >       }
> > > >
> > > >       if (sort_order && strstr(sort_order, "ipc")) {
> > > > diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
> > > > index 6de6309595f9e..b453f8eb579cc 100644
> > > > --- a/tools/perf/ui/hist.c
> > > > +++ b/tools/perf/ui/hist.c
> > > > @@ -632,27 +632,36 @@ void perf_hpp__init(void)
> > > >               return;
> > > >
> > > >       if (symbol_conf.cumulate_callchain) {
> > > > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
> > > > +             /* Use idempotent addition to avoid more complex logic. */
> > > > +             if (symbol_conf.prefer_latency)
> > > > +                     hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
> > > > +             if (symbol_conf.enable_latency)
> > > > +                     hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > >               perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
> > > >       }
> > > >
> > > > -     hpp_dimension__add_output(PERF_HPP__OVERHEAD);
> > > > +     if (symbol_conf.prefer_latency)
> > > > +             hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > > > +     hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
> > > > +     if (symbol_conf.enable_latency)
> > > > +             hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > > >
> > > >       if (symbol_conf.show_cpu_utilization) {
> > > > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
> > > > -             hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
> > > > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
> > > > +             hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
> > > >
> > > >               if (perf_guest) {
> > > > -                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
> > > > -                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
> > > > +                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
> > > > +                     hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
> > > >               }
> > > >       }
> > > >
> > > >       if (symbol_conf.show_nr_samples)
> > > > -             hpp_dimension__add_output(PERF_HPP__SAMPLES);
> > > > +             hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
> > > >
> > > >       if (symbol_conf.show_total_period)
> > > > -             hpp_dimension__add_output(PERF_HPP__PERIOD);
> > > > +             hpp_dimension__add_output(PERF_HPP__PERIOD, false);
> > > >  }
> > > >
> > > >  void perf_hpp_list__column_register(struct perf_hpp_list *list,
> > > > @@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
> > > >       }
> > > >  }
> > > >
> > > > +void perf_hpp__cancel_latency(void)
> > > > +{
> > > > +     struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> > > > +
> > > > +     if (is_strict_order(field_order) || is_strict_order(sort_order))
> > > > +             return;
> > > > +
> > > > +     lat = &perf_hpp__format[PERF_HPP__LATENCY];
> > > > +     acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> > > > +
> > > > +     perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> > > > +             if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> > > > +                     perf_hpp__column_unregister(fmt);
> > > > +     }
> > > > +}
> > > > +
> > > >  void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> > > >  {
> > > >       struct perf_hpp_fmt *fmt;
> > > > diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> > > > index 91159f16c60b2..29d4c7a3d1747 100644
> > > > --- a/tools/perf/util/hist.h
> > > > +++ b/tools/perf/util/hist.h
> > > > @@ -582,6 +582,7 @@ enum {
> > > >
> > > >  void perf_hpp__init(void);
> > > >  void perf_hpp__cancel_cumulate(void);
> > > > +void perf_hpp__cancel_latency(void);
> > > >  void perf_hpp__setup_output_field(struct perf_hpp_list *list);
> > > >  void perf_hpp__reset_output_field(struct perf_hpp_list *list);
> > > >  void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
> > > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > > index bc4c3acfe7552..2b6023de7a53a 100644
> > > > --- a/tools/perf/util/sort.c
> > > > +++ b/tools/perf/util/sort.c
> > > > @@ -2622,6 +2622,7 @@ struct hpp_dimension {
> > > >       const char              *name;
> > > >       struct perf_hpp_fmt     *fmt;
> > > >       int                     taken;
> > > > +     int                     was_taken;
> > > >  };
> > > >
> > > >  #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
> > > > @@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
> > > >               return -1;
> > > >
> > > >       hd->taken = 1;
> > > > +     hd->was_taken = 1;
> > > >       perf_hpp_list__register_sort_field(list, fmt);
> > > >       return 0;
> > > >  }
> > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > >       return 0;
> > > >  }
> > > >
> > > > -int hpp_dimension__add_output(unsigned col)
> > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > >  {
> > > > +     struct hpp_dimension *hd;
> > > > +
> > > >       BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > -     return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > +     hd = &hpp_sort_dimensions[col];
> > > > +     if (implicit && !hd->was_taken)
> > > > +             return 0;
> > >
> > > I don't think you need these implicit and was_taken things.
> > > Just removing from the sort list when it's unregistered seems to work.
> > >
> > > ---8<---
> > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > >  static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > >  {
> > >         list_del_init(&format->list);
> > > +       list_del_init(&format->sort_list);
> > >         fmt_free(format);
> > >  }
> > >
> > > ---8<---
> >
> > It merely suppresses the warning, but does not work the same way. See
> > this for details:
> > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
>
> But I think it's better to pass --latency option rather than adding it
> to -s option.  If you really want to have specific output fields, then
> please use -F latency,sym instead.
>
> Also I've realized that it should add one sort key in setup_overhead()
> to support hierarchy mode properly.  Something like this?
>
> Thanks,
> Namhyung
>
>
> ---8<---
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
>                 return keys;
>
>         if (symbol_conf.prefer_latency) {
> -               keys = prefix_if_not_in("overhead", keys);
> -               keys = prefix_if_not_in("latency", keys);
> -               if (symbol_conf.cumulate_callchain) {
> -                       keys = prefix_if_not_in("overhead_children", keys);
> +               if (symbol_conf.cumulate_callchain)
>                         keys = prefix_if_not_in("latency_children", keys);
> -               }
> -       } else if (!keys || (!strstr(keys, "overhead") &&
> -                       !strstr(keys, "latency"))) {
> -               if (symbol_conf.enable_latency)
> +               else
>                         keys = prefix_if_not_in("latency", keys);
> -               keys = prefix_if_not_in("overhead", keys);
> -               if (symbol_conf.cumulate_callchain) {
> -                       if (symbol_conf.enable_latency)
> -                               keys = prefix_if_not_in("latency_children", keys);
> +       } else {
> +               if (symbol_conf.cumulate_callchain)
>                         keys = prefix_if_not_in("overhead_children", keys);
> -               }
> +               else
> +                       keys = prefix_if_not_in("overhead", keys);
>         }
>
>         return keys;


Have I decoded the patch correctly?

        if (symbol_conf.prefer_latency) {
                if (symbol_conf.cumulate_callchain)
                        keys = prefix_if_not_in("latency_children", keys);
                else
                        keys = prefix_if_not_in("latency", keys);
        } else {
                if (symbol_conf.cumulate_callchain)
                        keys = prefix_if_not_in("overhead_children", keys);
                else
                        keys = prefix_if_not_in("overhead", keys);
        }

If I decoded the patch correctly, it's not what we want.

For the default prefer_latency case we also want to add overhead, that
was intentional for the --latency preset. It does not harm, and allows
to see/compare differences in latency and overhead.
Again, if a user wants something custom, there is no way to second
guess all possible intentions. For non-default cases, we just let the
user say what exactly they want, and we will follow that.

"latency" should be added even if cumulate_callchain.

For the !prefer_latency case, we don't want to mess with
overhead/latency fields if the user specified any of them explicitly.
Otherwise this convenience part gets in the user's way and does not
allow them to do what they want. User says "I want X" and perf says
"screw you, I will give you Y instead, and won't allow you to possibly
do X".

And see above: -F does not work with --hierarchy, so this part is unskippable.

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Namhyung Kim 12 months ago

On Tue, Feb 11, 2025 at 09:42:16AM +0100, Dmitry Vyukov wrote:
> On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
[SNIP]
> > > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > > >       return 0;
> > > > >  }
> > > > >
> > > > > -int hpp_dimension__add_output(unsigned col)
> > > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > > >  {
> > > > > +     struct hpp_dimension *hd;
> > > > > +
> > > > >       BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > > -     return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > > +     hd = &hpp_sort_dimensions[col];
> > > > > +     if (implicit && !hd->was_taken)
> > > > > +             return 0;
> > > >
> > > > I don't think you need these implicit and was_taken things.
> > > > Just removing from the sort list when it's unregistered seems to work.
> > > >
> > > > ---8<---
> > > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > > >  static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > > >  {
> > > >         list_del_init(&format->list);
> > > > +       list_del_init(&format->sort_list);
> > > >         fmt_free(format);
> > > >  }
> > > >
> > > > ---8<---
> > >
> > > It merely suppresses the warning, but does not work the same way. See
> > > this for details:
> > > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
> >
> > But I think it's better to pass --latency option rather than adding it
> > to -s option.  If you really want to have specific output fields, then
> > please use -F latency,sym instead.
> >
> > Also I've realized that it should add one sort key in setup_overhead()
> > to support hierarchy mode properly.  Something like this?
> >
> > Thanks,
> > Namhyung
> >
> >
> > ---8<---
> > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> > --- a/tools/perf/util/sort.c
> > +++ b/tools/perf/util/sort.c
> > @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
> >                 return keys;
> >
> >         if (symbol_conf.prefer_latency) {
> > -               keys = prefix_if_not_in("overhead", keys);
> > -               keys = prefix_if_not_in("latency", keys);
> > -               if (symbol_conf.cumulate_callchain) {
> > -                       keys = prefix_if_not_in("overhead_children", keys);
> > +               if (symbol_conf.cumulate_callchain)
> >                         keys = prefix_if_not_in("latency_children", keys);
> > -               }
> > -       } else if (!keys || (!strstr(keys, "overhead") &&
> > -                       !strstr(keys, "latency"))) {
> > -               if (symbol_conf.enable_latency)
> > +               else
> >                         keys = prefix_if_not_in("latency", keys);
> > -               keys = prefix_if_not_in("overhead", keys);
> > -               if (symbol_conf.cumulate_callchain) {
> > -                       if (symbol_conf.enable_latency)
> > -                               keys = prefix_if_not_in("latency_children", keys);
> > +       } else {
> > +               if (symbol_conf.cumulate_callchain)
> >                         keys = prefix_if_not_in("overhead_children", keys);
> > -               }
> > +               else
> > +                       keys = prefix_if_not_in("overhead", keys);
> >         }
> >
> >         return keys;
> 
> 
> Have I decoded the patch correctly?
> 
>         if (symbol_conf.prefer_latency) {
>                 if (symbol_conf.cumulate_callchain)
>                         keys = prefix_if_not_in("latency_children", keys);
>                 else
>                         keys = prefix_if_not_in("latency", keys);
>         } else {
>                 if (symbol_conf.cumulate_callchain)
>                         keys = prefix_if_not_in("overhead_children", keys);
>                 else
>                         keys = prefix_if_not_in("overhead", keys);
>         }
> 

Yep, that's correct.


> If I decoded the patch correctly, it's not what we want.
> 
> For the default prefer_latency case we also want to add overhead, that
> was intentional for the --latency preset. It does not harm, and allows
> to see/compare differences in latency and overhead.
> Again, if a user wants something custom, there is no way to second
> guess all possible intentions. For non-default cases, we just let the
> user say what exactly they want, and we will follow that.
> 
> "latency" should be added even if cumulate_callchain.

Please note that it just sets the sort key - which column you want to
order the result.  The output fields for overhead and children will be
added in perf_hpp__init() if you remove the 'was_taken' logic.  So I
think this change will have the same output with that.

> 
> For the !prefer_latency case, we don't want to mess with
> overhead/latency fields if the user specified any of them explicitly.
> Otherwise this convenience part gets in the user's way and does not
> allow them to do what they want. User says "I want X" and perf says
> "screw you, I will give you Y instead, and won't allow you to possibly
> do X".

That's what -F option does.  The -s option used to specify how to group
the histogram entries and it will add 'overhead' (and/or 'latency') if
it's not even requested.  So I think it's ok to add more output column
when -s option is used.

But unfortunately, using -F and -s together is confusing and change the
meaning of -s option - it now says how it sort the result.

> 
> And see above: -F does not work with --hierarchy, so this part is unskippable.

Yep, but I mean it fixes --hierarchy and --latency.  I'm thinking of a
way to support -F and --hierarchy in general.

Thanks,
Namhyung

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Dmitry Vyukov 12 months ago

On Tue, 11 Feb 2025 at 18:42, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Feb 11, 2025 at 09:42:16AM +0100, Dmitry Vyukov wrote:
> > On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > > > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> [SNIP]
> > > > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > > > >       return 0;
> > > > > >  }
> > > > > >
> > > > > > -int hpp_dimension__add_output(unsigned col)
> > > > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > > > >  {
> > > > > > +     struct hpp_dimension *hd;
> > > > > > +
> > > > > >       BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > > > -     return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > > > +     hd = &hpp_sort_dimensions[col];
> > > > > > +     if (implicit && !hd->was_taken)
> > > > > > +             return 0;
> > > > >
> > > > > I don't think you need these implicit and was_taken things.
> > > > > Just removing from the sort list when it's unregistered seems to work.
> > > > >
> > > > > ---8<---
> > > > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > > > >  static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > > > >  {
> > > > >         list_del_init(&format->list);
> > > > > +       list_del_init(&format->sort_list);
> > > > >         fmt_free(format);
> > > > >  }
> > > > >
> > > > > ---8<---
> > > >
> > > > It merely suppresses the warning, but does not work the same way. See
> > > > this for details:
> > > > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
> > >
> > > But I think it's better to pass --latency option rather than adding it
> > > to -s option.  If you really want to have specific output fields, then
> > > please use -F latency,sym instead.
> > >
> > > Also I've realized that it should add one sort key in setup_overhead()
> > > to support hierarchy mode properly.  Something like this?
> > >
> > > Thanks,
> > > Namhyung
> > >
> > >
> > > ---8<---
> > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> > > --- a/tools/perf/util/sort.c
> > > +++ b/tools/perf/util/sort.c
> > > @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
> > >                 return keys;
> > >
> > >         if (symbol_conf.prefer_latency) {
> > > -               keys = prefix_if_not_in("overhead", keys);
> > > -               keys = prefix_if_not_in("latency", keys);
> > > -               if (symbol_conf.cumulate_callchain) {
> > > -                       keys = prefix_if_not_in("overhead_children", keys);
> > > +               if (symbol_conf.cumulate_callchain)
> > >                         keys = prefix_if_not_in("latency_children", keys);
> > > -               }
> > > -       } else if (!keys || (!strstr(keys, "overhead") &&
> > > -                       !strstr(keys, "latency"))) {
> > > -               if (symbol_conf.enable_latency)
> > > +               else
> > >                         keys = prefix_if_not_in("latency", keys);
> > > -               keys = prefix_if_not_in("overhead", keys);
> > > -               if (symbol_conf.cumulate_callchain) {
> > > -                       if (symbol_conf.enable_latency)
> > > -                               keys = prefix_if_not_in("latency_children", keys);
> > > +       } else {
> > > +               if (symbol_conf.cumulate_callchain)
> > >                         keys = prefix_if_not_in("overhead_children", keys);
> > > -               }
> > > +               else
> > > +                       keys = prefix_if_not_in("overhead", keys);
> > >         }
> > >
> > >         return keys;
> >
> >
> > Have I decoded the patch correctly?
> >
> >         if (symbol_conf.prefer_latency) {
> >                 if (symbol_conf.cumulate_callchain)
> >                         keys = prefix_if_not_in("latency_children", keys);
> >                 else
> >                         keys = prefix_if_not_in("latency", keys);
> >         } else {
> >                 if (symbol_conf.cumulate_callchain)
> >                         keys = prefix_if_not_in("overhead_children", keys);
> >                 else
> >                         keys = prefix_if_not_in("overhead", keys);
> >         }
> >
>
> Yep, that's correct.
>
>
> > If I decoded the patch correctly, it's not what we want.
> >
> > For the default prefer_latency case we also want to add overhead, that
> > was intentional for the --latency preset. It does not harm, and allows
> > to see/compare differences in latency and overhead.
> > Again, if a user wants something custom, there is no way to second
> > guess all possible intentions. For non-default cases, we just let the
> > user say what exactly they want, and we will follow that.
> >
> > "latency" should be added even if cumulate_callchain.
>
> Please note that it just sets the sort key - which column you want to
> order the result.  The output fields for overhead and children will be
> added in perf_hpp__init() if you remove the 'was_taken' logic.  So I
> think this change will have the same output with that.

Yes, but perf_hpp__init() does not have the logic that's currently
contained in setup_overhead().

If the user specified a "latency" field, and we don't want to add
"overhead" in that case, then _both_ setup_overhead() and
perf_hpp__init() must not add "overhead".

If we do what you proposed, then perf_hpp__init() will still add
"overhead" and we go back to square 0.

I used was_taken to not duplicate this non-trivial logic in both
functions. As I mentioned in the previous replies, I tried that but it
was messier/more complex. was_taken is a simple way to not duplicate
logic and keep these functions consistent.


> > For the !prefer_latency case, we don't want to mess with
> > overhead/latency fields if the user specified any of them explicitly.
> > Otherwise this convenience part gets in the user's way and does not
> > allow them to do what they want. User says "I want X" and perf says
> > "screw you, I will give you Y instead, and won't allow you to possibly
> > do X".
>
> That's what -F option does.  The -s option used to specify how to group
> the histogram entries and it will add 'overhead' (and/or 'latency') if
> it's not even requested.  So I think it's ok to add more output column
> when -s option is used.
>
> But unfortunately, using -F and -s together is confusing and change the
> meaning of -s option - it now says how it sort the result.
>
> >
> > And see above: -F does not work with --hierarchy, so this part is unskippable.
>
> Yep, but I mean it fixes --hierarchy and --latency.  I'm thinking of a
> way to support -F and --hierarchy in general.

I don't know why it was disabled. There are likely other things to
improve, but please let's not tie that to this change.

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Namhyung Kim 12 months ago

On Tue, Feb 11, 2025 at 09:23:30PM +0100, Dmitry Vyukov wrote:
> On Tue, 11 Feb 2025 at 18:42, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Tue, Feb 11, 2025 at 09:42:16AM +0100, Dmitry Vyukov wrote:
> > > On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > > > > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> > [SNIP]
> > > > > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > > > > >       return 0;
> > > > > > >  }
> > > > > > >
> > > > > > > -int hpp_dimension__add_output(unsigned col)
> > > > > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > > > > >  {
> > > > > > > +     struct hpp_dimension *hd;
> > > > > > > +
> > > > > > >       BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > > > > -     return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > > > > +     hd = &hpp_sort_dimensions[col];
> > > > > > > +     if (implicit && !hd->was_taken)
> > > > > > > +             return 0;
> > > > > >
> > > > > > I don't think you need these implicit and was_taken things.
> > > > > > Just removing from the sort list when it's unregistered seems to work.
> > > > > >
> > > > > > ---8<---
> > > > > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > > > > >  static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > > > > >  {
> > > > > >         list_del_init(&format->list);
> > > > > > +       list_del_init(&format->sort_list);
> > > > > >         fmt_free(format);
> > > > > >  }
> > > > > >
> > > > > > ---8<---
> > > > >
> > > > > It merely suppresses the warning, but does not work the same way. See
> > > > > this for details:
> > > > > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
> > > >
> > > > But I think it's better to pass --latency option rather than adding it
> > > > to -s option.  If you really want to have specific output fields, then
> > > > please use -F latency,sym instead.
> > > >
> > > > Also I've realized that it should add one sort key in setup_overhead()
> > > > to support hierarchy mode properly.  Something like this?
> > > >
> > > > Thanks,
> > > > Namhyung
> > > >
> > > >
> > > > ---8<---
> > > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > > index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> > > > --- a/tools/perf/util/sort.c
> > > > +++ b/tools/perf/util/sort.c
> > > > @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
> > > >                 return keys;
> > > >
> > > >         if (symbol_conf.prefer_latency) {
> > > > -               keys = prefix_if_not_in("overhead", keys);
> > > > -               keys = prefix_if_not_in("latency", keys);
> > > > -               if (symbol_conf.cumulate_callchain) {
> > > > -                       keys = prefix_if_not_in("overhead_children", keys);
> > > > +               if (symbol_conf.cumulate_callchain)
> > > >                         keys = prefix_if_not_in("latency_children", keys);
> > > > -               }
> > > > -       } else if (!keys || (!strstr(keys, "overhead") &&
> > > > -                       !strstr(keys, "latency"))) {
> > > > -               if (symbol_conf.enable_latency)
> > > > +               else
> > > >                         keys = prefix_if_not_in("latency", keys);
> > > > -               keys = prefix_if_not_in("overhead", keys);
> > > > -               if (symbol_conf.cumulate_callchain) {
> > > > -                       if (symbol_conf.enable_latency)
> > > > -                               keys = prefix_if_not_in("latency_children", keys);
> > > > +       } else {
> > > > +               if (symbol_conf.cumulate_callchain)
> > > >                         keys = prefix_if_not_in("overhead_children", keys);
> > > > -               }
> > > > +               else
> > > > +                       keys = prefix_if_not_in("overhead", keys);
> > > >         }
> > > >
> > > >         return keys;
> > >
> > >
> > > Have I decoded the patch correctly?
> > >
> > >         if (symbol_conf.prefer_latency) {
> > >                 if (symbol_conf.cumulate_callchain)
> > >                         keys = prefix_if_not_in("latency_children", keys);
> > >                 else
> > >                         keys = prefix_if_not_in("latency", keys);
> > >         } else {
> > >                 if (symbol_conf.cumulate_callchain)
> > >                         keys = prefix_if_not_in("overhead_children", keys);
> > >                 else
> > >                         keys = prefix_if_not_in("overhead", keys);
> > >         }
> > >
> >
> > Yep, that's correct.
> >
> >
> > > If I decoded the patch correctly, it's not what we want.
> > >
> > > For the default prefer_latency case we also want to add overhead, that
> > > was intentional for the --latency preset. It does not harm, and allows
> > > to see/compare differences in latency and overhead.
> > > Again, if a user wants something custom, there is no way to second
> > > guess all possible intentions. For non-default cases, we just let the
> > > user say what exactly they want, and we will follow that.
> > >
> > > "latency" should be added even if cumulate_callchain.
> >
> > Please note that it just sets the sort key - which column you want to
> > order the result.  The output fields for overhead and children will be
> > added in perf_hpp__init() if you remove the 'was_taken' logic.  So I
> > think this change will have the same output with that.
> 
> Yes, but perf_hpp__init() does not have the logic that's currently
> contained in setup_overhead().
> 
> If the user specified a "latency" field, and we don't want to add
> "overhead" in that case, then _both_ setup_overhead() and
> perf_hpp__init() must not add "overhead".

Ok, I see your point.  But I think it'd be much easier if you allow the
'overhead' column in that case too.

> 
> If we do what you proposed, then perf_hpp__init() will still add
> "overhead" and we go back to square 0.

Right, but currently the default perf report and with --latency option,
will show both overhead and latency columns.  That's why I thought you
wanted to display them together.

Actually I don't want to use -s option to describe output fields (like
overhead and latency) but I cannot prevent people from doing that. :(
Maybe you can skip the setup_overhead() if user gives either 'overhead'
or 'latency' explicitly - oh, you have that in the !prefer_latency case.

> 
> I used was_taken to not duplicate this non-trivial logic in both
> functions. As I mentioned in the previous replies, I tried that but it
> was messier/more complex. was_taken is a simple way to not duplicate
> logic and keep these functions consistent.

Hmm.. ok.  Maybe we can update this part later.  Can you please add a
comment in the perf_hpp__init() that says overhead and latency columns
are added to the sort list in setup_overhead() so it's added implicitly
here only if it's already taken?

> 
> 
> > > For the !prefer_latency case, we don't want to mess with
> > > overhead/latency fields if the user specified any of them explicitly.
> > > Otherwise this convenience part gets in the user's way and does not
> > > allow them to do what they want. User says "I want X" and perf says
> > > "screw you, I will give you Y instead, and won't allow you to possibly
> > > do X".
> >
> > That's what -F option does.  The -s option used to specify how to group
> > the histogram entries and it will add 'overhead' (and/or 'latency') if
> > it's not even requested.  So I think it's ok to add more output column
> > when -s option is used.
> >
> > But unfortunately, using -F and -s together is confusing and change the
> > meaning of -s option - it now says how it sort the result.
> >
> > >
> > > And see above: -F does not work with --hierarchy, so this part is unskippable.
> >
> > Yep, but I mean it fixes --hierarchy and --latency.  I'm thinking of a
> > way to support -F and --hierarchy in general.
> 
> I don't know why it was disabled. There are likely other things to
> improve, but please let's not tie that to this change.

Right, it's a separate issue.  I was afraid of mixing output fields and
sort keys in an unexpected order.  But maybe we can support that if
that's what user wants.

Thanks,
Namhyung

Re: [PATCH v5 6/8] perf report: Add --latency flag

Posted by Dmitry Vyukov 12 months ago

On Wed, 12 Feb 2025 at 20:47, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Feb 11, 2025 at 09:23:30PM +0100, Dmitry Vyukov wrote:
> > On Tue, 11 Feb 2025 at 18:42, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Tue, Feb 11, 2025 at 09:42:16AM +0100, Dmitry Vyukov wrote:
> > > > On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > >
> > > > > On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > > > > > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> > > [SNIP]
> > > > > > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > > > > > >       return 0;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > -int hpp_dimension__add_output(unsigned col)
> > > > > > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > > > > > >  {
> > > > > > > > +     struct hpp_dimension *hd;
> > > > > > > > +
> > > > > > > >       BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > > > > > -     return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > > > > > +     hd = &hpp_sort_dimensions[col];
> > > > > > > > +     if (implicit && !hd->was_taken)
> > > > > > > > +             return 0;
> > > > > > >
> > > > > > > I don't think you need these implicit and was_taken things.
> > > > > > > Just removing from the sort list when it's unregistered seems to work.
> > > > > > >
> > > > > > > ---8<---
> > > > > > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > > > > > >  static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > > > > > >  {
> > > > > > >         list_del_init(&format->list);
> > > > > > > +       list_del_init(&format->sort_list);
> > > > > > >         fmt_free(format);
> > > > > > >  }
> > > > > > >
> > > > > > > ---8<---
> > > > > >
> > > > > > It merely suppresses the warning, but does not work the same way. See
> > > > > > this for details:
> > > > > > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
> > > > >
> > > > > But I think it's better to pass --latency option rather than adding it
> > > > > to -s option.  If you really want to have specific output fields, then
> > > > > please use -F latency,sym instead.
> > > > >
> > > > > Also I've realized that it should add one sort key in setup_overhead()
> > > > > to support hierarchy mode properly.  Something like this?
> > > > >
> > > > > Thanks,
> > > > > Namhyung
> > > > >
> > > > >
> > > > > ---8<---
> > > > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > > > index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> > > > > --- a/tools/perf/util/sort.c
> > > > > +++ b/tools/perf/util/sort.c
> > > > > @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
> > > > >                 return keys;
> > > > >
> > > > >         if (symbol_conf.prefer_latency) {
> > > > > -               keys = prefix_if_not_in("overhead", keys);
> > > > > -               keys = prefix_if_not_in("latency", keys);
> > > > > -               if (symbol_conf.cumulate_callchain) {
> > > > > -                       keys = prefix_if_not_in("overhead_children", keys);
> > > > > +               if (symbol_conf.cumulate_callchain)
> > > > >                         keys = prefix_if_not_in("latency_children", keys);
> > > > > -               }
> > > > > -       } else if (!keys || (!strstr(keys, "overhead") &&
> > > > > -                       !strstr(keys, "latency"))) {
> > > > > -               if (symbol_conf.enable_latency)
> > > > > +               else
> > > > >                         keys = prefix_if_not_in("latency", keys);
> > > > > -               keys = prefix_if_not_in("overhead", keys);
> > > > > -               if (symbol_conf.cumulate_callchain) {
> > > > > -                       if (symbol_conf.enable_latency)
> > > > > -                               keys = prefix_if_not_in("latency_children", keys);
> > > > > +       } else {
> > > > > +               if (symbol_conf.cumulate_callchain)
> > > > >                         keys = prefix_if_not_in("overhead_children", keys);
> > > > > -               }
> > > > > +               else
> > > > > +                       keys = prefix_if_not_in("overhead", keys);
> > > > >         }
> > > > >
> > > > >         return keys;
> > > >
> > > >
> > > > Have I decoded the patch correctly?
> > > >
> > > >         if (symbol_conf.prefer_latency) {
> > > >                 if (symbol_conf.cumulate_callchain)
> > > >                         keys = prefix_if_not_in("latency_children", keys);
> > > >                 else
> > > >                         keys = prefix_if_not_in("latency", keys);
> > > >         } else {
> > > >                 if (symbol_conf.cumulate_callchain)
> > > >                         keys = prefix_if_not_in("overhead_children", keys);
> > > >                 else
> > > >                         keys = prefix_if_not_in("overhead", keys);
> > > >         }
> > > >
> > >
> > > Yep, that's correct.
> > >
> > >
> > > > If I decoded the patch correctly, it's not what we want.
> > > >
> > > > For the default prefer_latency case we also want to add overhead, that
> > > > was intentional for the --latency preset. It does not harm, and allows
> > > > to see/compare differences in latency and overhead.
> > > > Again, if a user wants something custom, there is no way to second
> > > > guess all possible intentions. For non-default cases, we just let the
> > > > user say what exactly they want, and we will follow that.
> > > >
> > > > "latency" should be added even if cumulate_callchain.
> > >
> > > Please note that it just sets the sort key - which column you want to
> > > order the result.  The output fields for overhead and children will be
> > > added in perf_hpp__init() if you remove the 'was_taken' logic.  So I
> > > think this change will have the same output with that.
> >
> > Yes, but perf_hpp__init() does not have the logic that's currently
> > contained in setup_overhead().
> >
> > If the user specified a "latency" field, and we don't want to add
> > "overhead" in that case, then _both_ setup_overhead() and
> > perf_hpp__init() must not add "overhead".
>
> Ok, I see your point.  But I think it'd be much easier if you allow the
> 'overhead' column in that case too.


It may be possible. However, one of my motivations that I did not
fully explain and realized myself is as follows.

You referred to fields and sort order as to different things. However,
for a person not working on perf code internals these are mostly the
same thing.

Fields are merged into sort order, sort order elements appear as fields.
Say, if I do "-F symbol,overhead", output will be sorted by symbols
first, so fields == sort. Similarly, anything specified in --sort
appears as fields in output, so sort == fields.

My current code keeps behavior for them consistent. Making it
non-consistent may be confusing for users.

For example, --hierarchy requires --sort (fails with -F), so I tend to
use it more than F (commands with --sort are already in my bash
history).
So if I want latency and symbol I tend to do:

perf report --sort latency,symbol

and with my current code this behaves the same way as:

perf report -F latency,symbol

Which matches my mental model of F/sort being the same.
However, if we remove was_taken, then the above commands behave
differently, which for end user like me is:
¯\_(ツ)_/¯


After reading the code, I think I understand the difference
(hopefully), e.g. it allows me to do a subtle things like:
-F symbol,overhead --sort overhead,symbol
(not sure how useful this is compared to always sticking to field
order for sorting)




> > If we do what you proposed, then perf_hpp__init() will still add
> > "overhead" and we go back to square 0.
>
> Right, but currently the default perf report and with --latency option,
> will show both overhead and latency columns.  That's why I thought you
> wanted to display them together.
>
> Actually I don't want to use -s option to describe output fields (like
> overhead and latency) but I cannot prevent people from doing that. :(
> Maybe you can skip the setup_overhead() if user gives either 'overhead'
> or 'latency' explicitly - oh, you have that in the !prefer_latency case.
>
> >
> > I used was_taken to not duplicate this non-trivial logic in both
> > functions. As I mentioned in the previous replies, I tried that but it
> > was messier/more complex. was_taken is a simple way to not duplicate
> > logic and keep these functions consistent.
>
> Hmm.. ok.  Maybe we can update this part later.  Can you please add a
> comment in the perf_hpp__init() that says overhead and latency columns
> are added to the sort list in setup_overhead() so it's added implicitly
> here only if it's already taken?

Added a comment capturing some of this discussion in v7


> > > > For the !prefer_latency case, we don't want to mess with
> > > > overhead/latency fields if the user specified any of them explicitly.
> > > > Otherwise this convenience part gets in the user's way and does not
> > > > allow them to do what they want. User says "I want X" and perf says
> > > > "screw you, I will give you Y instead, and won't allow you to possibly
> > > > do X".
> > >
> > > That's what -F option does.  The -s option used to specify how to group
> > > the histogram entries and it will add 'overhead' (and/or 'latency') if
> > > it's not even requested.  So I think it's ok to add more output column
> > > when -s option is used.
> > >
> > > But unfortunately, using -F and -s together is confusing and change the
> > > meaning of -s option - it now says how it sort the result.
> > >
> > > >
> > > > And see above: -F does not work with --hierarchy, so this part is unskippable.
> > >
> > > Yep, but I mean it fixes --hierarchy and --latency.  I'm thinking of a
> > > way to support -F and --hierarchy in general.
> >
> > I don't know why it was disabled. There are likely other things to
> > improve, but please let's not tie that to this change.
>
> Right, it's a separate issue.  I was afraid of mixing output fields and
> sort keys in an unexpected order.  But maybe we can support that if
> that's what user wants.
>
> Thanks,
> Namhyung
>