From nobody Fri Feb 21 23:32:33 2025
Received: from mail-ed1-f73.google.com (mail-ed1-f73.google.com
 [209.85.208.73])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4AD501FDE04
	for <linux-kernel@vger.kernel.org>; Mon, 27 Jan 2025 09:59:33 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.208.73
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1737971975; cv=none;
 b=cNozydSd8woJdSR4YPB6a95gB6vFGIQIAFPE3EFyPqb/cYHhwBWl5w5BvBnVa9WsKikXKOemOWA4C3MM+3aUZ2CPSHww1zOCBg05+GhXLNplAF6gGiK+1yr0zOAf5uDx9HlZCvPJwPiAs8fOsdjF88IF5YTCU+l+WVRezwgKb5c=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1737971975; c=relaxed/simple;
	bh=GggeSPiR8YbgLXP8wp9zrKxhR5MXY84YH0YwALVqOGA=;
	h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From:
	 To:Cc:Content-Type;
 b=mNYs89wHbGH+vuScnpyIjzXaRjgEU2LSE2aUlpFjrW5Oen+9khpUIUAm38vcW3YJ5MoA9vqfKreY2gNfdSzPyounseil0gYFDs2XTCcRQm4+JAHs70HBZ4lj+jFNrEoS0zomzQPlILwJJdT+mR/fKb3GW8m0T+X6/ZkbL3E3KY4=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com;
 spf=pass smtp.mailfrom=flex--dvyukov.bounces.google.com;
 dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b=s0yyDJ/O; arc=none smtp.client-ip=209.85.208.73
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=flex--dvyukov.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b="s0yyDJ/O"
Received: by mail-ed1-f73.google.com with SMTP id
 4fb4d7f45d1cf-5d3d9d6293fso3819035a12.0
        for <linux-kernel@vger.kernel.org>;
 Mon, 27 Jan 2025 01:59:33 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1737971972; x=1738576772;
 darn=vger.kernel.org;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:from:to:cc:subject:date:message-id:reply-to;
        bh=bL5cvZ53bMgUb5o74DP+LHGdAWU/IsbnP4gd/RMdofU=;
        b=s0yyDJ/O+ZAfts+n6mGQ+Uo5AFXsRvjXfgu33WTdAZPHR7uYCeWq9XygxC35P2LZfp
         fLEJazCB2wMbZta+NPgctS0BTdCfgWyNUA4CrwiOLyCwSH0b1+ZM0zAyWqvd4HV8EaX7
         nDs0Sf7V0Ip/g5c7L7dUQaCTbBkAhL+LlvuhfIhDhGEM1/46QdtgH3k15Cx6LDp3ij/d
         2+y/FTgkyPyN6JeAyztuwl36x94aXa/QfWdulG7DJaSlgXryJKNKHXgS6R5Is7oopmF7
         WGrRLRQBTwWNefMXq6pUi4guI9kDPjFyPhVlzLWYausM8ucHOYfH5ohyL1oQ2OGvgTtX
         1APw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1737971972; x=1738576772;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=bL5cvZ53bMgUb5o74DP+LHGdAWU/IsbnP4gd/RMdofU=;
        b=AqVJknoP/g3d49UiTdFAjg328m/Ici6zwJpp3TyEahU+yNkO9i8Aqmst1KKHRr/1Lz
         dcNCJNKmgWvrHTKp996WaQVyTOAcVZSS39LXIY9qD1nwK5bszVWufFqv49k/nCYHhHoW
         rvFZKT5q+xWNYYeG8gOVNkJD8WiCkRsw6/RG1XpMKSCkb9bahQfirAFqCC1ADD6M9nIJ
         bUt28gYxSw3/WLXq41ID6/6U7ye9CTcjuB7Om1zgKQgLcrjRgCmNw7wUV7QR6bwwnmA3
         8V2RzQvHiZJYWphh5bms6kPmZZBvzwd5uqRiuDcSs4nnglrP9AFCJubeHaCKiL+aexWR
         C7Ig==
X-Forwarded-Encrypted: i=1;
 AJvYcCXO1OS1bgCHCDboA7am0gPVqv7+83iBizeXyVfFjjfRLAKCShXxKezGiAOo41Tkh/a7yUIjx6Nw3tj00xY=@vger.kernel.org
X-Gm-Message-State: AOJu0YwI9lTcEaKOqNZmfwmYxYNQEcpivUx/xmR6nhxJ/TeDrOqLkjMv
	FQZ/2xPrbxh38GR/kfntKQufjexJY+B5BKHSGCb7iblLiqWqps8Bcc9/FFCbqd2aah0niULx0GO
	EPvie7A==
X-Google-Smtp-Source: 
 AGHT+IEGmBHSZs/A63I7ECXvqOSG/Yu96vP2p77HqP1FtAxQ6jqD1LDPSViEIMqS3DYIXjKGTHKNaXRMzBlM
X-Received: from edc7.prod.google.com ([2002:a05:6402:4607:b0:5d0:e889:c403])
 (user=dvyukov job=prod-delivery.src-stubby-dispatcher) by
 2002:a05:6402:210c:b0:5d9:cde9:2b7f
 with SMTP id 4fb4d7f45d1cf-5db7d3550d5mr33152266a12.23.1737971971687; Mon, 27
 Jan 2025 01:59:31 -0800 (PST)
Date: Mon, 27 Jan 2025 10:58:54 +0100
In-Reply-To: <cover.1737971364.git.dvyukov@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <cover.1737971364.git.dvyukov@google.com>
X-Mailer: git-send-email 2.48.1.262.g85cc9f2d1e-goog
Message-ID: 
 <caaef05b295f604d339d22d7836b27942cec1aec.1737971364.git.dvyukov@google.com>
Subject: [PATCH v3 7/7] perf report: Add latency and parallelism profiling
 documentation
From: Dmitry Vyukov <dvyukov@google.com>
To: namhyung@kernel.org, irogers@google.com
Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Dmitry Vyukov <dvyukov@google.com>,
 Arnaldo Carvalho de Melo <acme@kernel.org>
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Describe latency and parallelism profiling, related flags, and differences
with the currently only supported CPU-consumption-centric profiling.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 .../callchain-overhead-calculation.txt        |  5 +-
 .../cpu-and-latency-overheads.txt             | 85 +++++++++++++++++++
 tools/perf/Documentation/perf-report.txt      | 49 +++++++----
 tools/perf/Documentation/tips.txt             |  3 +
 4 files changed, 123 insertions(+), 19 deletions(-)

diff --git a/tools/perf/Documentation/callchain-overhead-calculation.txt b/=
tools/perf/Documentation/callchain-overhead-calculation.txt
index 1a757927195ed..e0202bf5bd1a0 100644
--- a/tools/perf/Documentation/callchain-overhead-calculation.txt
+++ b/tools/perf/Documentation/callchain-overhead-calculation.txt
@@ -1,7 +1,8 @@
 Overhead calculation
 --------------------
-The overhead can be shown in two columns as 'Children' and 'Self' when
-perf collects callchains.  The 'self' overhead is simply calculated by
+The CPU overhead can be shown in two columns as 'Children' and 'Self'
+when perf collects callchains (and corresponding 'Wall' columns for
+wall-clock overhead).  The 'self' overhead is simply calculated by
 adding all period values of the entry - usually a function (symbol).
 This is the value that perf shows traditionally and sum of all the
 'self' overhead values should be 100%.
diff --git a/tools/perf/Documentation/cpu-and-latency-overheads.txt b/tools=
/perf/Documentation/cpu-and-latency-overheads.txt
new file mode 100644
index 0000000000000..3b6d637054651
--- /dev/null
+++ b/tools/perf/Documentation/cpu-and-latency-overheads.txt
@@ -0,0 +1,85 @@
+CPU and latency overheads
+-------------------------
+There are two notions of time: wall-clock time and CPU time.
+For a single-threaded program, or a program running on a single-core machi=
ne,
+these notions are the same. However, for a multi-threaded/multi-process pr=
ogram
+running on a multi-core machine, these notions are significantly different.
+Each second of wall-clock time we have number-of-cores seconds of CPU time.
+Perf can measure overhead for both of these times (shown in 'overhead' and
+'latency' columns for CPU and wall-clock time correspondingly).
+
+Optimizing CPU overhead is useful to improve 'throughput', while optimizing
+latency overhead is useful to improve 'latency'. It's important to underst=
and
+which one is useful in a concrete situation at hand. For example, the form=
er
+may be useful to improve max throughput of a CI build server that runs on =
100%
+CPU utilization, while the latter may be useful to improve user-perceived
+latency of a single interactive program build.
+These overheads may be significantly different in some cases. For example,
+consider a program that executes function 'foo' for 9 seconds with 1 threa=
d,
+and then executes function 'bar' for 1 second with 128 threads (consumes
+128 seconds of CPU time). The CPU overhead is: 'foo' - 6.6%, 'bar' - 93.4%.
+While the latency overhead is: 'foo' - 90%, 'bar' - 10%. If we try to opti=
mize
+running time of the program looking at the (wrong in this case) CPU overhe=
ad,
+we would concentrate on the function 'bar', but it can yield only 10% runn=
ing
+time improvement at best.
+
+By default, perf shows only CPU overhead. To show latency overhead, use
+'perf record --latency' and 'perf report':
+
+-----------------------------------
+Overhead  Latency  Command
+  93.88%   25.79%  cc1
+   1.90%   39.87%  gzip
+   0.99%   10.16%  dpkg-deb
+   0.57%    1.00%  as
+   0.40%    0.46%  sh
+-----------------------------------
+
+To sort by latency overhead, use 'perf report --latency':
+
+-----------------------------------
+Latency  Overhead  Command
+ 39.87%     1.90%  gzip
+ 25.79%    93.88%  cc1
+ 10.16%     0.99%  dpkg-deb
+  4.17%     0.29%  git
+  2.81%     0.11%  objtool
+-----------------------------------
+
+To get insight into the difference between the overheads, you may check
+parallelization histogram with '--sort=3Dlatency,parallelism,comm,symbol -=
-hierarchy'
+flags. It shows fraction of (wall-clock) time the workload utilizes differ=
ent
+numbers of cores ('Parallelism' column). For example, in the following case
+the workload utilizes only 1 core most of the time, but also has some
+highly-parallel phases, which explains significant difference between
+CPU and wall-clock overheads:
+
+-----------------------------------
+  Latency  Overhead     Parallelism / Command / Symbol
++  56.98%     2.29%     1
++  16.94%     1.36%     2
++   4.00%    20.13%     125
++   3.66%    18.25%     124
++   3.48%    17.66%     126
++   3.26%     0.39%     3
++   2.61%    12.93%     123
+-----------------------------------
+
+By expanding corresponding lines, you may see what commands/functions run
+at the given parallelism level:
+
+-----------------------------------
+  Latency  Overhead     Parallelism / Command / Symbol
+-  56.98%     2.29%     1
+      32.80%     1.32%     gzip
+       4.46%     0.18%     cc1
+       2.81%     0.11%     objtool
+       2.43%     0.10%     dpkg-source
+       2.22%     0.09%     ld
+       2.10%     0.08%     dpkg-genchanges
+-----------------------------------
+
+To see the normal function-level profile for particular parallelism levels
+(number of threads actively running on CPUs), you may use '--parallelism'
+filter. For example, to see the profile only for low parallelism phases
+of a workload use '--latency --parallelism=3D1-2' flags.
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Document=
ation/perf-report.txt
index 87f8645194062..7e0ba990d71e8 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -44,7 +44,7 @@ OPTIONS
 --comms=3D::
 	Only consider symbols in these comms. CSV that understands
 	file://filename entries.  This option will affect the percentage of
-	the overhead column.  See --percentage for more info.
+	the overhead and latency columns.  See --percentage for more info.
 --pid=3D::
         Only show events for given process ID (comma separated list).
=20
@@ -54,12 +54,12 @@ OPTIONS
 --dsos=3D::
 	Only consider symbols in these dsos. CSV that understands
 	file://filename entries.  This option will affect the percentage of
-	the overhead column.  See --percentage for more info.
+	the overhead and latency columns.  See --percentage for more info.
 -S::
 --symbols=3D::
 	Only consider these symbols. CSV that understands
 	file://filename entries.  This option will affect the percentage of
-	the overhead column.  See --percentage for more info.
+	the overhead and latency columns.  See --percentage for more info.
=20
 --symbol-filter=3D::
 	Only show symbols that match (partially) with this filter.
@@ -68,6 +68,16 @@ OPTIONS
 --hide-unresolved::
         Only display entries resolved to a symbol.
=20
+--parallelism::
+        Only consider these parallelism levels. Parallelism level is the n=
umber
+        of threads that actively run on CPUs at the time of sample. The fl=
ag
+        accepts single number, comma-separated list, and ranges (for examp=
le:
+        "1", "7,8", "1,64-128"). This is useful in understanding what a pr=
ogram
+        is doing during sequential/low-parallelism phases as compared to
+        high-parallelism phases. This option will affect the percentage of
+        the overhead and latency columns. See --percentage for more info.
+        Also see the `CPU and latency overheads' section for more details.
+
 -s::
 --sort=3D::
 	Sort histogram entries by given key(s) - multiple keys can be specified
@@ -87,6 +97,7 @@ OPTIONS
 	entries are displayed as "[other]".
 	- cpu: cpu number the task ran at the time of sample
 	- socket: processor socket number the task ran at the time of sample
+	- parallelism: number of running threads at the time of sample
 	- srcline: filename and line number executed at the time of sample.  The
 	DWARF debugging info must be provided.
 	- srcfile: file name of the source file of the samples. Requires dwarf
@@ -97,12 +108,14 @@ OPTIONS
 	- cgroup_id: ID derived from cgroup namespace device and inode numbers.
 	- cgroup: cgroup pathname in the cgroupfs.
 	- transaction: Transaction abort flags.
-	- overhead: Overhead percentage of sample
-	- overhead_sys: Overhead percentage of sample running in system mode
-	- overhead_us: Overhead percentage of sample running in user mode
-	- overhead_guest_sys: Overhead percentage of sample running in system mode
+	- overhead: CPU overhead percentage of sample.
+	- latency: latency (wall-clock) overhead percentage of sample.
+	  See the `CPU and latency overheads' section for more details.
+	- overhead_sys: CPU overhead percentage of sample running in system mode
+	- overhead_us: CPU overhead percentage of sample running in user mode
+	- overhead_guest_sys: CPU overhead percentage of sample running in system=
 mode
 	on guest machine
-	- overhead_guest_us: Overhead percentage of sample running in user mode on
+	- overhead_guest_us: CPU overhead percentage of sample running in user mo=
de on
 	guest machine
 	- sample: Number of sample
 	- period: Raw number of event count of sample
@@ -125,8 +138,8 @@ OPTIONS
 	- weight2: Average value of event specific weight (2nd field of weight_st=
ruct).
 	- weight3: Average value of event specific weight (3rd field of weight_st=
ruct).
=20
-	By default, comm, dso and symbol keys are used.
-	(i.e. --sort comm,dso,symbol)
+	By default, overhead, comm, dso and symbol keys are used.
+	(i.e. --sort overhead,comm,dso,symbol).
=20
 	If --branch-stack option is used, following sort keys are also
 	available:
@@ -201,9 +214,9 @@ OPTIONS
 --fields=3D::
 	Specify output field - multiple keys can be specified in CSV format.
 	Following fields are available:
-	overhead, overhead_sys, overhead_us, overhead_children, sample, period,
-	weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.  The
-	last 3 names are alias for the corresponding weights.  When the weight
+	overhead, latency, overhead_sys, overhead_us, overhead_children, sample,
+	period, weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.
+	The last 3 names are alias for the corresponding weights.  When the weight
 	fields are used, they will show the average value of the weight.
=20
 	Also it can contain any sort key(s).
@@ -289,7 +302,7 @@ OPTIONS
 	Accumulate callchain of children to parent entry so that then can
 	show up in the output.  The output will have a new "Children" column
 	and will be sorted on the data.  It requires callchains are recorded.
-	See the `overhead calculation' section for more details. Enabled by
+	See the `Overhead calculation' section for more details. Enabled by
 	default, disable with --no-children.
=20
 --max-stack::
@@ -442,9 +455,9 @@ OPTIONS
 	--call-graph option for details.
=20
 --percentage::
-	Determine how to display the overhead percentage of filtered entries.
-	Filters can be applied by --comms, --dsos and/or --symbols options and
-	Zoom operations on the TUI (thread, dso, etc).
+	Determine how to display the CPU and latency overhead percentage
+	of filtered entries. Filters can be applied by --comms, --dsos, --symbols
+	and/or --parallelism options and Zoom operations on the TUI (thread, dso,=
 etc).
=20
 	"relative" means it's relative to filtered entries only so that the
 	sum of shown entries will be always 100%.  "absolute" means it retains
@@ -627,6 +640,8 @@ include::itrace.txt[]
 --skip-empty::
 	Do not print 0 results in the --stat output.
=20
+include::cpu-and-latency-overheads.txt[]
+
 include::callchain-overhead-calculation.txt[]
=20
 SEE ALSO
diff --git a/tools/perf/Documentation/tips.txt b/tools/perf/Documentation/t=
ips.txt
index 67b326ba00407..f6f71e70ff2cb 100644
--- a/tools/perf/Documentation/tips.txt
+++ b/tools/perf/Documentation/tips.txt
@@ -62,3 +62,6 @@ To show context switches in perf report sample context ad=
d --switch-events to pe
 To show time in nanoseconds in record/report add --ns
 To compare hot regions in two workloads use perf record -b -o file ... ; p=
erf diff --stream file1 file2
 To compare scalability of two workload samples use perf diff -c ratio file=
1 file2
+For latency profiling, try: perf record/report --latency
+For parallelism histogram, try: perf report --hierarchy --sort latency,par=
allelism,comm,symbol
+To analyze particular parallelism levels, try: perf report --latency --par=
allelism=3D32-64
--=20
2.48.1.262.g85cc9f2d1e-goog