[v3] perf tool_pmu: Support enable/disable for tool PMU events

[PATCH v3 0/2] perf tool_pmu: Support enable/disable for tool PMU events

Posted by Ian Rogers 6 days, 5 hours ago

A regression in perf stat was reported where tool PMU events (like
duration_time used in CPUs_utilized metric) incorrectly included the
delay period when using the delay option (-D).

This series fixes the regression by making tool PMU events
(duration_time, user_time, system_time) behave more like regular
counters by implementing proper enable and disable support. They now
correctly accumulate values only when enabled.

The first patch implements the core enable/disable support for tool PMU
events, and the second patch adds a shell test to verify that
duration_time correctly excludes the delay period.

Changes in v3:
- Refine group handling: only manually enable/disable group members when
  the leader or member is a non-perf-event open PMU, as the kernel allows
  grouping of software and hardware PMUs.
- Fix a file descriptor leak in evsel__tool_pmu_open() on error paths by
  explicitly closing the successfully opened fd before exiting.
- Synchronize the 'disabled' flag for all group members in enable/disable
  paths (both per-CPU and batch loops) to prevent stale disabled flags.
- Add explicit early exits to evsel__tool_pmu_enable() and disable() based
  on evsel->disabled to protect internal metric state.
- Add an upper bound check in test_stat_delay to verify that the delay was
  actually excluded.

Changes in v2:
- Implement evsel__tool_pmu_enable() and evsel__tool_pmu_disable() to
  avoid ioctl failures in batch evsel__enable() and evsel__disable()
  functions.
- Correctly iterate and enable/disable tool PMU events configured as
  non-leader members of event groups.
- Correct the lseek() arguments order in the read_stat helper:
  lseek(fd, 0, SEEK_SET) instead of lseek(fd, SEEK_SET, 0).
- Introduce INVALID_START_TIME (~0ULL) to prevent erroneous large delta
  accumulation in evsel__tool_pmu_read() if /proc/<pid>/stat fails to
  read in enable_cpu (e.g., process exited).
- Improve test parsing to use LC_ALL=C and cut to be robust against
  different locales, and use awk to dynamically compare duration_time to
  time elapsed with a 200ms tolerance (avoiding loaded CI false failures).
  Also added a lower-bound check.
- Fix style warnings from checkpatch.pl (line lengths, braces, and blank
  lines).

Ian Rogers (2):
  perf tool_pmu: Make tool PMU events respect enable/disable
  perf tests: Add test for stat delay option with duration_time

 tools/perf/tests/shell/stat.sh |  53 +++++++
 tools/perf/util/evlist.c       |  10 +-
 tools/perf/util/evsel.c        | 196 +++++++++++++++++++------
 tools/perf/util/evsel.h        |  15 +-
 tools/perf/util/tool_pmu.c     | 253 ++++++++++++++++++++++++++-------
 tools/perf/util/tool_pmu.h     |   4 +
 6 files changed, 432 insertions(+), 99 deletions(-)

-- 
2.54.0.631.ge1b05301d1-goog

[PATCH v4 0/2] perf tool_pmu: Support enable/disable for tool PMU events

Posted by Ian Rogers 6 days, 2 hours ago

A regression in perf stat was reported where tool PMU events (like
duration_time used in CPUs_utilized metric) incorrectly included the
delay period when using the delay option (-D).

This series fixes the regression by making tool PMU events
(duration_time, user_time, system_time) behave more like regular
counters by implementing proper enable and disable support. They now
correctly accumulate values only when enabled.

The first patch implements the core enable/disable support for tool PMU
events, and the second patch adds a shell test to verify that
duration_time correctly excludes the delay period.

Changes in v4:
- Update evsel->disabled immediately after the leader's own enable/disable
  succeeds in evsel__enable() / evsel__disable(), preventing state
  inconsistency on early return if a member fails.
- Remove evsel->disabled state changes from within the CPU loops in
  evsel__tool_pmu_enable_cpu() and evsel__tool_pmu_disable_cpu() to
  prevent __evlist__disable() from skipping disable_cpu calls for other
  CPUs.
- Make *start_time = INVALID_START_TIME unconditional in
  evsel__tool_pmu_disable_cpu() to ensure safe inactive state invalidation.
- Address a checkpatch warning regarding unnecessary braces for a single
  statement if block.

Changes in v3:
- Refine group handling: only manually enable/disable group members when
  the leader or member is a non-perf-event open PMU, as the kernel allows
  grouping of software and hardware PMUs.
- Fix a file descriptor leak in evsel__tool_pmu_open() on error paths by
  explicitly closing the successfully opened fd before exiting.
- Synchronize the 'disabled' flag for all group members in enable/disable
  paths (both per-CPU and batch loops) to prevent stale disabled flags.
- Add explicit early exits to evsel__tool_pmu_enable() and disable() based
  on evsel->disabled to protect internal metric state.
- Add an upper bound check in test_stat_delay to verify that the delay was
  actually excluded.

Changes in v2:
- Implement evsel__tool_pmu_enable() and evsel__tool_pmu_disable() to
  avoid ioctl failures in batch evsel__enable() and evsel__disable()
  functions.
- Correctly iterate and enable/disable tool PMU events configured as
  non-leader members of event groups.
- Correct the lseek() arguments order in the read_stat helper:
  lseek(fd, 0, SEEK_SET) instead of lseek(fd, SEEK_SET, 0).
- Introduce INVALID_START_TIME (~0ULL) to prevent erroneous large delta
  accumulation in evsel__tool_pmu_read() if /proc/<pid>/stat fails to
  read in enable_cpu (e.g., process exited).
- Improve test parsing to use LC_ALL=C and cut to be robust against
  different locales, and use awk to dynamically compare duration_time to
  time elapsed with a 200ms tolerance (avoiding loaded CI false failures).
  Also added a lower-bound check.
- Fix style warnings from checkpatch.pl (line lengths, braces, and blank
  lines).

Ian Rogers (2):
  perf tool_pmu: Make tool PMU events respect enable/disable
  perf tests: Add test for stat delay option with duration_time

 tools/perf/tests/shell/stat.sh |  53 +++++++
 tools/perf/util/evlist.c       |  10 +-
 tools/perf/util/evsel.c        | 197 ++++++++++++++++++++------
 tools/perf/util/evsel.h        |  15 +-
 tools/perf/util/tool_pmu.c     | 250 ++++++++++++++++++++++++++-------
 tools/perf/util/tool_pmu.h     |   4 +
 6 files changed, 430 insertions(+), 99 deletions(-)

-- 
2.54.0.631.ge1b05301d1-goog

Re: [PATCH v4 0/2] perf tool_pmu: Support enable/disable for tool PMU events

Posted by Ian Rogers 2 days, 5 hours ago

On Mon, May 18, 2026 at 6:41 PM Ian Rogers <irogers@google.com> wrote:
>
> A regression in perf stat was reported where tool PMU events (like
> duration_time used in CPUs_utilized metric) incorrectly included the
> delay period when using the delay option (-D).
>
> This series fixes the regression by making tool PMU events
> (duration_time, user_time, system_time) behave more like regular
> counters by implementing proper enable and disable support. They now
> correctly accumulate values only when enabled.
>
> The first patch implements the core enable/disable support for tool PMU
> events, and the second patch adds a shell test to verify that
> duration_time correctly excludes the delay period.

Would be nice to get this reported bug fix landed.

Thanks,
Ian

> Changes in v4:
> - Update evsel->disabled immediately after the leader's own enable/disable
>   succeeds in evsel__enable() / evsel__disable(), preventing state
>   inconsistency on early return if a member fails.
> - Remove evsel->disabled state changes from within the CPU loops in
>   evsel__tool_pmu_enable_cpu() and evsel__tool_pmu_disable_cpu() to
>   prevent __evlist__disable() from skipping disable_cpu calls for other
>   CPUs.
> - Make *start_time = INVALID_START_TIME unconditional in
>   evsel__tool_pmu_disable_cpu() to ensure safe inactive state invalidation.
> - Address a checkpatch warning regarding unnecessary braces for a single
>   statement if block.
>
> Changes in v3:
> - Refine group handling: only manually enable/disable group members when
>   the leader or member is a non-perf-event open PMU, as the kernel allows
>   grouping of software and hardware PMUs.
> - Fix a file descriptor leak in evsel__tool_pmu_open() on error paths by
>   explicitly closing the successfully opened fd before exiting.
> - Synchronize the 'disabled' flag for all group members in enable/disable
>   paths (both per-CPU and batch loops) to prevent stale disabled flags.
> - Add explicit early exits to evsel__tool_pmu_enable() and disable() based
>   on evsel->disabled to protect internal metric state.
> - Add an upper bound check in test_stat_delay to verify that the delay was
>   actually excluded.
>
> Changes in v2:
> - Implement evsel__tool_pmu_enable() and evsel__tool_pmu_disable() to
>   avoid ioctl failures in batch evsel__enable() and evsel__disable()
>   functions.
> - Correctly iterate and enable/disable tool PMU events configured as
>   non-leader members of event groups.
> - Correct the lseek() arguments order in the read_stat helper:
>   lseek(fd, 0, SEEK_SET) instead of lseek(fd, SEEK_SET, 0).
> - Introduce INVALID_START_TIME (~0ULL) to prevent erroneous large delta
>   accumulation in evsel__tool_pmu_read() if /proc/<pid>/stat fails to
>   read in enable_cpu (e.g., process exited).
> - Improve test parsing to use LC_ALL=C and cut to be robust against
>   different locales, and use awk to dynamically compare duration_time to
>   time elapsed with a 200ms tolerance (avoiding loaded CI false failures).
>   Also added a lower-bound check.
> - Fix style warnings from checkpatch.pl (line lengths, braces, and blank
>   lines).
>
> Ian Rogers (2):
>   perf tool_pmu: Make tool PMU events respect enable/disable
>   perf tests: Add test for stat delay option with duration_time
>
>  tools/perf/tests/shell/stat.sh |  53 +++++++
>  tools/perf/util/evlist.c       |  10 +-
>  tools/perf/util/evsel.c        | 197 ++++++++++++++++++++------
>  tools/perf/util/evsel.h        |  15 +-
>  tools/perf/util/tool_pmu.c     | 250 ++++++++++++++++++++++++++-------
>  tools/perf/util/tool_pmu.h     |   4 +
>  6 files changed, 430 insertions(+), 99 deletions(-)
>
> --
> 2.54.0.631.ge1b05301d1-goog
>

[PATCH v4 1/2] perf tool_pmu: Make tool PMU events respect enable/disable

Posted by Ian Rogers 6 days, 2 hours ago

Tool PMU events (duration_time, user_time, system_time) currently
count from when the event is opened to when it is read. This causes
issues with features like the delay option (-D) or control fd, where
events are opened but should not start counting immediately.

Make these events behave more like regular counters by implementing
proper enable and disable support. Add accumulated_time to struct
evsel to track time while enabled, and implement enable/disable CPU
callbacks to start/stop counting.

Also generalize userspace PMU mixed group handling. Userspace synthetic
PMUs (type > PERF_PMU_TYPE_PE_END) do not have kernel implementations and
cannot be grouped in the kernel (opened with group_fd = -1), and are
skipped by kernel enable/disable calls. Iterate over group members in
userspace and manually enable/disable any members if the leader or the
member is a non-perf-event open PMU, and synchronize their disabled flags.

Fixes: b71f46a6a708 ("perf stat: Remove hard coded shadow metrics")
Reported-by: Francesco Nigro <nigro.fra@gmail.com>
Closes: https://lore.kernel.org/linux-perf-users/20260517093650.2540920-1-nigro.fra@gmail.com/
Signed-off-by: Ian Rogers <irogers@google.com>
Assisted-by: Antigravity:gemini-3-flash
---
 tools/perf/util/evlist.c   |  10 +-
 tools/perf/util/evsel.c    | 197 ++++++++++++++++++++++-------
 tools/perf/util/evsel.h    |  15 ++-
 tools/perf/util/tool_pmu.c | 250 +++++++++++++++++++++++++++++--------
 tools/perf/util/tool_pmu.h |   4 +
 5 files changed, 377 insertions(+), 99 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index ee971d15b3c6..1a238b245b3a 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -529,7 +529,7 @@ static int evlist__is_enabled(struct evlist *evlist)
 
 static void __evlist__disable(struct evlist *evlist, char *evsel_name, bool excl_dummy)
 {
-	struct evsel *pos;
+	struct evsel *pos, *member;
 	struct evlist_cpu_iterator evlist_cpu_itr;
 	bool has_imm = false;
 
@@ -561,6 +561,9 @@ static void __evlist__disable(struct evlist *evlist, char *evsel_name, bool excl
 		if (excl_dummy && evsel__is_dummy_event(pos))
 			continue;
 		pos->disabled = true;
+
+		for_each_group_member(member, pos)
+			member->disabled = true;
 	}
 
 	/*
@@ -590,7 +593,7 @@ void evlist__disable_evsel(struct evlist *evlist, char *evsel_name)
 
 static void __evlist__enable(struct evlist *evlist, char *evsel_name, bool excl_dummy)
 {
-	struct evsel *pos;
+	struct evsel *pos, *member;
 	struct evlist_cpu_iterator evlist_cpu_itr;
 
 	evlist__for_each_cpu(evlist_cpu_itr, evlist) {
@@ -611,6 +614,9 @@ static void __evlist__enable(struct evlist *evlist, char *evsel_name, bool excl_
 		if (excl_dummy && evsel__is_dummy_event(pos))
 			continue;
 		pos->disabled = false;
+
+		for_each_group_member(member, pos)
+			member->disabled = false;
 	}
 
 	/*
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 2ee87fd84d3e..8a80d2e15f5c 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -11,68 +11,71 @@
  */
 #define __SANE_USERSPACE_TYPES__
 
-#include <byteswap.h>
+#include "evsel.h"
+
 #include <errno.h>
 #include <inttypes.h>
+#include <stdlib.h>
+
+#include <dirent.h>
 #include <linux/bitops.h>
-#include <api/fs/fs.h>
-#include <api/fs/tracing_path.h>
-#include <linux/hw_breakpoint.h>
-#include <linux/perf_event.h>
 #include <linux/compiler.h>
+#include <linux/ctype.h>
 #include <linux/err.h>
+#include <linux/hw_breakpoint.h>
+#include <linux/perf_event.h>
 #include <linux/zalloc.h>
 #include <sys/ioctl.h>
 #include <sys/resource.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
-#include <dirent.h>
-#include <stdlib.h>
+
+#include <api/fs/fs.h>
+#include <api/fs/tracing_path.h>
+#include <byteswap.h>
+#include <internal/lib.h>
+#include <internal/threadmap.h>
+#include <internal/xyarray.h>
+#include <perf/cpumap.h>
 #include <perf/evsel.h>
+
+#include "../perf-sys.h"
 #include "asm/bug.h"
+#include "bpf-filter.h"
 #include "bpf_counter.h"
 #include "callchain.h"
 #include "cgroup.h"
 #include "counts.h"
+#include "debug.h"
+#include "drm_pmu.h"
 #include "dwarf-regs.h"
+#include "env.h"
 #include "event.h"
-#include "evsel.h"
-#include "time-utils.h"
-#include "util/env.h"
-#include "util/evsel_config.h"
-#include "util/evsel_fprintf.h"
 #include "evlist.h"
-#include <perf/cpumap.h>
-#include "thread_map.h"
-#include "target.h"
+#include "evsel_config.h"
+#include "evsel_fprintf.h"
+#include "hashmap.h"
+#include "hist.h"
+#include "hwmon_pmu.h"
+#include "intel-tpebs.h"
+#include "memswap.h"
+#include "off_cpu.h"
+#include "parse-branch-options.h"
 #include "perf_regs.h"
+#include "pmu.h"
+#include "pmus.h"
 #include "record.h"
-#include "debug.h"
-#include "trace-event.h"
+#include "rlimit.h"
 #include "session.h"
 #include "stat.h"
 #include "string2.h"
-#include "memswap.h"
-#include "util.h"
-#include "util/hashmap.h"
-#include "off_cpu.h"
-#include "pmu.h"
-#include "pmus.h"
-#include "drm_pmu.h"
-#include "hwmon_pmu.h"
+#include "target.h"
+#include "thread_map.h"
+#include "time-utils.h"
 #include "tool_pmu.h"
 #include "tp_pmu.h"
-#include "rlimit.h"
-#include "../perf-sys.h"
-#include "util/parse-branch-options.h"
-#include "util/bpf-filter.h"
-#include "util/hist.h"
-#include <internal/xyarray.h>
-#include <internal/lib.h>
-#include <internal/threadmap.h>
-#include "util/intel-tpebs.h"
-
-#include <linux/ctype.h>
+#include "trace-event.h"
+#include "util.h"
 
 #ifdef HAVE_LIBTRACEEVENT
 #include <event-parse.h>
@@ -1795,27 +1798,114 @@ int evsel__append_addr_filter(struct evsel *evsel, const char *filter)
 /* Caller has to clear disabled after going through all CPUs. */
 int evsel__enable_cpu(struct evsel *evsel, int cpu_map_idx)
 {
-	return perf_evsel__enable_cpu(&evsel->core, cpu_map_idx);
+	int err;
+
+	if (evsel__is_tool(evsel))
+		err = evsel__tool_pmu_enable_cpu(evsel, cpu_map_idx);
+	else
+		err = perf_evsel__enable_cpu(&evsel->core, cpu_map_idx);
+
+	if (!err && evsel__is_group_leader(evsel)) {
+		struct evsel *member;
+
+		for_each_group_member(member, evsel) {
+			if (evsel__is_non_perf_event_open_pmu(evsel) ||
+			    evsel__is_non_perf_event_open_pmu(member)) {
+				/*
+				 * In a mixed PMU group, userspace PMUs are not
+				 * grouped in the kernel (opened with group_fd = -1)
+				 * and are skipped by the kernel when enabling the
+				 * group leader. We must manually enable them in
+				 * userspace.
+				 */
+				int mem_err = evsel__enable_cpu(member, cpu_map_idx);
+
+				if (mem_err)
+					return mem_err;
+			}
+		}
+	}
+	return err;
 }
 
 int evsel__enable(struct evsel *evsel)
 {
-	int err = perf_evsel__enable(&evsel->core);
+	int err;
+
+	if (evsel__is_tool(evsel))
+		err = evsel__tool_pmu_enable(evsel);
+	else
+		err = perf_evsel__enable(&evsel->core);
 
 	if (!err)
 		evsel->disabled = false;
+
+	if (!err && evsel__is_group_leader(evsel)) {
+		struct evsel *member;
+
+		for_each_group_member(member, evsel) {
+			if (evsel__is_non_perf_event_open_pmu(evsel) ||
+			    evsel__is_non_perf_event_open_pmu(member)) {
+				/*
+				 * In a mixed PMU group, userspace PMUs are not
+				 * grouped in the kernel (opened with group_fd = -1)
+				 * and are skipped by the kernel when enabling the
+				 * group leader. We must manually enable them in
+				 * userspace.
+				 */
+				int mem_err = evsel__enable(member);
+
+				if (mem_err)
+					return mem_err;
+			}
+			member->disabled = false;
+		}
+	}
+
 	return err;
 }
 
 /* Caller has to set disabled after going through all CPUs. */
 int evsel__disable_cpu(struct evsel *evsel, int cpu_map_idx)
 {
-	return perf_evsel__disable_cpu(&evsel->core, cpu_map_idx);
+	int err;
+
+	if (evsel__is_tool(evsel))
+		err = evsel__tool_pmu_disable_cpu(evsel, cpu_map_idx);
+	else
+		err = perf_evsel__disable_cpu(&evsel->core, cpu_map_idx);
+
+	if (!err && evsel__is_group_leader(evsel)) {
+		struct evsel *member;
+
+		for_each_group_member(member, evsel) {
+			if (evsel__is_non_perf_event_open_pmu(evsel) ||
+			    evsel__is_non_perf_event_open_pmu(member)) {
+				/*
+				 * In a mixed PMU group, userspace PMUs are not
+				 * grouped in the kernel and are skipped by the
+				 * kernel when disabling the group leader. We must
+				 * manually disable them in userspace.
+				 */
+				int mem_err = evsel__disable_cpu(member, cpu_map_idx);
+
+				if (mem_err)
+					return mem_err;
+			}
+		}
+	}
+	return err;
 }
 
 int evsel__disable(struct evsel *evsel)
 {
-	int err = perf_evsel__disable(&evsel->core);
+	int err;
+
+	if (evsel__is_tool(evsel))
+		err = evsel__tool_pmu_disable(evsel);
+	else
+		err = perf_evsel__disable(&evsel->core);
+
 	/*
 	 * We mark it disabled here so that tools that disable a event can
 	 * ignore events after they disable it. I.e. the ring buffer may have
@@ -1825,6 +1915,27 @@ int evsel__disable(struct evsel *evsel)
 	if (!err)
 		evsel->disabled = true;
 
+	if (!err && evsel__is_group_leader(evsel)) {
+		struct evsel *member;
+
+		for_each_group_member(member, evsel) {
+			if (evsel__is_non_perf_event_open_pmu(evsel) ||
+			    evsel__is_non_perf_event_open_pmu(member)) {
+				/*
+				 * In a mixed PMU group, userspace PMUs are not
+				 * grouped in the kernel and are skipped by the
+				 * kernel when disabling the group leader. We must
+				 * manually disable them in userspace.
+				 */
+				int mem_err = evsel__disable(member);
+
+				if (mem_err)
+					return mem_err;
+			}
+			member->disabled = true;
+		}
+	}
+
 	return err;
 }
 
@@ -1885,8 +1996,10 @@ void evsel__exit(struct evsel *evsel)
 		evsel__priv_destructor(evsel->priv);
 	perf_evsel__object.fini(evsel);
 	if (evsel__tool_event(evsel) == TOOL_PMU__EVENT_SYSTEM_TIME ||
-	    evsel__tool_event(evsel) == TOOL_PMU__EVENT_USER_TIME)
-		xyarray__delete(evsel->start_times);
+	    evsel__tool_event(evsel) == TOOL_PMU__EVENT_USER_TIME) {
+		xyarray__delete(evsel->process_time.start_times);
+		xyarray__delete(evsel->process_time.accumulated_times);
+	}
 }
 
 void evsel__delete(struct evsel *evsel)
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 339b5c08a33d..7e3746480269 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -190,12 +190,18 @@ struct evsel {
 			double max;
 		} retirement_latency;
 		/* duration_time is a single global time. */
-		__u64 start_time;
+		struct {
+			__u64 start_time;
+			__u64 accumulated_time;
+		} duration_time;
 		/*
 		 * user_time and system_time read an initial value potentially
 		 * per-CPU or per-pid.
 		 */
-		struct xyarray *start_times;
+		struct {
+			struct xyarray *start_times;
+			struct xyarray *accumulated_times;
+		} process_time;
 	};
 	/* Is the tool's fd for /proc/pid/stat or /proc/stat. */
 	bool pid_stat;
@@ -350,6 +356,11 @@ void arch_evsel__apply_ratio_to_prev(struct evsel *evsel, struct perf_event_attr
 int evsel__set_filter(struct evsel *evsel, const char *filter);
 int evsel__append_tp_filter(struct evsel *evsel, const char *filter);
 int evsel__append_addr_filter(struct evsel *evsel, const char *filter);
+static inline bool evsel__is_non_perf_event_open_pmu(const struct evsel *evsel)
+{
+	return evsel->pmu && evsel->pmu->type > PERF_PMU_TYPE_PE_END;
+}
+
 int evsel__enable_cpu(struct evsel *evsel, int cpu_map_idx);
 int evsel__enable(struct evsel *evsel);
 int evsel__disable(struct evsel *evsel);
diff --git a/tools/perf/util/tool_pmu.c b/tools/perf/util/tool_pmu.c
index 6a9df3dc0e07..5c30854b4644 100644
--- a/tools/perf/util/tool_pmu.c
+++ b/tools/perf/util/tool_pmu.c
@@ -17,6 +17,8 @@
 #include <fcntl.h>
 #include <strings.h>
 
+#define INVALID_START_TIME ~0ULL
+
 static const char *const tool_pmu__event_names[TOOL_PMU__EVENT_MAX] = {
 	NULL,
 	"duration_time",
@@ -205,20 +207,57 @@ int evsel__tool_pmu_prepare_open(struct evsel *evsel,
 				 struct perf_cpu_map *cpus,
 				 int nthreads)
 {
-	if ((evsel__tool_event(evsel) == TOOL_PMU__EVENT_SYSTEM_TIME ||
-	     evsel__tool_event(evsel) == TOOL_PMU__EVENT_USER_TIME) &&
-	    !evsel->start_times) {
-		evsel->start_times = xyarray__new(perf_cpu_map__nr(cpus),
-						  nthreads,
-						  sizeof(__u64));
-		if (!evsel->start_times)
-			return -ENOMEM;
+	enum tool_pmu_event ev = evsel__tool_event(evsel);
+
+	if (ev == TOOL_PMU__EVENT_SYSTEM_TIME || ev == TOOL_PMU__EVENT_USER_TIME) {
+		if (!evsel->process_time.start_times) {
+			evsel->process_time.start_times =
+				xyarray__new(perf_cpu_map__nr(cpus), nthreads, sizeof(__u64));
+			if (!evsel->process_time.start_times)
+				return -ENOMEM;
+		}
+		if (!evsel->process_time.accumulated_times) {
+			evsel->process_time.accumulated_times =
+				xyarray__new(perf_cpu_map__nr(cpus), nthreads, sizeof(__u64));
+			if (!evsel->process_time.accumulated_times)
+				return -ENOMEM;
+		}
 	}
 	return 0;
 }
 
 #define FD(e, x, y) (*(int *)xyarray__entry(e->core.fd, x, y))
 
+static int tool_pmu__read_stat(struct evsel *evsel, int cpu_map_idx, int thread, __u64 *val)
+{
+	enum tool_pmu_event ev = evsel__tool_event(evsel);
+	bool system = ev == TOOL_PMU__EVENT_SYSTEM_TIME;
+	int fd = FD(evsel, cpu_map_idx, thread);
+	int err = 0;
+
+	if (fd < 0) {
+		*val = 0;
+		return 0;
+	}
+
+	lseek(fd, 0, SEEK_SET);
+	if (evsel->pid_stat) {
+		if (cpu_map_idx == 0)
+			err = read_pid_stat_field(fd, system ? 15 : 14, val);
+		else
+			*val = 0;
+	} else {
+		if (thread == 0) {
+			struct perf_cpu cpu = perf_cpu_map__cpu(evsel->core.cpus, cpu_map_idx);
+
+			err = read_stat_field(fd, cpu, system ? 3 : 1, val);
+		} else {
+			*val = 0;
+		}
+	}
+	return err;
+}
+
 int evsel__tool_pmu_open(struct evsel *evsel,
 			 struct perf_thread_map *threads,
 			 int start_cpu_map_idx, int end_cpu_map_idx)
@@ -232,7 +271,14 @@ int evsel__tool_pmu_open(struct evsel *evsel,
 	if (ev == TOOL_PMU__EVENT_DURATION_TIME) {
 		if (evsel->core.attr.sample_period) /* no sampling */
 			return -EINVAL;
-		evsel->start_time = rdclock();
+		evsel->duration_time.accumulated_time = 0;
+		if (evsel->core.attr.disabled) {
+			evsel->disabled = true;
+			evsel->duration_time.start_time = INVALID_START_TIME;
+		} else {
+			evsel->disabled = false;
+			evsel->duration_time.start_time = rdclock();
+		}
 		return 0;
 	}
 
@@ -246,8 +292,8 @@ int evsel__tool_pmu_open(struct evsel *evsel,
 				pid = perf_thread_map__pid(threads, thread);
 
 			if (ev == TOOL_PMU__EVENT_USER_TIME || ev == TOOL_PMU__EVENT_SYSTEM_TIME) {
-				bool system = ev == TOOL_PMU__EVENT_SYSTEM_TIME;
 				__u64 *start_time = NULL;
+				__u64 *accumulated_time = NULL;
 				int fd;
 
 				if (evsel->core.attr.sample_period) {
@@ -269,21 +315,25 @@ int evsel__tool_pmu_open(struct evsel *evsel,
 					err = -errno;
 					goto out_close;
 				}
-				start_time = xyarray__entry(evsel->start_times, idx, thread);
-				if (pid > -1) {
-					err = read_pid_stat_field(fd, system ? 15 : 14,
-								  start_time);
+				start_time = xyarray__entry(evsel->process_time.start_times, idx,
+							    thread);
+				accumulated_time = xyarray__entry(
+					evsel->process_time.accumulated_times, idx, thread);
+				*accumulated_time = 0;
+
+				if (evsel->core.attr.disabled) {
+					evsel->disabled = true;
+					*start_time = INVALID_START_TIME;
 				} else {
-					struct perf_cpu cpu;
-
-					cpu = perf_cpu_map__cpu(evsel->core.cpus, idx);
-					err = read_stat_field(fd, cpu, system ? 3 : 1,
-							      start_time);
+					evsel->disabled = false;
+					err = tool_pmu__read_stat(evsel, idx, thread, start_time);
+					if (err) {
+						close(fd);
+						FD(evsel, idx, thread) = -1;
+						goto out_close;
+					}
 				}
-				if (err)
-					goto out_close;
 			}
-
 		}
 	}
 	return 0;
@@ -467,10 +517,111 @@ static void perf_counts__update(struct perf_counts_values *count,
 		count->lost = 0;
 	}
 }
+int evsel__tool_pmu_enable_cpu(struct evsel *evsel, int cpu_map_idx)
+{
+	enum tool_pmu_event ev = evsel__tool_event(evsel);
+	int thread, nthreads;
+
+	if (!evsel->disabled)
+		return 0;
+
+	if (ev == TOOL_PMU__EVENT_DURATION_TIME) {
+		if (cpu_map_idx == 0)
+			evsel->duration_time.start_time = rdclock();
+		return 0;
+	}
+
+	if (ev == TOOL_PMU__EVENT_USER_TIME || ev == TOOL_PMU__EVENT_SYSTEM_TIME) {
+		nthreads = xyarray__max_y(evsel->process_time.start_times);
+		for (thread = 0; thread < nthreads; thread++) {
+			__u64 *start_time = xyarray__entry(evsel->process_time.start_times,
+							   cpu_map_idx, thread);
+			__u64 val;
+			int err;
+
+			err = tool_pmu__read_stat(evsel, cpu_map_idx, thread, &val);
+			if (!err)
+				*start_time = val;
+			else
+				*start_time = INVALID_START_TIME;
+		}
+	}
+	return 0;
+}
+
+int evsel__tool_pmu_enable(struct evsel *evsel)
+{
+	unsigned int idx;
+	int err = 0;
+
+	if (!evsel->disabled)
+		return 0;
+
+	for (idx = 0; idx < perf_cpu_map__nr(evsel->core.cpus); idx++) {
+		err = evsel__tool_pmu_enable_cpu(evsel, idx);
+		if (err)
+			break;
+	}
+	return err;
+}
+
+int evsel__tool_pmu_disable_cpu(struct evsel *evsel, int cpu_map_idx)
+{
+	enum tool_pmu_event ev = evsel__tool_event(evsel);
+	int thread, nthreads;
+
+	if (evsel->disabled)
+		return 0;
+
+	if (ev == TOOL_PMU__EVENT_DURATION_TIME) {
+		if (cpu_map_idx == 0) {
+			__u64 delta = rdclock() - evsel->duration_time.start_time;
+
+			evsel->duration_time.accumulated_time += delta;
+		}
+		return 0;
+	}
+
+	if (ev == TOOL_PMU__EVENT_USER_TIME || ev == TOOL_PMU__EVENT_SYSTEM_TIME) {
+		nthreads = xyarray__max_y(evsel->process_time.start_times);
+		for (thread = 0; thread < nthreads; thread++) {
+			__u64 *start_time = xyarray__entry(evsel->process_time.start_times,
+							   cpu_map_idx, thread);
+			__u64 *accumulated_time = xyarray__entry(
+				evsel->process_time.accumulated_times, cpu_map_idx, thread);
+			__u64 val;
+			int err;
+
+			err = tool_pmu__read_stat(evsel, cpu_map_idx, thread, &val);
+			if (!err) {
+				if (*start_time != INVALID_START_TIME && val >= *start_time)
+					*accumulated_time += (val - *start_time);
+			}
+			*start_time = INVALID_START_TIME;
+		}
+	}
+	return 0;
+}
+
+int evsel__tool_pmu_disable(struct evsel *evsel)
+{
+	unsigned int idx;
+	int err = 0;
+
+	if (evsel->disabled)
+		return 0;
+
+	for (idx = 0; idx < perf_cpu_map__nr(evsel->core.cpus); idx++) {
+		err = evsel__tool_pmu_disable_cpu(evsel, idx);
+		if (err)
+			break;
+	}
+	return err;
+}
 
 int evsel__tool_pmu_read(struct evsel *evsel, int cpu_map_idx, int thread)
 {
-	__u64 *start_time, cur_time, delta_start;
+	__u64 delta_start = 0;
 	int err = 0;
 	struct perf_counts_values *count, *old_count = NULL;
 	bool adjust = false;
@@ -507,39 +658,33 @@ int evsel__tool_pmu_read(struct evsel *evsel, int cpu_map_idx, int thread)
 		return 0;
 	}
 	case TOOL_PMU__EVENT_DURATION_TIME:
-		/*
-		 * Pretend duration_time is only on the first CPU and thread, or
-		 * else aggregation will scale duration_time by the number of
-		 * CPUs/threads.
-		 */
-		start_time = &evsel->start_time;
-		if (cpu_map_idx == 0 && thread == 0)
-			cur_time = rdclock();
-		else
-			cur_time = *start_time;
+		if (cpu_map_idx == 0 && thread == 0) {
+			delta_start = evsel->duration_time.accumulated_time;
+			if (!evsel->disabled &&
+			    evsel->duration_time.start_time != INVALID_START_TIME)
+				delta_start += (rdclock() - evsel->duration_time.start_time);
+		} else {
+			delta_start = 0;
+		}
 		break;
 	case TOOL_PMU__EVENT_USER_TIME:
 	case TOOL_PMU__EVENT_SYSTEM_TIME: {
-		bool system = evsel__tool_event(evsel) == TOOL_PMU__EVENT_SYSTEM_TIME;
-		int fd = FD(evsel, cpu_map_idx, thread);
-
-		start_time = xyarray__entry(evsel->start_times, cpu_map_idx, thread);
-		lseek(fd, SEEK_SET, 0);
-		if (evsel->pid_stat) {
-			/* The event exists solely on 1 CPU. */
-			if (cpu_map_idx == 0)
-				err = read_pid_stat_field(fd, system ? 15 : 14, &cur_time);
-			else
-				cur_time = 0;
-		} else {
-			/* The event is for all threads. */
-			if (thread == 0) {
-				struct perf_cpu cpu = perf_cpu_map__cpu(evsel->core.cpus,
-									cpu_map_idx);
+		__u64 accumulated = *(__u64 *)xyarray__entry(evsel->process_time.accumulated_times,
+							     cpu_map_idx, thread);
 
-				err = read_stat_field(fd, cpu, system ? 3 : 1, &cur_time);
-			} else {
-				cur_time = 0;
+		if (evsel->disabled) {
+			delta_start = accumulated;
+		} else {
+			__u64 *start_time = xyarray__entry(evsel->process_time.start_times,
+							   cpu_map_idx, thread);
+			__u64 cur_time;
+
+			err = tool_pmu__read_stat(evsel, cpu_map_idx, thread, &cur_time);
+			if (!err) {
+				if (*start_time != INVALID_START_TIME && cur_time >= *start_time)
+					delta_start = accumulated + (cur_time - *start_time);
+				else
+					delta_start = accumulated;
 			}
 		}
 		adjust = true;
@@ -553,7 +698,6 @@ int evsel__tool_pmu_read(struct evsel *evsel, int cpu_map_idx, int thread)
 	if (err)
 		return err;
 
-	delta_start = cur_time - *start_time;
 	if (adjust) {
 		__u64 ticks_per_sec = sysconf(_SC_CLK_TCK);
 
diff --git a/tools/perf/util/tool_pmu.h b/tools/perf/util/tool_pmu.h
index f1714001bc1d..f66d24cf3502 100644
--- a/tools/perf/util/tool_pmu.h
+++ b/tools/perf/util/tool_pmu.h
@@ -56,6 +56,10 @@ int evsel__tool_pmu_prepare_open(struct evsel *evsel,
 int evsel__tool_pmu_open(struct evsel *evsel,
 			 struct perf_thread_map *threads,
 			 int start_cpu_map_idx, int end_cpu_map_idx);
+int evsel__tool_pmu_enable_cpu(struct evsel *evsel, int cpu_map_idx);
+int evsel__tool_pmu_enable(struct evsel *evsel);
+int evsel__tool_pmu_disable_cpu(struct evsel *evsel, int cpu_map_idx);
+int evsel__tool_pmu_disable(struct evsel *evsel);
 int evsel__tool_pmu_read(struct evsel *evsel, int cpu_map_idx, int thread);
 
 struct perf_pmu *tool_pmu__new(void);
-- 
2.54.0.631.ge1b05301d1-goog

Re: [PATCH v4 1/2] perf tool_pmu: Make tool PMU events respect enable/disable

Posted by Arnaldo Carvalho de Melo 2 days, 3 hours ago

On Mon, May 18, 2026 at 06:41:05PM -0700, Ian Rogers wrote:
> Tool PMU events (duration_time, user_time, system_time) currently
> count from when the event is opened to when it is read. This causes
> issues with features like the delay option (-D) or control fd, where
> events are opened but should not start counting immediately.
> 
> Make these events behave more like regular counters by implementing
> proper enable and disable support. Add accumulated_time to struct
> evsel to track time while enabled, and implement enable/disable CPU
> callbacks to start/stop counting.
> 
> Also generalize userspace PMU mixed group handling. Userspace synthetic
> PMUs (type > PERF_PMU_TYPE_PE_END) do not have kernel implementations and
> cannot be grouped in the kernel (opened with group_fd = -1), and are
> skipped by kernel enable/disable calls. Iterate over group members in
> userspace and manually enable/disable any members if the leader or the
> member is a non-perf-event open PMU, and synchronize their disabled flags.
> 
> Fixes: b71f46a6a708 ("perf stat: Remove hard coded shadow metrics")
> Reported-by: Francesco Nigro <nigro.fra@gmail.com>
> Closes: https://lore.kernel.org/linux-perf-users/20260517093650.2540920-1-nigro.fra@gmail.com/
> Signed-off-by: Ian Rogers <irogers@google.com>
> Assisted-by: Antigravity:gemini-3-flash

Thanks, applied both patches to perf-tools-next, for v7.2.

- Arnaldo

Re: [PATCH v4 1/2] perf tool_pmu: Make tool PMU events respect enable/disable

Posted by Namhyung Kim 5 days, 21 hours ago

On Mon, May 18, 2026 at 06:41:05PM -0700, Ian Rogers wrote:
> Tool PMU events (duration_time, user_time, system_time) currently
> count from when the event is opened to when it is read. This causes
> issues with features like the delay option (-D) or control fd, where
> events are opened but should not start counting immediately.
> 
> Make these events behave more like regular counters by implementing
> proper enable and disable support. Add accumulated_time to struct
> evsel to track time while enabled, and implement enable/disable CPU
> callbacks to start/stop counting.
> 
> Also generalize userspace PMU mixed group handling. Userspace synthetic
> PMUs (type > PERF_PMU_TYPE_PE_END) do not have kernel implementations and
> cannot be grouped in the kernel (opened with group_fd = -1), and are
> skipped by kernel enable/disable calls. Iterate over group members in
> userspace and manually enable/disable any members if the leader or the
> member is a non-perf-event open PMU, and synchronize their disabled flags.

Can we divide the commit into smaller pieces?  I think we can have

 * preparation for accumulated_time
 * implement enable/disable for tool PMU
 * wire them to evsel__{enable,disable}[_cpu]
 * support group members properly

What do you think?

Thanks,
Namhyung

> 
> Fixes: b71f46a6a708 ("perf stat: Remove hard coded shadow metrics")
> Reported-by: Francesco Nigro <nigro.fra@gmail.com>
> Closes: https://lore.kernel.org/linux-perf-users/20260517093650.2540920-1-nigro.fra@gmail.com/
> Signed-off-by: Ian Rogers <irogers@google.com>
> Assisted-by: Antigravity:gemini-3-flash
> ---
>  tools/perf/util/evlist.c   |  10 +-
>  tools/perf/util/evsel.c    | 197 ++++++++++++++++++++++-------
>  tools/perf/util/evsel.h    |  15 ++-
>  tools/perf/util/tool_pmu.c | 250 +++++++++++++++++++++++++++++--------
>  tools/perf/util/tool_pmu.h |   4 +
>  5 files changed, 377 insertions(+), 99 deletions(-)

Re: [PATCH v4 1/2] perf tool_pmu: Make tool PMU events respect enable/disable

Posted by Ian Rogers 5 days, 19 hours ago

On Mon, May 18, 2026 at 11:43 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 06:41:05PM -0700, Ian Rogers wrote:
> > Tool PMU events (duration_time, user_time, system_time) currently
> > count from when the event is opened to when it is read. This causes
> > issues with features like the delay option (-D) or control fd, where
> > events are opened but should not start counting immediately.
> >
> > Make these events behave more like regular counters by implementing
> > proper enable and disable support. Add accumulated_time to struct
> > evsel to track time while enabled, and implement enable/disable CPU
> > callbacks to start/stop counting.
> >
> > Also generalize userspace PMU mixed group handling. Userspace synthetic
> > PMUs (type > PERF_PMU_TYPE_PE_END) do not have kernel implementations and
> > cannot be grouped in the kernel (opened with group_fd = -1), and are
> > skipped by kernel enable/disable calls. Iterate over group members in
> > userspace and manually enable/disable any members if the leader or the
> > member is a non-perf-event open PMU, and synchronize their disabled flags.
>
> Can we divide the commit into smaller pieces?  I think we can have
>
>  * preparation for accumulated_time
>  * implement enable/disable for tool PMU
>  * wire them to evsel__{enable,disable}[_cpu]
>  * support group members properly
>
> What do you think?

That sounds needlessly painful, involving many irrelevant intermediate
states with dead functions and variables. Most of the patch is
confined to tool_pmu, we could make the evsel patch smaller by
removing comments. The changes in evsel are pretty minimal and the
patch is largely confined to tool_pmu.

Thanks,
Ian

> Thanks,
> Namhyung
>
> >
> > Fixes: b71f46a6a708 ("perf stat: Remove hard coded shadow metrics")
> > Reported-by: Francesco Nigro <nigro.fra@gmail.com>
> > Closes: https://lore.kernel.org/linux-perf-users/20260517093650.2540920-1-nigro.fra@gmail.com/
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > Assisted-by: Antigravity:gemini-3-flash
> > ---
> >  tools/perf/util/evlist.c   |  10 +-
> >  tools/perf/util/evsel.c    | 197 ++++++++++++++++++++++-------
> >  tools/perf/util/evsel.h    |  15 ++-
> >  tools/perf/util/tool_pmu.c | 250 +++++++++++++++++++++++++++++--------
> >  tools/perf/util/tool_pmu.h |   4 +
> >  5 files changed, 377 insertions(+), 99 deletions(-)

Re: [PATCH v4 1/2] perf tool_pmu: Make tool PMU events respect enable/disable

Posted by Namhyung Kim 4 days, 4 hours ago

On Tue, May 19, 2026 at 01:13:35AM -0700, Ian Rogers wrote:
> On Mon, May 18, 2026 at 11:43 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 06:41:05PM -0700, Ian Rogers wrote:
> > > Tool PMU events (duration_time, user_time, system_time) currently
> > > count from when the event is opened to when it is read. This causes
> > > issues with features like the delay option (-D) or control fd, where
> > > events are opened but should not start counting immediately.
> > >
> > > Make these events behave more like regular counters by implementing
> > > proper enable and disable support. Add accumulated_time to struct
> > > evsel to track time while enabled, and implement enable/disable CPU
> > > callbacks to start/stop counting.
> > >
> > > Also generalize userspace PMU mixed group handling. Userspace synthetic
> > > PMUs (type > PERF_PMU_TYPE_PE_END) do not have kernel implementations and
> > > cannot be grouped in the kernel (opened with group_fd = -1), and are
> > > skipped by kernel enable/disable calls. Iterate over group members in
> > > userspace and manually enable/disable any members if the leader or the
> > > member is a non-perf-event open PMU, and synchronize their disabled flags.
> >
> > Can we divide the commit into smaller pieces?  I think we can have
> >
> >  * preparation for accumulated_time
> >  * implement enable/disable for tool PMU
> >  * wire them to evsel__{enable,disable}[_cpu]
> >  * support group members properly
> >
> > What do you think?
> 
> That sounds needlessly painful, involving many irrelevant intermediate
> states with dead functions and variables. Most of the patch is
> confined to tool_pmu, we could make the evsel patch smaller by
> removing comments. The changes in evsel are pretty minimal and the
> patch is largely confined to tool_pmu.

I didn't think it'd add many temporary code especially for group
handling.  But I'm fine with the single patch if breaking it up would
cause a lot of pains to you.

Thanks,
Namhyung

[PATCH v4 2/2] perf tests: Add test for stat delay option with duration_time

Posted by Ian Rogers 6 days, 2 hours ago

Add a new test case `test_stat_delay` to `stat.sh` to verify that
`duration_time` correctly excludes the delay period when using the
delay option (-D).

The test runs `perf stat -D 1000 -e duration_time sleep 2` and
verifies that `duration_time` is ~1s (excluding the 1s delay), not
~2s.

Signed-off-by: Ian Rogers <irogers@google.com>
Assisted-by: Antigravity:gemini-3-flash
---
 tools/perf/tests/shell/stat.sh | 53 ++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/tools/perf/tests/shell/stat.sh b/tools/perf/tests/shell/stat.sh
index 4edb04039036..1e17bee026bd 100755
--- a/tools/perf/tests/shell/stat.sh
+++ b/tools/perf/tests/shell/stat.sh
@@ -483,6 +483,58 @@ test_stat_pid() {
   wait $pid 2>/dev/null || true
 }
 
+test_stat_delay() {
+  echo "stat -D test"
+  if ! env LC_ALL=C perf stat -D 1000 -e duration_time sleep 2 > "${stat_output}" 2>&1
+  then
+    echo "stat -D test [Failed - command failed]"
+    cat "${stat_output}"
+    err=1
+    return
+  fi
+
+  duration=$(grep "duration_time" "${stat_output}" | awk '{print $1}' | tr -d ',')
+  elapsed=$(grep "seconds time elapsed" "${stat_output}" | awk '{print $1}')
+
+  if [ -z "$duration" ] || [ -z "$elapsed" ]
+  then
+    echo "stat -D test [Failed - failed to find duration_time or time elapsed in output]"
+    cat "${stat_output}"
+    err=1
+    return
+  fi
+
+  # Compare duration (ns) and elapsed (s) using awk to handle float and allow tolerance.
+  if ! awk -v d="$duration" -v e="$elapsed" '
+    BEGIN {
+      diff = d - (e * 1e9);
+      if (diff < 0) diff = -diff;
+      # Allow 200ms tolerance (200,000,000 ns) for loaded CI machines.
+      if (diff > 200000000) {
+        printf "Fail: duration (%d ns) and elapsed (%f s) mismatch (diff %d ns)\n", d, e, diff;
+        exit 1;
+      }
+      # Lower bound check: must be at least 0.5s.
+      if (d < 500000000) {
+        printf "Fail: duration (%d ns) is abnormally small\n", d;
+        exit 1;
+      }
+      # Upper bound check: must be strictly less than 1.7s (proving delay was excluded).
+      if (d > 1700000000) {
+        printf "Fail: duration (%d ns) is too large (delay might not be excluded)\n", d;
+        exit 1;
+      }
+      exit 0;
+    }'
+  then
+    echo "stat -D test [Failed - validation failed]"
+    cat "${stat_output}"
+    err=1
+    return
+  fi
+  echo "stat -D test [Success]"
+}
+
 test_default_stat
 test_null_stat
 test_offline_cpu_stat
@@ -498,6 +550,7 @@ test_stat_no_aggr
 test_stat_detailed
 test_stat_repeat
 test_stat_pid
+test_stat_delay
 
 cleanup
 exit $err
-- 
2.54.0.631.ge1b05301d1-goog