[v6] perf record --off-cpu: Dump off-cpu samples directly

[PATCH v6 1/8] perf evsel: Set off-cpu BPF output to system-wide

Posted by Howard Chu 1 year, 4 months ago

Use pid = -1 for the off-cpu bpf-output event.

This makes 'perf record -p <PID> --off-cpu', and 'perf record --off-cpu
<workload>' work. Otherwise bpf-output cannot be collected.

The reason (conjecture): suppose we open the perf_event on pid = 11451.
In BPF, we call bpf_perf_event_output() when a direct sample is ready to
be dumped. At that point the perf_event of pid 11451 is not __fully__
scheduled in yet, so in kernel/trace/bpf_trace.c's
__bpf_perf_event_output(), event->oncpu != cpu holds, the function
returns -EOPNOTSUPP, and the direct off-cpu sample output fails.

	if (unlikely(event->oncpu != cpu))
		return -EOPNOTSUPP;

So I'm setting pid = -1, so that every task can do bpf_perf_event_output().

P.S. In perf trace this is not necessary, because it uses syscall
tracepoints, instead of sched_switch.

Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/util/evsel.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index edfb376f0611..500ca62669cb 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2368,6 +2368,9 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
 
 			test_attr__ready();
 
+			if (evsel__is_offcpu_event(evsel))
+				pid = -1;
+
 			/* Debug message used by test scripts */
 			pr_debug2_peo("sys_perf_event_open: pid %d  cpu %d  group_fd %d  flags %#lx",
 				pid, perf_cpu_map__cpu(cpus, idx).cpu, group_fd, evsel->open_flags);
-- 
2.43.0

Re: [PATCH v6 1/8] perf evsel: Set off-cpu BPF output to system-wide

Posted by Namhyung Kim 1 year, 4 months ago

On Fri, Sep 27, 2024 at 01:27:29PM -0700, Howard Chu wrote:
> Use pid = -1 for the off-cpu bpf-output event.
> 
> This makes 'perf record -p <PID> --off-cpu', and 'perf record --off-cpu
> <workload>' work. Otherwise bpf-output cannot be collected.
> 
> The reason (conjecture): suppose we open the perf_event on pid = 11451.
> In BPF, we call bpf_perf_event_output() when a direct sample is ready to
> be dumped. At that point the perf_event of pid 11451 is not __fully__
> scheduled in yet, so in kernel/trace/bpf_trace.c's
> __bpf_perf_event_output(), event->oncpu != cpu holds, the function
> returns -EOPNOTSUPP, and the direct off-cpu sample output fails.
> 
> 	if (unlikely(event->oncpu != cpu))
> 		return -EOPNOTSUPP;
> 
> So I'm setting pid = -1, so that every task can do bpf_perf_event_output().
> 
> P.S. In perf trace this is not necessary, because it uses syscall
> tracepoints, instead of sched_switch.
> 
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
>  tools/perf/util/evsel.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index edfb376f0611..500ca62669cb 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2368,6 +2368,9 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>  
>  			test_attr__ready();
>  
> +			if (evsel__is_offcpu_event(evsel))
> +				pid = -1;
> +

This looks hacky.  I think you'll end up with two copies of the offcpu
event if there are two target tasks.  Maybe you can replace the thread
map of the offcpu event, after creating the event, with one that has a
single entry (-1) meaning any thread.

Thanks,
Namhyung


>  			/* Debug message used by test scripts */
>  			pr_debug2_peo("sys_perf_event_open: pid %d  cpu %d  group_fd %d  flags %#lx",
>  				pid, perf_cpu_map__cpu(cpus, idx).cpu, group_fd, evsel->open_flags);
> -- 
> 2.43.0
>
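
The duplication concern raised in the review can be illustrated with a small user-space model. It is a sketch, not perf code: model_open() and struct model_open_stats are hypothetical, and it assumes the open path issues one open call per (cpu, thread) pair of the evsel's maps. It compares overriding pid inside the loop with giving the event a single-entry thread map.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ANY_THREAD (-1)	/* pid value meaning "any thread" */

struct model_open_stats {
	size_t total;		/* number of modeled open calls */
	size_t duplicates;	/* calls repeating an earlier (pid, cpu) pair */
};

/*
 * Walk every (cpu, thread) pair the way a nested open loop would.
 * When force_any_thread is non-zero, pid is overridden to -1 inside the
 * loop, as in the patch under review.
 */
static struct model_open_stats model_open(size_t nr_cpus, const int *tids,
					  size_t nr_tids, int force_any_thread)
{
	struct model_open_stats st = { 0, 0 };

	assert(tids != NULL || nr_tids == 0);

	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		for (size_t t = 0; t < nr_tids; t++) {
			int pid = force_any_thread ? MODEL_ANY_THREAD : tids[t];

			/* Duplicate if an earlier thread on this CPU used the same pid. */
			for (size_t prev = 0; prev < t; prev++) {
				int prev_pid = force_any_thread ?
					       MODEL_ANY_THREAD : tids[prev];

				if (prev_pid == pid) {
					st.duplicates++;
					break;
				}
			}
			st.total++;
		}
	}
	return st;
}
```

In this model, two target tasks on four CPUs with the in-loop override give eight opens, four of which repeat an existing (-1, cpu) pair. A thread map holding the single entry -1 gives four opens and no duplicates, which is the outcome the suggestion aims for.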

[PATCH v6 1/8] perf evsel: Set off-cpu BPF output to system-wide
[PATCH v6 2/8] perf record --off-cpu: Add --off-cpu-thresh
[PATCH v6 3/8] perf record --off-cpu: Parse offcpu-time event
[PATCH v6 4/8] perf record off-cpu: Dump direct off-cpu samples in BPF
[PATCH v6 5/8] perf record --off-cpu: Dump total off-cpu time at the end.
[PATCH v6 6/8] perf evsel: Delete unnecessary = 0
[PATCH v6 7/8] perf record --off-cpu: Parse BPF output embedded data
[PATCH v6 8/8] perf test: Add direct off-cpu dumping test