[PATCH v3 0/5] Dump off-cpu samples directly

Howard Chu posted 5 patches 2 months, 3 weeks ago
tools/perf/builtin-record.c             |   2 +
tools/perf/builtin-script.c             |   2 +-
tools/perf/tests/builtin-test.c         |   1 +
tools/perf/tests/shell/record_offcpu.sh |  29 +++++
tools/perf/tests/tests.h                |   1 +
tools/perf/tests/workloads/Build        |   1 +
tools/perf/tests/workloads/offcpu.c     |  16 +++
tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
tools/perf/util/evsel.c                 |  16 ++-
tools/perf/util/evsel.h                 |  13 +++
tools/perf/util/header.c                |  12 ++
tools/perf/util/off_cpu.h               |   1 +
tools/perf/util/record.h                |   1 +
tools/perf/util/session.c               |  23 +++-
15 files changed, 309 insertions(+), 5 deletions(-)
create mode 100644 tools/perf/tests/workloads/offcpu.c
[PATCH v3 0/5] Dump off-cpu samples directly
Posted by Howard Chu 2 months, 3 weeks ago
As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323

Currently, off-cpu samples are dumped when perf record is exiting. This
results in off-cpu samples being after the regular samples. This patch
series makes possible dumping off-cpu samples on-the-fly, directly into
perf ring buffer. And it dispatches those samples to the correct format
for perf.data consumers.

Changes in v3:
 - Add off-cpu-thresh argument
 - Process direct off-cpu samples in post

Changes in v2:
 - Remove unnecessary comments.
 - Rename function off_cpu_change_type to off_cpu_prepare_parse

Before:
```
     migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
            perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
            perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
            perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
     migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
     migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])

sshd  708098 [000] 18446744069.414584:     286392 offcpu-time: 
	    79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
	    585690935cca [unknown] (/usr/bin/sshd)
```

After:
```
            perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
            perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
         swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
         swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
    blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time-direct: 
	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
	    7fff24e862d8 [unknown] ([unknown])


    blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time-direct: 
	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
	    7fff24e862d8 [unknown] ([unknown])


         swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
     dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
```

Howard Chu (5):
  perf record off-cpu: Add direct off-cpu event
  perf record off-cpu: Dumping samples in BPF
  perf record off-cpu: processing of embedded sample
  perf record off-cpu: save embedded sample type
  perf record off-cpu: Add direct off-cpu test

 tools/perf/builtin-record.c             |   2 +
 tools/perf/builtin-script.c             |   2 +-
 tools/perf/tests/builtin-test.c         |   1 +
 tools/perf/tests/shell/record_offcpu.sh |  29 +++++
 tools/perf/tests/tests.h                |   1 +
 tools/perf/tests/workloads/Build        |   1 +
 tools/perf/tests/workloads/offcpu.c     |  16 +++
 tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
 tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
 tools/perf/util/evsel.c                 |  16 ++-
 tools/perf/util/evsel.h                 |  13 +++
 tools/perf/util/header.c                |  12 ++
 tools/perf/util/off_cpu.h               |   1 +
 tools/perf/util/record.h                |   1 +
 tools/perf/util/session.c               |  23 +++-
 15 files changed, 309 insertions(+), 5 deletions(-)
 create mode 100644 tools/perf/tests/workloads/offcpu.c

-- 
2.45.2
Re: [PATCH v3 0/5] Dump off-cpu samples directly
Posted by Ian Rogers 2 months, 3 weeks ago
On Fri, Jul 26, 2024 at 3:28 AM Howard Chu <howardchu95@gmail.com> wrote:
>
> As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
>
> Currently, off-cpu samples are dumped when perf record is exiting. This
> results in off-cpu samples being after the regular samples. This patch
> series makes possible dumping off-cpu samples on-the-fly, directly into
> perf ring buffer. And it dispatches those samples to the correct format
> for perf.data consumers.
>
> Changes in v3:
>  - Add off-cpu-thresh argument
>  - Process direct off-cpu samples in post
>
> Changes in v2:
>  - Remove unnecessary comments.
>  - Rename function off_cpu_change_type to off_cpu_prepare_parse
>
> Before:
> ```
>      migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
>             perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
>             perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
>             perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
>      migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
>      migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
>
> sshd  708098 [000] 18446744069.414584:     286392 offcpu-time:
>             79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
>             585690935cca [unknown] (/usr/bin/sshd)
> ```
>
> After:
> ```
>             perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
>             perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
>          swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>          swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>     blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time-direct:
>             7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
>             7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
>             7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
>             7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
>             7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
>             7fff24e862d8 [unknown] ([unknown])
>
>
>     blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time-direct:
>             7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
>             7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
>             7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
>             7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
>             7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
>             7fff24e862d8 [unknown] ([unknown])
>
>
>          swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>      dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> ```
>
> Howard Chu (5):
>   perf record off-cpu: Add direct off-cpu event
>   perf record off-cpu: Dumping samples in BPF
>   perf record off-cpu: processing of embedded sample
>   perf record off-cpu: save embedded sample type
>   perf record off-cpu: Add direct off-cpu test

This worked nicely on some tests with me. I'm not sure about
separating offcpu-time and offcpu-time-direct rather than the just 1
event. I also immediately set --off-cpu-thresh=1 so maybe milliseconds
should be microseconds, and the default shouldn't be to not use the
direct samples.

Thanks,
Ian

>  tools/perf/builtin-record.c             |   2 +
>  tools/perf/builtin-script.c             |   2 +-
>  tools/perf/tests/builtin-test.c         |   1 +
>  tools/perf/tests/shell/record_offcpu.sh |  29 +++++
>  tools/perf/tests/tests.h                |   1 +
>  tools/perf/tests/workloads/Build        |   1 +
>  tools/perf/tests/workloads/offcpu.c     |  16 +++
>  tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
>  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
>  tools/perf/util/evsel.c                 |  16 ++-
>  tools/perf/util/evsel.h                 |  13 +++
>  tools/perf/util/header.c                |  12 ++
>  tools/perf/util/off_cpu.h               |   1 +
>  tools/perf/util/record.h                |   1 +
>  tools/perf/util/session.c               |  23 +++-
>  15 files changed, 309 insertions(+), 5 deletions(-)
>  create mode 100644 tools/perf/tests/workloads/offcpu.c
>
> --
> 2.45.2
>
Re: [PATCH v3 0/5] Dump off-cpu samples directly
Posted by Namhyung Kim 2 months, 3 weeks ago
Hello Howard,

On Fri, Jul 26, 2024 at 06:28:21PM +0800, Howard Chu wrote:
> As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> 
> Currently, off-cpu samples are dumped when perf record is exiting. This
> results in off-cpu samples being after the regular samples. This patch
> series makes possible dumping off-cpu samples on-the-fly, directly into
> perf ring buffer. And it dispatches those samples to the correct format
> for perf.data consumers.

Thanks for your work!

But I'm not sure we need a separate event for offcpu-time-direct.  If we
fix the format for the direct event, we can adjust the format of offcpu-
time when it dumps at the end.

Anyway, as far as I can see you don't need to fill the sample info in
the offcpu-time-direct manually in your BPF program.  Because the
bpf_perf_event_output() will call perf_event_output() which fills all
the sample information according to the sample_type flags.

Well.. it'll set IP to the schedule function, but it should be ok.
(updating IP using CALLCHAIN like in off_cpu_write() is a kinda hack and
not absoluately necessary, probably I can get rid of it..  Let's go with
simple for now.)

So I think what you need is to ensure it has the uncessary flags.  And
the only info it needs to fill is the time between the previous schedule
and this can be added to the raw data.

Thanks,
Namhyung

> 
> Changes in v3:
>  - Add off-cpu-thresh argument
>  - Process direct off-cpu samples in post
> 
> Changes in v2:
>  - Remove unnecessary comments.
>  - Rename function off_cpu_change_type to off_cpu_prepare_parse
> 
> Before:
> ```
>      migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
>             perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
>             perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
>             perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
>      migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
>      migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
> 
> sshd  708098 [000] 18446744069.414584:     286392 offcpu-time: 
> 	    79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> 	    585690935cca [unknown] (/usr/bin/sshd)
> ```
> 
> After:
> ```
>             perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
>             perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
>          swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>          swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>     blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time-direct: 
> 	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> 	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> 	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> 	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> 	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> 	    7fff24e862d8 [unknown] ([unknown])
> 
> 
>     blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time-direct: 
> 	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> 	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> 	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> 	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> 	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> 	    7fff24e862d8 [unknown] ([unknown])
> 
> 
>          swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>      dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> ```
> 
> Howard Chu (5):
>   perf record off-cpu: Add direct off-cpu event
>   perf record off-cpu: Dumping samples in BPF
>   perf record off-cpu: processing of embedded sample
>   perf record off-cpu: save embedded sample type
>   perf record off-cpu: Add direct off-cpu test
> 
>  tools/perf/builtin-record.c             |   2 +
>  tools/perf/builtin-script.c             |   2 +-
>  tools/perf/tests/builtin-test.c         |   1 +
>  tools/perf/tests/shell/record_offcpu.sh |  29 +++++
>  tools/perf/tests/tests.h                |   1 +
>  tools/perf/tests/workloads/Build        |   1 +
>  tools/perf/tests/workloads/offcpu.c     |  16 +++
>  tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
>  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
>  tools/perf/util/evsel.c                 |  16 ++-
>  tools/perf/util/evsel.h                 |  13 +++
>  tools/perf/util/header.c                |  12 ++
>  tools/perf/util/off_cpu.h               |   1 +
>  tools/perf/util/record.h                |   1 +
>  tools/perf/util/session.c               |  23 +++-
>  15 files changed, 309 insertions(+), 5 deletions(-)
>  create mode 100644 tools/perf/tests/workloads/offcpu.c
> 
> -- 
> 2.45.2
>
Re: [PATCH v3 0/5] Dump off-cpu samples directly
Posted by Howard Chu 2 months, 2 weeks ago
Hello Namhyung,

Thanks for reviewing this patch.

On Mon, Jul 29, 2024 at 9:21 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello Howard,
>
> On Fri, Jul 26, 2024 at 06:28:21PM +0800, Howard Chu wrote:
> > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> >
> > Currently, off-cpu samples are dumped when perf record is exiting. This
> > results in off-cpu samples being after the regular samples. This patch
> > series makes possible dumping off-cpu samples on-the-fly, directly into
> > perf ring buffer. And it dispatches those samples to the correct format
> > for perf.data consumers.
>
> Thanks for your work!
>
> But I'm not sure we need a separate event for offcpu-time-direct.  If we
> fix the format for the direct event, we can adjust the format of offcpu-
> time when it dumps at the end.

Thank you and Ian for this advice, I'll do that.

>
> Anyway, as far as I can see you don't need to fill the sample info in
> the offcpu-time-direct manually in your BPF program.  Because the
> bpf_perf_event_output() will call perf_event_output() which fills all
> the sample information according to the sample_type flags.
>
> Well.. it'll set IP to the schedule function, but it should be ok.
> (updating IP using CALLCHAIN like in off_cpu_write() is a kinda hack and
> not absoluately necessary, probably I can get rid of it..  Let's go with
> simple for now.)
>
> So I think what you need is to ensure it has the uncessary flags.  And
> the only info it needs to fill is the time between the previous schedule
> and this can be added to the raw data.

Sure thing, thank you. Other than the off-cpu duration, do you think
we should collect the callchain as well?

This idea is great because it minimizes the data we need to transfer
from kernel space to user space. But with this, the whole
"full-fledged sample in BPF output" is not the case anymore, because
we only take a couple of things(such as the time between the
schedules) from the raw data. This idea is more like an extension of
the sample data, I think it's interesting, will try to implement it.

Thanks,
Howard
>
> Thanks,
> Namhyung
>
> >
> > Changes in v3:
> >  - Add off-cpu-thresh argument
> >  - Process direct off-cpu samples in post
> >
> > Changes in v2:
> >  - Remove unnecessary comments.
> >  - Rename function off_cpu_change_type to off_cpu_prepare_parse
> >
> > Before:
> > ```
> >      migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
> >             perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
> >             perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
> >             perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
> >      migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
> >      migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
> >
> > sshd  708098 [000] 18446744069.414584:     286392 offcpu-time:
> >           79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> >           585690935cca [unknown] (/usr/bin/sshd)
> > ```
> >
> > After:
> > ```
> >             perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
> >             perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
> >          swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> >          swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> >     blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time-direct:
> >           7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> >           7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> >           7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> >           7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> >           7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> >           7fff24e862d8 [unknown] ([unknown])
> >
> >
> >     blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time-direct:
> >           7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> >           7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> >           7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> >           7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> >           7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> >           7fff24e862d8 [unknown] ([unknown])
> >
> >
> >          swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> >      dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> > ```
> >
> > Howard Chu (5):
> >   perf record off-cpu: Add direct off-cpu event
> >   perf record off-cpu: Dumping samples in BPF
> >   perf record off-cpu: processing of embedded sample
> >   perf record off-cpu: save embedded sample type
> >   perf record off-cpu: Add direct off-cpu test
> >
> >  tools/perf/builtin-record.c             |   2 +
> >  tools/perf/builtin-script.c             |   2 +-
> >  tools/perf/tests/builtin-test.c         |   1 +
> >  tools/perf/tests/shell/record_offcpu.sh |  29 +++++
> >  tools/perf/tests/tests.h                |   1 +
> >  tools/perf/tests/workloads/Build        |   1 +
> >  tools/perf/tests/workloads/offcpu.c     |  16 +++
> >  tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
> >  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
> >  tools/perf/util/evsel.c                 |  16 ++-
> >  tools/perf/util/evsel.h                 |  13 +++
> >  tools/perf/util/header.c                |  12 ++
> >  tools/perf/util/off_cpu.h               |   1 +
> >  tools/perf/util/record.h                |   1 +
> >  tools/perf/util/session.c               |  23 +++-
> >  15 files changed, 309 insertions(+), 5 deletions(-)
> >  create mode 100644 tools/perf/tests/workloads/offcpu.c
> >
> > --
> > 2.45.2
> >
Re: [PATCH v3 0/5] Dump off-cpu samples directly
Posted by Namhyung Kim 2 months, 2 weeks ago
On Mon, Jul 29, 2024 at 11:24:47PM +0800, Howard Chu wrote:
> Hello Namhyung,
> 
> Thanks for reviewing this patch.
> 
> On Mon, Jul 29, 2024 at 9:21 AM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hello Howard,
> >
> > On Fri, Jul 26, 2024 at 06:28:21PM +0800, Howard Chu wrote:
> > > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> > >
> > > Currently, off-cpu samples are dumped when perf record is exiting. This
> > > results in off-cpu samples being after the regular samples. This patch
> > > series makes possible dumping off-cpu samples on-the-fly, directly into
> > > perf ring buffer. And it dispatches those samples to the correct format
> > > for perf.data consumers.
> >
> > Thanks for your work!
> >
> > But I'm not sure we need a separate event for offcpu-time-direct.  If we
> > fix the format for the direct event, we can adjust the format of offcpu-
> > time when it dumps at the end.
> 
> Thank you and Ian for this advice, I'll do that.
> 
> >
> > Anyway, as far as I can see you don't need to fill the sample info in
> > the offcpu-time-direct manually in your BPF program.  Because the
> > bpf_perf_event_output() will call perf_event_output() which fills all
> > the sample information according to the sample_type flags.
> >
> > Well.. it'll set IP to the schedule function, but it should be ok.
> > (updating IP using CALLCHAIN like in off_cpu_write() is a kinda hack and
> > not absoluately necessary, probably I can get rid of it..  Let's go with
> > simple for now.)
> >
> > So I think what you need is to ensure it has the uncessary flags.  And
> > the only info it needs to fill is the time between the previous schedule
> > and this can be added to the raw data.
> 
> Sure thing, thank you. Other than the off-cpu duration, do you think
> we should collect the callchain as well?

I think the kernel will do that for you once you set the
SAMPLE_CALLCHAIN flag in the event attr.

Thanks,
Namhyung

> 
> This idea is great because it minimizes the data we need to transfer
> from kernel space to user space. But with this, the whole
> "full-fledged sample in BPF output" is not the case anymore, because
> we only take a couple of things(such as the time between the
> schedules) from the raw data. This idea is more like an extension of
> the sample data, I think it's interesting, will try to implement it.
> 
> Thanks,
> Howard
> >
> > Thanks,
> > Namhyung
> >
> > >
> > > Changes in v3:
> > >  - Add off-cpu-thresh argument
> > >  - Process direct off-cpu samples in post
> > >
> > > Changes in v2:
> > >  - Remove unnecessary comments.
> > >  - Rename function off_cpu_change_type to off_cpu_prepare_parse
> > >
> > > Before:
> > > ```
> > >      migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
> > >             perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
> > >             perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
> > >             perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
> > >      migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
> > >      migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
> > >
> > > sshd  708098 [000] 18446744069.414584:     286392 offcpu-time:
> > >           79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> > >           585690935cca [unknown] (/usr/bin/sshd)
> > > ```
> > >
> > > After:
> > > ```
> > >             perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
> > >             perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
> > >          swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > >          swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > >     blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time-direct:
> > >           7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > >           7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > >           7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > >           7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > >           7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > >           7fff24e862d8 [unknown] ([unknown])
> > >
> > >
> > >     blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time-direct:
> > >           7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > >           7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > >           7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > >           7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > >           7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > >           7fff24e862d8 [unknown] ([unknown])
> > >
> > >
> > >          swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > >      dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> > > ```
> > >
> > > Howard Chu (5):
> > >   perf record off-cpu: Add direct off-cpu event
> > >   perf record off-cpu: Dumping samples in BPF
> > >   perf record off-cpu: processing of embedded sample
> > >   perf record off-cpu: save embedded sample type
> > >   perf record off-cpu: Add direct off-cpu test
> > >
> > >  tools/perf/builtin-record.c             |   2 +
> > >  tools/perf/builtin-script.c             |   2 +-
> > >  tools/perf/tests/builtin-test.c         |   1 +
> > >  tools/perf/tests/shell/record_offcpu.sh |  29 +++++
> > >  tools/perf/tests/tests.h                |   1 +
> > >  tools/perf/tests/workloads/Build        |   1 +
> > >  tools/perf/tests/workloads/offcpu.c     |  16 +++
> > >  tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
> > >  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
> > >  tools/perf/util/evsel.c                 |  16 ++-
> > >  tools/perf/util/evsel.h                 |  13 +++
> > >  tools/perf/util/header.c                |  12 ++
> > >  tools/perf/util/off_cpu.h               |   1 +
> > >  tools/perf/util/record.h                |   1 +
> > >  tools/perf/util/session.c               |  23 +++-
> > >  15 files changed, 309 insertions(+), 5 deletions(-)
> > >  create mode 100644 tools/perf/tests/workloads/offcpu.c
> > >
> > > --
> > > 2.45.2
> > >
Re: [PATCH v3 0/5] Dump off-cpu samples directly
Posted by Arnaldo Carvalho de Melo 2 months, 2 weeks ago
On Wed, Jul 31, 2024 at 10:46:18AM -0700, Namhyung Kim wrote:
> On Mon, Jul 29, 2024 at 11:24:47PM +0800, Howard Chu wrote:
> > On Mon, Jul 29, 2024 at 9:21 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > > On Fri, Jul 26, 2024 at 06:28:21PM +0800, Howard Chu wrote:
> > > > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323

> > > > Currently, off-cpu samples are dumped when perf record is exiting. This
> > > > results in off-cpu samples being after the regular samples. This patch
> > > > series makes possible dumping off-cpu samples on-the-fly, directly into
> > > > perf ring buffer. And it dispatches those samples to the correct format
> > > > for perf.data consumers.

> > > Thanks for your work!

> > > But I'm not sure we need a separate event for offcpu-time-direct.  If we
> > > fix the format for the direct event, we can adjust the format of offcpu-
> > > time when it dumps at the end.

> > Thank you and Ian for this advice, I'll do that.

> > > Anyway, as far as I can see you don't need to fill the sample info in
> > > the offcpu-time-direct manually in your BPF program.  Because the
> > > bpf_perf_event_output() will call perf_event_output() which fills all
> > > the sample information according to the sample_type flags.

> > > Well.. it'll set IP to the schedule function, but it should be ok.
> > > (updating IP using CALLCHAIN like in off_cpu_write() is a kinda hack and
> > > not absoluately necessary, probably I can get rid of it..  Let's go with
> > > simple for now.)

> > > So I think what you need is to ensure it has the uncessary flags.  And
> > > the only info it needs to fill is the time between the previous schedule
> > > and this can be added to the raw data.

I wonder if there wouldn't be other kernel information about things that
may have affected the time it took for the task to be off-cpu, maybe
system load average, C/P states, but then it would be overengineering, I
think, just thought about what else we could add that could help
understanding the off-cpu time that could be obtained from the vantage
point of BPF attached to sched-out and sched-in, things we could collect
at sched-out in addition to the timestamp and ditto at sched-in, but I'm
no scheduler expert, so take this just as some brainstorming. Maybe we
could have some sort of sample_type specific to this off-cpu event that
would allow us to ask for extra info in an extensible way. We can start
with just PERF_OFFCPU_SAMPLE_TIMESTAMP...

> > Sure thing, thank you. Other than the off-cpu duration, do you think
> > we should collect the callchain as well?
 
> I think the kernel will do that for you once you set the
> SAMPLE_CALLCHAIN flag in the event attr.

Yes, we should not reinvent the wheel for all things that can be asked
from the kernel perf subsystem, only using the raw-data payload on the
bpf-output event for things we can't get from the perf subsystem, and
that is the timestamp for a previous event stored in a BPF map, looking
if the delta to the current time (on a sched-in event) is over the
threshold and then recording this time on this specific "made-up on the
fly using BPF" event that appears on the ring buffer just like any other
"native" events such as tracepoints, hardware events, cache events, etc.

- Arnaldo