[PATCH 0/8] tracing: Allow system call tracepoints to handle page faults

Mathieu Desnoyers posted 8 patches 2 months, 3 weeks ago
There is a newer version of this series
[PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
Posted by Mathieu Desnoyers 2 months, 3 weeks ago
Wire up the system call tracepoints with Tasks Trace RCU to allow
the ftrace, perf, and eBPF tracers to handle page faults.

This series does the initial wire-up allowing tracers to handle page
faults, but leaves out the actual handling of said page faults as future
work.

This series was compile and runtime tested with ftrace and perf syscall
tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
generated code to validate that the intended probes are used for raw
syscall tracing. The might_fault() added within those probes validates
that they are called from a context where handling a page fault is OK.

For ebpf, this series is compile-tested only.

This series replaces the "Faultable Tracepoints v6" series found at [1].

Thanks,

Mathieu

Link: https://lore.kernel.org/lkml/20240828144153.829582-1-mathieu.desnoyers@efficios.com/ # [1]
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: linux-trace-kernel@vger.kernel.org

Mathieu Desnoyers (8):
  tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
  tracing/ftrace: guard syscall probe with preempt_notrace
  tracing/perf: guard syscall probe with preempt_notrace
  tracing/bpf: guard syscall probe with preempt_notrace
  tracing: Allow system call tracepoints to handle page faults
  tracing/ftrace: Add might_fault check to syscall probes
  tracing/perf: Add might_fault check to syscall probes
  tracing/bpf: Add might_fault check to syscall probes

 include/linux/tracepoint.h      | 87 +++++++++++++++++++++++++--------
 include/trace/bpf_probe.h       | 13 +++++
 include/trace/define_trace.h    |  5 ++
 include/trace/events/syscalls.h |  4 +-
 include/trace/perf.h            | 43 ++++++++++++++--
 include/trace/trace_events.h    | 61 +++++++++++++++++++++--
 init/Kconfig                    |  1 +
 kernel/entry/common.c           |  4 +-
 kernel/trace/trace_syscalls.c   | 36 ++++++++++++--
 9 files changed, 218 insertions(+), 36 deletions(-)

-- 
2.39.2
Re: [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
Posted by Andrii Nakryiko 2 months, 3 weeks ago
On Mon, Sep 9, 2024 at 1:17 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Wire up the system call tracepoints with Tasks Trace RCU to allow
> the ftrace, perf, and eBPF tracers to handle page faults.
>
> This series does the initial wire-up allowing tracers to handle page
> faults, but leaves out the actual handling of said page faults as future
> work.
>
> This series was compile and runtime tested with ftrace and perf syscall
> tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
> generated code to validate that the intended probes are used for raw
> syscall tracing. The might_fault() added within those probes validates
> that they are called from a context where handling a page fault is OK.
>
> For ebpf, this series is compile-tested only.

What tree/branch was this based on? I can't apply it cleanly anywhere I tried...

>
> This series replaces the "Faultable Tracepoints v6" series found at [1].
>
> Thanks,
>
> Mathieu
>
> Link: https://lore.kernel.org/lkml/20240828144153.829582-1-mathieu.desnoyers@efficios.com/ # [1]
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: linux-trace-kernel@vger.kernel.org
>
> Mathieu Desnoyers (8):
>   tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
>   tracing/ftrace: guard syscall probe with preempt_notrace
>   tracing/perf: guard syscall probe with preempt_notrace
>   tracing/bpf: guard syscall probe with preempt_notrace
>   tracing: Allow system call tracepoints to handle page faults
>   tracing/ftrace: Add might_fault check to syscall probes
>   tracing/perf: Add might_fault check to syscall probes
>   tracing/bpf: Add might_fault check to syscall probes
>
>  include/linux/tracepoint.h      | 87 +++++++++++++++++++++++++--------
>  include/trace/bpf_probe.h       | 13 +++++
>  include/trace/define_trace.h    |  5 ++
>  include/trace/events/syscalls.h |  4 +-
>  include/trace/perf.h            | 43 ++++++++++++++--
>  include/trace/trace_events.h    | 61 +++++++++++++++++++++--
>  init/Kconfig                    |  1 +
>  kernel/entry/common.c           |  4 +-
>  kernel/trace/trace_syscalls.c   | 36 ++++++++++++--
>  9 files changed, 218 insertions(+), 36 deletions(-)
>
> --
> 2.39.2
Re: [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
Posted by Mathieu Desnoyers 2 months, 3 weeks ago
On 2024-09-09 19:53, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 1:17 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> Wire up the system call tracepoints with Tasks Trace RCU to allow
>> the ftrace, perf, and eBPF tracers to handle page faults.
>>
>> This series does the initial wire-up allowing tracers to handle page
>> faults, but leaves out the actual handling of said page faults as future
>> work.
>>
>> This series was compile and runtime tested with ftrace and perf syscall
>> tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
>> generated code to validate that the intended probes are used for raw
>> syscall tracing. The might_fault() added within those probes validates
>> that they are called from a context where handling a page fault is OK.
>>
>> For ebpf, this series is compile-tested only.
> 
> What tree/branch was this based on? I can't apply it cleanly anywhere I tried...

This series was based on tag v6.10.6.

Sorry, I should have included this information in patch 0.

Thanks,

Mathieu

> 
>>
>> This series replaces the "Faultable Tracepoints v6" series found at [1].
>>
>> Thanks,
>>
>> Mathieu
>>
>> Link: https://lore.kernel.org/lkml/20240828144153.829582-1-mathieu.desnoyers@efficios.com/ # [1]
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> Cc: Namhyung Kim <namhyung@kernel.org>
>> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
>> Cc: bpf@vger.kernel.org
>> Cc: Joel Fernandes <joel@joelfernandes.org>
>> Cc: linux-trace-kernel@vger.kernel.org
>>
>> Mathieu Desnoyers (8):
>>    tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
>>    tracing/ftrace: guard syscall probe with preempt_notrace
>>    tracing/perf: guard syscall probe with preempt_notrace
>>    tracing/bpf: guard syscall probe with preempt_notrace
>>    tracing: Allow system call tracepoints to handle page faults
>>    tracing/ftrace: Add might_fault check to syscall probes
>>    tracing/perf: Add might_fault check to syscall probes
>>    tracing/bpf: Add might_fault check to syscall probes
>>
>>   include/linux/tracepoint.h      | 87 +++++++++++++++++++++++++--------
>>   include/trace/bpf_probe.h       | 13 +++++
>>   include/trace/define_trace.h    |  5 ++
>>   include/trace/events/syscalls.h |  4 +-
>>   include/trace/perf.h            | 43 ++++++++++++++--
>>   include/trace/trace_events.h    | 61 +++++++++++++++++++++--
>>   init/Kconfig                    |  1 +
>>   kernel/entry/common.c           |  4 +-
>>   kernel/trace/trace_syscalls.c   | 36 ++++++++++++--
>>   9 files changed, 218 insertions(+), 36 deletions(-)
>>
>> --
>> 2.39.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
Posted by Andrii Nakryiko 2 months, 2 weeks ago
On Mon, Sep 9, 2024 at 5:36 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-09-09 19:53, Andrii Nakryiko wrote:
> > On Mon, Sep 9, 2024 at 1:17 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> Wire up the system call tracepoints with Tasks Trace RCU to allow
> >> the ftrace, perf, and eBPF tracers to handle page faults.
> >>
> >> This series does the initial wire-up allowing tracers to handle page
> >> faults, but leaves out the actual handling of said page faults as future
> >> work.
> >>
> >> This series was compile and runtime tested with ftrace and perf syscall
> >> tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
> >> generated code to validate that the intended probes are used for raw
> >> syscall tracing. The might_fault() added within those probes validates
> >> that they are called from a context where handling a page fault is OK.
> >>
> >> For ebpf, this series is compile-tested only.
> >
> > What tree/branch was this based on? I can't apply it cleanly anywhere I tried...
>
> This series was based on tag v6.10.6
>

Didn't find 6.10.6, but it applied cleanly to 6.10.3. I tested that
the BPF parts work:

Tested-by: Andrii Nakryiko <andrii@kernel.org> # BPF parts

> Sorry I should have included this information in patch 0.
>
> Thanks,
>
> Mathieu
>
> >
> >>
> >> This series replaces the "Faultable Tracepoints v6" series found at [1].
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> Link: https://lore.kernel.org/lkml/20240828144153.829582-1-mathieu.desnoyers@efficios.com/ # [1]
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Alexei Starovoitov <ast@kernel.org>
> >> Cc: Yonghong Song <yhs@fb.com>
> >> Cc: Paul E. McKenney <paulmck@kernel.org>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> >> Cc: Mark Rutland <mark.rutland@arm.com>
> >> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> >> Cc: Namhyung Kim <namhyung@kernel.org>
> >> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> >> Cc: bpf@vger.kernel.org
> >> Cc: Joel Fernandes <joel@joelfernandes.org>
> >> Cc: linux-trace-kernel@vger.kernel.org
> >>
> >> Mathieu Desnoyers (8):
> >>    tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
> >>    tracing/ftrace: guard syscall probe with preempt_notrace
> >>    tracing/perf: guard syscall probe with preempt_notrace
> >>    tracing/bpf: guard syscall probe with preempt_notrace
> >>    tracing: Allow system call tracepoints to handle page faults
> >>    tracing/ftrace: Add might_fault check to syscall probes
> >>    tracing/perf: Add might_fault check to syscall probes
> >>    tracing/bpf: Add might_fault check to syscall probes
> >>
> >>   include/linux/tracepoint.h      | 87 +++++++++++++++++++++++++--------
> >>   include/trace/bpf_probe.h       | 13 +++++
> >>   include/trace/define_trace.h    |  5 ++
> >>   include/trace/events/syscalls.h |  4 +-
> >>   include/trace/perf.h            | 43 ++++++++++++++--
> >>   include/trace/trace_events.h    | 61 +++++++++++++++++++++--
> >>   init/Kconfig                    |  1 +
> >>   kernel/entry/common.c           |  4 +-
> >>   kernel/trace/trace_syscalls.c   | 36 ++++++++++++--
> >>   9 files changed, 218 insertions(+), 36 deletions(-)
> >>
> >> --
> >> 2.39.2
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Re: [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
Posted by Masami Hiramatsu (Google) 2 months, 2 weeks ago
On Mon,  9 Sep 2024 16:16:44 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Wire up the system call tracepoints with Tasks Trace RCU to allow
> the ftrace, perf, and eBPF tracers to handle page faults.
> 
> This series does the initial wire-up allowing tracers to handle page
> faults, but leaves out the actual handling of said page faults as future
> work.
> 
> This series was compile and runtime tested with ftrace and perf syscall
> tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
> generated code to validate that the intended probes are used for raw
> syscall tracing. The might_fault() added within those probes validates
> that they are called from a context where handling a page fault is OK.

I think this series itself is valuable.
However, I'm still not sure why ftrace needs to handle page faults.
This allows the syscall trace-event itself to handle page faults, but the
raw-syscall/syscall events only access registers, right?

I think that page faults happen only when dereferencing those registers
as pointers to data structures, and currently that is done by probes
like eprobe and fprobe. In order to handle faults in those probes, we
need to change how they write data to the per-cpu ring buffer.

Currently, those probes reserve an entry on the ring buffer, write the
dereferenced data to the entry, and commit it. Preemption stays disabled
during this reserve-write-commit operation, so we need another buffer
on the stack to hold the dereferenced data, which is then copied into
the entry.

Thank you,


> 
> For ebpf, this series is compile-tested only.
> 
> This series replaces the "Faultable Tracepoints v6" series found at [1].
> 
> Thanks,
> 
> Mathieu
> 
> Link: https://lore.kernel.org/lkml/20240828144153.829582-1-mathieu.desnoyers@efficios.com/ # [1]
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: linux-trace-kernel@vger.kernel.org
> 
> Mathieu Desnoyers (8):
>   tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
>   tracing/ftrace: guard syscall probe with preempt_notrace
>   tracing/perf: guard syscall probe with preempt_notrace
>   tracing/bpf: guard syscall probe with preempt_notrace
>   tracing: Allow system call tracepoints to handle page faults
>   tracing/ftrace: Add might_fault check to syscall probes
>   tracing/perf: Add might_fault check to syscall probes
>   tracing/bpf: Add might_fault check to syscall probes
> 
>  include/linux/tracepoint.h      | 87 +++++++++++++++++++++++++--------
>  include/trace/bpf_probe.h       | 13 +++++
>  include/trace/define_trace.h    |  5 ++
>  include/trace/events/syscalls.h |  4 +-
>  include/trace/perf.h            | 43 ++++++++++++++--
>  include/trace/trace_events.h    | 61 +++++++++++++++++++++--
>  init/Kconfig                    |  1 +
>  kernel/entry/common.c           |  4 +-
>  kernel/trace/trace_syscalls.c   | 36 ++++++++++++--
>  9 files changed, 218 insertions(+), 36 deletions(-)
> 
> -- 
> 2.39.2


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
Re: [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
Posted by Mathieu Desnoyers 2 months, 1 week ago
On 2024-09-16 21:49, Masami Hiramatsu (Google) wrote:
> On Mon,  9 Sep 2024 16:16:44 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> Wire up the system call tracepoints with Tasks Trace RCU to allow
>> the ftrace, perf, and eBPF tracers to handle page faults.
>>
>> This series does the initial wire-up allowing tracers to handle page
>> faults, but leaves out the actual handling of said page faults as future
>> work.
>>
>> This series was compile and runtime tested with ftrace and perf syscall
>> tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
>> generated code to validate that the intended probes are used for raw
>> syscall tracing. The might_fault() added within those probes validates
>> that they are called from a context where handling a page fault is OK.
> 
> I think this series itself is valuable.
> However, I'm still not sure why ftrace needs to handle page faults.
> This allows the syscall trace-event itself to handle page faults, but the
> raw-syscall/syscall events only access registers, right?

You are correct that ftrace only accesses registers as of today. And
maybe that will stay the focus for ftrace, as its focus appears to be
more about what happens inside the kernel than about causality from
user-space. But different tracers have different focuses and use-cases.

It's a different story for eBPF and LTTng though: LTTng grabs filename
strings from user-space for the openat system call, for instance, so
we can reconstruct which system calls were done on which files at
post-processing. This is convenient if the end user wishes to focus
on the activity for a given file or set of files.

eBPF also allows grabbing userspace data AFAIR, but none of those
tracers can handle page faults because tracepoints disable preemption,
which leads to missing data in specific cases, e.g. immediately after an
execve syscall when pages are not faulted in yet.

Also having syscall entry called from a context that can handle
preemption would allow LTTng (or eBPF) to do an immediate stackwalk
(see the sframe work from Josh) directly at system call entry. This
can be useful for filtering based on the user callstack before writing
to a ring buffer.

> 
> I think that page faults happen only when dereferencing those registers
> as pointers to data structures, and currently that is done by probes
> like eprobe and fprobe. In order to handle faults in those probes, we
> need to change how they write data to the per-cpu ring buffer.
> 
> Currently, those probes reserve an entry on the ring buffer, write the
> dereferenced data to the entry, and commit it. Preemption stays disabled
> during this reserve-write-commit operation, so we need another buffer
> on the stack to hold the dereferenced data, which is then copied into
> the entry.

There are quite a few approaches we can take there, with different
tradeoffs.

A) Issue dummy loads of user-space data just to trigger page faults
before disabling preemption. Unless the system has extreme memory
pressure, it should be enough to page in the data and it should stay
available for copy into the ring buffer immediately after with preemption
disabled. This should be fine for practical purposes. This is simple to
implement and is the route I intend to take initially for LTTng.

B) Do a copy into a local buffer and take page faults at that point. This
raises the question of where to allocate the buffer. This also requires an
extra copy: from userspace to the local buffer, then to the per-cpu ring
buffer, so it may come with a certain overhead. One advantage of that
approach is that it opens the door to fixing the TOCTOU races that syscall
audit systems (e.g. seccomp) have, if we change the system call
implementation to use data from this argument copy rather than re-reading
it from userspace within the system call. But this is a much larger
endeavor that should be done in collaboration between the tracing &
seccomp developers.

C) Modify the ring buffer to make it usable without disabling preemption.
It's straightforward in LTTng because its ring buffer has been designed
to be usable in preemptible userspace context as well (LTTng-UST).
This may not be as easy for ftrace since disabling preemption is
rooted deep in its ring buffer design.

Besides those basic tradeoffs, we should of course consider the overhead
associated with each approach.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com