From: Mathieu Desnoyers
To: Steven Rostedt
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Michael Jeanson,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	"Paul E . McKenney", Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	bpf@vger.kernel.org, Joel Fernandes
Subject: [RFC PATCH v3 5/5] tracing: convert sys_enter/exit to faultable tracepoints
Date: Mon, 2 Oct 2023 16:25:31 -0400
Message-Id: <20231002202531.3160-6-mathieu.desnoyers@efficios.com>
In-Reply-To: <20231002202531.3160-1-mathieu.desnoyers@efficios.com>
References: <20231002202531.3160-1-mathieu.desnoyers@efficios.com>

Convert the definition of the system call enter/exit tracepoints to
faultable tracepoints now that all upstream tracers handle it.

This allows tracers to fault-in userspace system call arguments such as
path strings within their probe callbacks.

Co-developed-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Michael Jeanson
Cc: Steven Rostedt (VMware)
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes
---
 include/trace/events/syscalls.h |  4 +-
 kernel/trace/trace_syscalls.c   | 92 +++++++++++++++++++++++----------
 2 files changed, 68 insertions(+), 28 deletions(-)

diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index b6e0cbc2c71f..dc30e3004818 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -15,7 +15,7 @@
 
 #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
 
-TRACE_EVENT_FN(sys_enter,
+TRACE_EVENT_FN_MAY_FAULT(sys_enter,
 
 	TP_PROTO(struct pt_regs *regs, long id),
 
@@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter,
 
 TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
 
-TRACE_EVENT_FN(sys_exit,
+TRACE_EVENT_FN_MAY_FAULT(sys_exit,
 
 	TP_PROTO(struct pt_regs *regs, long ret),
 
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 942ddbdace4a..e4414f7bdbe7 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -299,27 +299,33 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	int syscall_nr;
 	int size;
 
+	/*
+	 * Probe called with preemption enabled (may_fault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 
 	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
 	trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
 	if (!trace_file)
-		return;
+		goto end;
 
 	if (trace_trigger_soft_disabled(trace_file))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
 
 	entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
 	if (!entry)
-		return;
+		goto end;
 
 	entry = ring_buffer_event_data(fbuffer.event);
 	entry->nr = syscall_nr;
@@ -327,6 +333,8 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
 
 	trace_event_buffer_commit(&fbuffer);
+end:
+	preempt_enable_notrace();
 }
 
 static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
@@ -338,31 +346,39 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	struct trace_event_buffer fbuffer;
 	int syscall_nr;
 
+	/*
+	 * Probe called with preemption enabled (may_fault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 
 	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
 	trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
 	if (!trace_file)
-		return;
+		goto end;
 
 	if (trace_trigger_soft_disabled(trace_file))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	entry = trace_event_buffer_reserve(&fbuffer, trace_file, sizeof(*entry));
 	if (!entry)
-		return;
+		goto end;
 
 	entry = ring_buffer_event_data(fbuffer.event);
 	entry->nr = syscall_nr;
 	entry->ret = syscall_get_return_value(current, regs);
 
 	trace_event_buffer_commit(&fbuffer);
+end:
+	preempt_enable_notrace();
 }
 
 static int reg_event_syscall_enter(struct trace_event_file *file,
@@ -377,7 +393,9 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
 	if (!tr->sys_refcount_enter)
-		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
+		ret = register_trace_prio_flags_sys_enter(ftrace_syscall_enter, tr,
+							  TRACEPOINT_DEFAULT_PRIO,
+							  TRACEPOINT_MAY_FAULT);
 	if (!ret) {
 		rcu_assign_pointer(tr->enter_syscall_files[num], file);
 		tr->sys_refcount_enter++;
@@ -415,7 +433,9 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
 	if (!tr->sys_refcount_exit)
-		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
+		ret = register_trace_prio_flags_sys_exit(ftrace_syscall_exit, tr,
+							 TRACEPOINT_DEFAULT_PRIO,
+							 TRACEPOINT_MAY_FAULT);
 	if (!ret) {
 		rcu_assign_pointer(tr->exit_syscall_files[num], file);
 		tr->sys_refcount_exit++;
@@ -579,20 +599,26 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	int rctx;
 	int size;
 
+	/*
+	 * Probe called with preemption enabled (may_fault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 	if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	head = this_cpu_ptr(sys_data->enter_event->perf_events);
 	valid_prog_array = bpf_prog_array_valid(sys_data->enter_event);
 	if (!valid_prog_array && hlist_empty(head))
-		return;
+		goto end;
 
 	/* get the size after alignment with the u32 buffer size field */
 	size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
@@ -601,7 +627,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 
 	rec = perf_trace_buf_alloc(size, NULL, &rctx);
 	if (!rec)
-		return;
+		goto end;
 
 	rec->nr = syscall_nr;
 	syscall_get_arguments(current, regs, args);
@@ -611,12 +637,14 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	     !perf_call_bpf_enter(sys_data->enter_event, regs, sys_data, rec)) ||
 	    hlist_empty(head)) {
 		perf_swevent_put_recursion_context(rctx);
-		return;
+		goto end;
 	}
 
 	perf_trace_buf_submit(rec, size, rctx,
 			      sys_data->enter_event->event.type, 1, regs,
 			      head, NULL);
+end:
+	preempt_enable_notrace();
 }
 
 static int perf_sysenter_enable(struct trace_event_call *call)
@@ -628,7 +656,9 @@ static int perf_sysenter_enable(struct trace_event_call *call)
 
 	mutex_lock(&syscall_trace_lock);
 	if (!sys_perf_refcount_enter)
-		ret = register_trace_sys_enter(perf_syscall_enter, NULL);
+		ret = register_trace_prio_flags_sys_enter(perf_syscall_enter, NULL,
+							  TRACEPOINT_DEFAULT_PRIO,
+							  TRACEPOINT_MAY_FAULT);
 	if (ret) {
 		pr_info("event trace: Could not activate syscall entry trace point");
 	} else {
@@ -678,20 +708,26 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	int rctx;
 	int size;
 
+	/*
+	 * Probe called with preemption enabled (may_fault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 	if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	head = this_cpu_ptr(sys_data->exit_event->perf_events);
 	valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
 	if (!valid_prog_array && hlist_empty(head))
-		return;
+		goto end;
 
 	/* We can probably do that at build time */
 	size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
@@ -699,7 +735,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 
 	rec = perf_trace_buf_alloc(size, NULL, &rctx);
 	if (!rec)
-		return;
+		goto end;
 
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
@@ -708,11 +744,13 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	     !perf_call_bpf_exit(sys_data->exit_event, regs, rec)) ||
 	    hlist_empty(head)) {
 		perf_swevent_put_recursion_context(rctx);
-		return;
+		goto end;
 	}
 
 	perf_trace_buf_submit(rec, size, rctx, sys_data->exit_event->event.type,
 			      1, regs, head, NULL);
+end:
+	preempt_enable_notrace();
 }
 
 static int perf_sysexit_enable(struct trace_event_call *call)
@@ -724,7 +762,9 @@ static int perf_sysexit_enable(struct trace_event_call *call)
 
 	mutex_lock(&syscall_trace_lock);
 	if (!sys_perf_refcount_exit)
-		ret = register_trace_sys_exit(perf_syscall_exit, NULL);
+		ret = register_trace_prio_flags_sys_exit(perf_syscall_exit, NULL,
+							 TRACEPOINT_DEFAULT_PRIO,
+							 TRACEPOINT_MAY_FAULT);
 	if (ret) {
 		pr_info("event trace: Could not activate syscall exit trace point");
 	} else {
-- 
2.25.1