From nobody Thu Oct 2 03:28:38 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 515C330FC3D; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632754; cv=none; b=uHIvn16HKbJRzVCqdyWYfBFHGAyF7ArfWRZA7UOG097NsP0O1aCz239dbGIC8wuL3E33EpUVVO0K4uoW9cZB5tisfVmeFY78SbIH5h+BYi69sHDBvnL87mgQayg5gPsWxIKjDtOnDFfyvEllIf+uKgFvRDZAiXmMqtNXWBySuYg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632754; c=relaxed/simple; bh=txOR/egCsh57GZfAXHNisq1KB0ET6MZ/gLsLbqj4Ck8=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=q5xcEcdC4wdPwxcN3duaV7epfsPWXUcIGvIjQTvHZfKD1ANrIiB2Gv8jRTnuCwcLRTRxOGp5zIRHzpwhfHQcq1WWuD3x/qkezTFAUGqAl1Rsg/Er7vbGd0fCd/bnRX3LsGFjIiUZc7+8CibGmjcCRlv7AOMMV8mXn1H6+n0Bv5s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=PgFpVxRi; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="PgFpVxRi" Received: by smtp.kernel.org (Postfix) with ESMTPSA id EF462C113D0; Tue, 23 Sep 2025 13:05:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1758632754; bh=txOR/egCsh57GZfAXHNisq1KB0ET6MZ/gLsLbqj4Ck8=; h=Date:From:To:Cc:Subject:References:From; b=PgFpVxRi5gnK2KZLq4QbtEGvm1eef3QblRmJFbLDKsCmc7By/OO9uoSr+nBEdvWRz PFN60eyzvhTJmdf1QyxEVuyvkVzVSE7PV8cRumuuWiQsfxQ2DBbEGmE3c+Fg5R+Mym 8qDWF9geibUR7nyahBd6dC1nE5TsCwtwIAF44m1WExE9Vt3AOFDgE9vNSA6KeFdovn +nk4h3W6f0FwaaubKg9lCvubjb79cOlzo98URNsWhUnQcwCKur4N+T0iAtAvfgzVwD 
mKFj6DtlBMoa+SzBHzm6fir4XYwr4BRshQz1+7NKr7BPqUpxmyu4+P3xVTYuD37MBb C03iwcMHPmWFA== Received: from rostedt by gandalf with local (Exim 4.98.2) (envelope-from ) id 1v12jh-0000000Cop9-37nB; Tue, 23 Sep 2025 09:07:13 -0400 Message-ID: <20250923130713.594320290@kernel.org> User-Agent: quilt/0.68 Date: Tue, 23 Sep 2025 09:04:58 -0400 From: Steven Rostedt To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Cc: Masami Hiramatsu , Mark Rutland , Mathieu Desnoyers , Andrew Morton , Peter Zijlstra , Namhyung Kim , Takaya Saeki , Tom Zanussi , Thomas Gleixner , Ian Rogers , Douglas Raillard , "Paul E. McKenney" Subject: [PATCH v2 1/8] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() References: <20250923130457.901085554@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Steven Rostedt The syscall events are pseudo events that hook to the raw syscalls. The ftrace_syscall_enter/exit() callback is called by the raw_syscall enter/exit tracepoints respectively whenever any of the syscall events are enabled. The trace_array has an array of syscall "files" that correspond to the system calls based on their __NR_SYSCALL number. The array is read and if there's a pointer to a trace_event_file then it is considered enabled and if it is NULL that syscall event is considered disabled. Currently it uses an rcu_dereference_sched() to get this pointer and a rcu_assign_pointer() or RCU_INIT_POINTER() to write to it. This is unnecessary as the file pointer will not go away outside the synchronization of the tracepoint logic itself. And this code adds no extra RCU synchronization that uses this. Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which is all they need. 
This will also allow this code to not depend on preemption being disabled as system call tracepoints are now allowed to fault. Reviewed-by: Paul E. McKenney Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/20250805193234.745705874@kernel.o= rg - Removed __rcu annotation to the fields that do not need RCU to protect them. kernel/trace/trace.h | 4 ++-- kernel/trace/trace_syscalls.c | 14 ++++++-------- 2 files changed, 8 insertions(+), 10 deletions(-) diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 5f4bed5842f9..85eabb454bee 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -380,8 +380,8 @@ struct trace_array { #ifdef CONFIG_FTRACE_SYSCALLS int sys_refcount_enter; int sys_refcount_exit; - struct trace_event_file __rcu *enter_syscall_files[NR_syscalls]; - struct trace_event_file __rcu *exit_syscall_files[NR_syscalls]; + struct trace_event_file *enter_syscall_files[NR_syscalls]; + struct trace_event_file *exit_syscall_files[NR_syscalls]; #endif int stop_count; int clock_id; diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 46aab0ab9350..3a0b65f89130 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -310,8 +310,7 @@ static void ftrace_syscall_enter(void *data, struct pt_= regs *regs, long id) if (syscall_nr < 0 || syscall_nr >=3D NR_syscalls) return; =20 - /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */ - trace_file =3D rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]); + trace_file =3D READ_ONCE(tr->enter_syscall_files[syscall_nr]); if (!trace_file) return; =20 @@ -356,8 +355,7 @@ static void ftrace_syscall_exit(void *data, struct pt_r= egs *regs, long ret) if (syscall_nr < 0 || syscall_nr >=3D NR_syscalls) return; =20 - /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */ - trace_file =3D rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]); + trace_file =3D 
READ_ONCE(tr->exit_syscall_files[syscall_nr]); if (!trace_file) return; =20 @@ -393,7 +391,7 @@ static int reg_event_syscall_enter(struct trace_event_f= ile *file, if (!tr->sys_refcount_enter) ret =3D register_trace_sys_enter(ftrace_syscall_enter, tr); if (!ret) { - rcu_assign_pointer(tr->enter_syscall_files[num], file); + WRITE_ONCE(tr->enter_syscall_files[num], file); tr->sys_refcount_enter++; } mutex_unlock(&syscall_trace_lock); @@ -411,7 +409,7 @@ static void unreg_event_syscall_enter(struct trace_even= t_file *file, return; mutex_lock(&syscall_trace_lock); tr->sys_refcount_enter--; - RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL); + WRITE_ONCE(tr->enter_syscall_files[num], NULL); if (!tr->sys_refcount_enter) unregister_trace_sys_enter(ftrace_syscall_enter, tr); mutex_unlock(&syscall_trace_lock); @@ -431,7 +429,7 @@ static int reg_event_syscall_exit(struct trace_event_fi= le *file, if (!tr->sys_refcount_exit) ret =3D register_trace_sys_exit(ftrace_syscall_exit, tr); if (!ret) { - rcu_assign_pointer(tr->exit_syscall_files[num], file); + WRITE_ONCE(tr->exit_syscall_files[num], file); tr->sys_refcount_exit++; } mutex_unlock(&syscall_trace_lock); @@ -449,7 +447,7 @@ static void unreg_event_syscall_exit(struct trace_event= _file *file, return; mutex_lock(&syscall_trace_lock); tr->sys_refcount_exit--; - RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL); + WRITE_ONCE(tr->exit_syscall_files[num], NULL); if (!tr->sys_refcount_exit) unregister_trace_sys_exit(ftrace_syscall_exit, tr); mutex_unlock(&syscall_trace_lock); --=20 2.50.1 From nobody Thu Oct 2 03:28:38 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4AFE31A9F90; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: 
i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632754; cv=none; b=EhOtKdaBKjqN7NpZqzpbV5uovhNaIVRADfPtydyU60VwWGprmYVXRp9iYf+l1kEwlltz/A2eQG7sxLtDkW9pWlBnNmqawTVIHuFlnBe46bXWiiBExDUhyV98z7d62x844+qLfoCV0bJ9p0STwKuSyjQpL4AzR3uGVKJTAg826LU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632754; c=relaxed/simple; bh=Flt5aJNz1gfLnak7pHQFz0fCiVR4gaqzsAxMb+syYyg=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=OmMSJZ4kxfxw4bbQODQ2KfhceCAhvX3RrZv7zd/f5KROlWiswh/grQ21VMvuuJPF423ZXatELFfd57qyA4WsYeknxv/UgsEwbZJ9DXlRlOgi57SftD7ehOeooCHqNl7FpWMK8OHrhCUhiuW1/T2kVvkYIgoqR4cPqU6DUuUR00Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=HtNzpZ++; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="HtNzpZ++" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1DB99C19421; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1758632754; bh=Flt5aJNz1gfLnak7pHQFz0fCiVR4gaqzsAxMb+syYyg=; h=Date:From:To:Cc:Subject:References:From; b=HtNzpZ++JQ4yNTdsyG4pEdKYTqdsL60q1NGaGQFAUD6FMvkYEkZ5GaivBSQnKbx1G frcS0zklJldMYBosprmqNeQLoGkXGds2I4DLu9BAGCES4Tzz8ZxU+gDBC+YTApoJ1M kTfw6wE3TrkKn7xYbS2Py+5DfWCRhWHzopiUvpKAAO6JZRF+QFxjksucxbXzZcu4pE ZHmwMADMzhNX77weilt+Nvcaxj40n5PYmmExw0J3+r10sgSjCprPO+SUq6VUqnORQV 1bzvwdeh22I0wHAlOnernQ9tq+F4gkhX8YKyoKqhmU+DhuMkZI5DqbKEwnVZR4vO+H 7W4Rr5WDsspzQ== Received: from rostedt by gandalf with local (Exim 4.98.2) (envelope-from ) id 1v12jh-0000000Copf-3q6T; Tue, 23 Sep 2025 09:07:13 -0400 Message-ID: <20250923130713.764558957@kernel.org> User-Agent: quilt/0.68 Date: Tue, 23 Sep 2025 09:04:59 -0400 From: Steven Rostedt To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Cc: Masami Hiramatsu , Mark 
Rutland , Mathieu Desnoyers , Andrew Morton , Peter Zijlstra , Namhyung Kim , Takaya Saeki , Tom Zanussi , Thomas Gleixner , Ian Rogers , Douglas Raillard Subject: [PATCH v2 2/8] tracing: Have syscall trace events show "0x" for values greater than 10 References: <20250923130457.901085554@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Steven Rostedt Currently the syscall trace events show each value as hexadecimal, but without adding "0x" it can be confusing: sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44) Looks like the above write wrote 44 bytes, when in reality it wrote 68 bytes. Add a "0x" for all values greater than or equal to 10 to remove the ambiguity. For values less than 10, leave off the "0x" as that just adds noise to the output. Also change the iterator to check if "i" is nonzero and print the ", " delimiter at the start, instead of adding that logic to the trace_seq_printf() at the end. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_syscalls.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 3a0b65f89130..0f932b22f9ec 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -153,14 +153,20 @@ print_syscall_enter(struct trace_iterator *iter, int = flags, if (trace_seq_has_overflowed(s)) goto end; =20 + if (i) + trace_seq_puts(s, ", "); + /* parameter types */ if (tr && tr->trace_flags & TRACE_ITER_VERBOSE) trace_seq_printf(s, "%s ", entry->types[i]); =20 /* parameter values */ - trace_seq_printf(s, "%s: %lx%s", entry->args[i], - trace->args[i], - i =3D=3D entry->nb_args - 1 ? 
"" : ", "); + if (trace->args[i] < 10) + trace_seq_printf(s, "%s: %lu", entry->args[i], + trace->args[i]); + else + trace_seq_printf(s, "%s: 0x%lx", entry->args[i], + trace->args[i]); } =20 trace_seq_putc(s, ')'); --=20 2.50.1 From nobody Thu Oct 2 03:28:38 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C5458324B07; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632754; cv=none; b=GcK9jMkeHZbU83gI4d1H9Rj+cHzwYDC/GCLsuZboNyKMRLBxkzUFy2XJY7u6XeGeV2mOTB1slLgi8aaMOnx5GdfuxHunu5BW8yZyw6NqRlIpMVnuGJ8SFiT0J86dY6ouZtZuW/cYJREowNtXh2S1lBa7293CUDaLj7YAbdLcAlk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632754; c=relaxed/simple; bh=rNBSgUDsBKlfStPzbudHOSIzVkIjyxeB/28FHnmxYo4=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=hYu3JUjE8r3bxbewCSN/MOcqjCaq5Ys4xx4tywA1jKr4p0/H2EVpgBiGD01YvPKwTdYZoKKVEfXJgvuym+tcfIDPyaw/I7T5gMYQzBu4v3UsQ743NIn0EvW3UKjTk9PIzezW57UF4NwMnWHKVXnUrOh63zaNY6Nj7ZKJufxg2V0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=nKdi6BsC; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="nKdi6BsC" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 59F6DC113CF; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1758632754; bh=rNBSgUDsBKlfStPzbudHOSIzVkIjyxeB/28FHnmxYo4=; h=Date:From:To:Cc:Subject:References:From; 
b=nKdi6BsCD9o7XGxlnjQVQseh0qU3/BHYSh9i6IRoKNmWA5+hxomHR0ZWWVVhHBeGq U67mnBsjS1guhGTVswHiyfsSKVyCWlMj7MfmShM6oLBtvTlPoQ95n+vtSqAOh9e4Aq kJ30UMQaTWa0nYv0KAP1mbNTNcwelvi8+V+YhBkbBDVpDX/tGHJConI6yC0YWkCQes iTh++o/5EKn0tB5wAKSOSyLSNU7WSL118NPkIXKSAr9+JLlGNJsL6di4RMVZthojrk hWBt74PKDKaAZO0E37RCPUoPlti1vNk05I7bPP4LgxE7vkr0ctOYqHQ1B/HVQLduQE iW54KLIM+56og== Received: from rostedt by gandalf with local (Exim 4.98.2) (envelope-from ) id 1v12ji-0000000Coq9-0Kt2; Tue, 23 Sep 2025 09:07:14 -0400 Message-ID: <20250923130713.936188500@kernel.org> User-Agent: quilt/0.68 Date: Tue, 23 Sep 2025 09:05:00 -0400 From: Steven Rostedt To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Cc: Masami Hiramatsu , Mark Rutland , Mathieu Desnoyers , Andrew Morton , Peter Zijlstra , Namhyung Kim , Takaya Saeki , Tom Zanussi , Thomas Gleixner , Ian Rogers , Douglas Raillard Subject: [PATCH v2 3/8] tracing: Have syscall trace events read user space string References: <20250923130457.901085554@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Steven Rostedt As of commit 654ced4a1377 ("tracing: Introduce tracepoint_is_faultable()") system call trace events allow faulting in user space memory. Have some of the system call trace events take advantage of this. Introduce a way to read strings that are nul terminated into the trace event. The way this is accomplished is by creating a per CPU temporary buffer that is used to read unsafe user memory. When a syscall trace event needs to read user memory, it reads the per CPU schedule switch counter. It then disables migration and enables preemption, copies the user space memory into this buffer, then disables preemption again. It reads the per CPU schedule switch counter again and if it matches it considers the buffer is valid. Otherwise it needs to try again. 
This is similar to how seqcount works, but uses the per CPU context switch counter as the sequence counter. The reason it uses the sched switch counter and not just a per CPU counter is that a plain per CPU counter wouldn't catch the case of: [task 1] cnt =3D this_cpu_inc(counter); preempt_enable() [task 2] cnt =3D this_cpu_inc(counter); preempt_enable(); buffer =3D task 2 data [task 1] buffer =3D task 1 data [task 2] preempt_disable(); if (cnt =3D=3D this_cpu_read(counter)) Will return true even though the buffer was corrupted. The syscall event has its nb_args shortened from an int to a short (where even u8 is plenty big enough) and the freed two bytes are used for "user_mask". The new "user_mask" field is a bitmask of the indexes in the "args" field array that hold an address to read from user space. This value is set to 0 if the system call event does not need to read user space for a field. This mask can be used to know if the event may fault or not. Only one bit set in user_mask is supported at this time. This allows the output to look like this: sys_access(filename: 0x7f8c55368470 "/etc/ld.so.preload", mode: 4) sys_execve(filename: 0x564ebcf5a6b8 "/usr/bin/emacs", argv: 0x7fff357c0300= , envp: 0x564ebc4a4820) Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/20250805193235.080757106@kernel.o= rg - Hide newfstatat around #if defined(__ARCH_WANT_NEW_STAT) || defined(__ARCH_WANT_STAT64) as parisc failed to build without it. (kernel test robot) - Fixed allocation of sinfo which used sizeof(sinfo) and not sizeof(*sinfo) (kernel test robot) - Instead of incrementing a counter via the sched_switch tracepoint, use the nr_context_switches_cpu() API. (Mathieu Desnoyers). - Use the length saved in the meta data of the event to limit the size of the string printed "%.*s", len, str. - Add comment describing that the method to read the memory from user space is similar to how seqcount works. 
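Note on the meta data layout referred to in the changelog above: the event carries one 32-bit meta word per user string, packed as (len << 16 | offset), and the printer limits its output with "%.*s". Below is a minimal user-space sketch of that packing, not kernel code. The helper names data_loc_pack(), data_loc_offset(), data_loc_len() and data_loc_format() are hypothetical and exist only for illustration; the 16-bit limits on offset and length are an assumption that follows from the packing itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: pack a dynamic string descriptor as (len << 16 | offset). */
static uint32_t data_loc_pack(uint32_t offset, uint32_t len)
{
	/* Assumption: both halves must fit in 16 bits for the packing to be lossless. */
	assert(offset <= 0xffff && len <= 0xffff);
	return (len << 16) | offset;
}

/* Low 16 bits: offset of the string from the start of the event. */
static uint32_t data_loc_offset(uint32_t val)
{
	return val & 0xffff;
}

/* High 16 bits: number of bytes stored for the string. */
static uint32_t data_loc_len(uint32_t val)
{
	return val >> 16;
}

/*
 * Hypothetical helper: format the string a descriptor refers to, quoted,
 * never reading more than the stored length (same idea as "%.*s", len, str).
 */
static int data_loc_format(char *out, size_t outsz, const void *event, uint32_t val)
{
	const char *str = (const char *)event + data_loc_offset(val);

	return snprintf(out, outsz, "\"%.*s\"", (int)data_loc_len(val), str);
}
```

For example, a string stored 20 bytes into the event with a stored length of 19 (18 characters plus the terminating nul) packs to 0x00130014. Because the precision caps the read, a descriptor whose length is shorter than the string prints only that many bytes even when no nul terminator is reached.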
include/trace/syscall.h | 4 +- kernel/trace/trace_syscalls.c | 480 ++++++++++++++++++++++++++++++++-- 2 files changed, 464 insertions(+), 20 deletions(-) diff --git a/include/trace/syscall.h b/include/trace/syscall.h index 8e193f3a33b3..85f21ca15a41 100644 --- a/include/trace/syscall.h +++ b/include/trace/syscall.h @@ -16,6 +16,7 @@ * @name: name of the syscall * @syscall_nr: number of the syscall * @nb_args: number of parameters it takes + * @user_mask: mask of @args that will read user space * @types: list of types as strings * @args: list of args as strings (args[i] matches types[i]) * @enter_fields: list of fields for syscall_enter trace event @@ -25,7 +26,8 @@ struct syscall_metadata { const char *name; int syscall_nr; - int nb_args; + short nb_args; + short user_mask; const char **types; const char **args; struct list_head enter_fields; diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 0f932b22f9ec..7ea763c07bb7 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include #include +#include #include #include #include @@ -123,6 +124,9 @@ const char *get_syscall_name(int syscall) return entry->name; } =20 +/* Added to user strings when max limit is reached */ +#define EXTRA "..." 
+ static enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags, struct trace_event *event) @@ -132,7 +136,9 @@ print_syscall_enter(struct trace_iterator *iter, int fl= ags, struct trace_entry *ent =3D iter->ent; struct syscall_trace_enter *trace; struct syscall_metadata *entry; - int i, syscall; + int i, syscall, val; + unsigned char *ptr; + int len; =20 trace =3D (typeof(trace))ent; syscall =3D trace->nr; @@ -167,6 +173,19 @@ print_syscall_enter(struct trace_iterator *iter, int f= lags, else trace_seq_printf(s, "%s: 0x%lx", entry->args[i], trace->args[i]); + + if (!(BIT(i) & entry->user_mask)) + continue; + + /* This arg points to a user space string */ + ptr =3D (void *)trace->args + sizeof(long) * entry->nb_args; + val =3D *(int *)ptr; + + /* The value is a dynamic string (len << 16 | offset) */ + ptr =3D (void *)ent + (val & 0xffff); + len =3D val >> 16; + + trace_seq_printf(s, " \"%.*s\"", len, ptr); } =20 trace_seq_putc(s, ')'); @@ -223,15 +242,27 @@ __set_enter_print_fmt(struct syscall_metadata *entry,= char *buf, int len) =20 pos +=3D snprintf(buf + pos, LEN_OR_ZERO, "\""); for (i =3D 0; i < entry->nb_args; i++) { - pos +=3D snprintf(buf + pos, LEN_OR_ZERO, "%s: 0x%%0%zulx%s", - entry->args[i], sizeof(unsigned long), - i =3D=3D entry->nb_args - 1 ? 
"" : ", "); + if (i) + pos +=3D snprintf(buf + pos, LEN_OR_ZERO, ", "); + pos +=3D snprintf(buf + pos, LEN_OR_ZERO, "%s: 0x%%0%zulx", + entry->args[i], sizeof(unsigned long)); + + if (!(BIT(i) & entry->user_mask)) + continue; + + /* Add the format for the user space string */ + pos +=3D snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\""); } pos +=3D snprintf(buf + pos, LEN_OR_ZERO, "\""); =20 for (i =3D 0; i < entry->nb_args; i++) { pos +=3D snprintf(buf + pos, LEN_OR_ZERO, ", ((unsigned long)(REC->%s))", entry->args[i]); + if (!(BIT(i) & entry->user_mask)) + continue; + /* The user space string for arg has name ___val */ + pos +=3D snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)", + entry->args[i]); } =20 #undef LEN_OR_ZERO @@ -277,8 +308,12 @@ static int __init syscall_enter_define_fields(struct t= race_event_call *call) { struct syscall_trace_enter trace; struct syscall_metadata *meta =3D call->data; + unsigned long mask; + char *arg; int offset =3D offsetof(typeof(trace), args); + int idx; int ret =3D 0; + int len; int i; =20 for (i =3D 0; i < meta->nb_args; i++) { @@ -291,9 +326,232 @@ static int __init syscall_enter_define_fields(struct = trace_event_call *call) offset +=3D sizeof(unsigned long); } =20 + if (ret || !meta->user_mask) + return ret; + + mask =3D meta->user_mask; + idx =3D ffs(mask) - 1; + + /* + * User space strings are faulted into a temporary buffer and then + * added as a dynamic string to the end of the event. + * The user space string name for the arg pointer is "___val". 
+ */ + len =3D strlen(meta->args[idx]) + sizeof("___val"); + arg =3D kmalloc(len, GFP_KERNEL); + if (WARN_ON_ONCE(!arg)) { + meta->user_mask =3D 0; + return -ENOMEM; + } + + snprintf(arg, len, "__%s_val", meta->args[idx]); + + ret =3D trace_define_field(call, "__data_loc char[]", + arg, offset, sizeof(int), 0, + FILTER_OTHER); + if (ret) + kfree(arg); return ret; } =20 +struct syscall_buf { + char *buf; +}; + +struct syscall_buf_info { + struct rcu_head rcu; + struct syscall_buf __percpu *sbuf; +}; + +/* Create a per CPU temporary buffer to copy user space pointers into */ +#define SYSCALL_FAULT_BUF_SZ 512 +static struct syscall_buf_info *syscall_buffer; + +static int syscall_fault_buffer_cnt; + +static void syscall_fault_buffer_free(struct syscall_buf_info *sinfo) +{ + char *buf; + int cpu; + + for_each_possible_cpu(cpu) { + buf =3D per_cpu_ptr(sinfo->sbuf, cpu)->buf; + kfree(buf); + } + kfree(sinfo); +} + +static void rcu_free_syscall_buffer(struct rcu_head *rcu) +{ + struct syscall_buf_info *sinfo =3D container_of(rcu, struct syscall_buf_i= nfo, rcu); + + syscall_fault_buffer_free(sinfo); +} + +/* + * The per CPU buffer syscall_fault_buffer is written to optimistically. + * The per CPU context switch count is taken, preemption is enabled, + * the copying of the user space memory is placed into the syscall_fault_b= uffer, + * preemption is disabled again and the count is read again. If the count does + * not match its previous reading, it could mean that another user space + * task scheduled in and the buffer is unreliable for use. 
+ */ +static int syscall_fault_buffer_enable(void) +{ + struct syscall_buf_info *sinfo; + char *buf; + int cpu; + + lockdep_assert_held(&syscall_trace_lock); + + if (syscall_fault_buffer_cnt++) + return 0; + + sinfo =3D kmalloc(sizeof(*sinfo), GFP_KERNEL); + if (!sinfo) + return -ENOMEM; + + sinfo->sbuf =3D alloc_percpu(struct syscall_buf); + if (!sinfo->sbuf) { + kfree(sinfo); + return -ENOMEM; + } + + /* Clear each buffer in case of error */ + for_each_possible_cpu(cpu) { + per_cpu_ptr(sinfo->sbuf, cpu)->buf =3D NULL; + } + + for_each_possible_cpu(cpu) { + buf =3D kmalloc_node(SYSCALL_FAULT_BUF_SZ, GFP_KERNEL, + cpu_to_node(cpu)); + if (!buf) { + syscall_fault_buffer_free(sinfo); + return -ENOMEM; + } + per_cpu_ptr(sinfo->sbuf, cpu)->buf =3D buf; + } + + WRITE_ONCE(syscall_buffer, sinfo); + return 0; +} + +static void syscall_fault_buffer_disable(void) +{ + struct syscall_buf_info *sinfo =3D syscall_buffer; + + lockdep_assert_held(&syscall_trace_lock); + + if (--syscall_fault_buffer_cnt) + return; + + WRITE_ONCE(syscall_buffer, NULL); + call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer); +} + +static char *sys_fault_user(struct syscall_metadata *sys_data, struct sysc= all_buf_info *sinfo, + unsigned long *args, unsigned int *data_size) +{ + int cpu =3D smp_processor_id(); + char *buf =3D per_cpu_ptr(sinfo->sbuf, cpu)->buf; + unsigned long size =3D SYSCALL_FAULT_BUF_SZ - 1; + unsigned long mask =3D sys_data->user_mask; + unsigned int cnt; + int idx =3D ffs(mask) - 1; + char *ptr; + int trys =3D 0; + int ret; + + /* Get the pointer to user space memory to read */ + ptr =3D (char *)args[idx]; + *data_size =3D 0; + + /* + * This acts similar to a seqcount. The per CPU context switches are + * recorded, migration is disabled and preemption is enabled. The + * read of the user space memory is copied into the per CPU buffer. 
+ * Preemption is disabled again, and if the per CPU context switches count + * is still the same, it means the buffer has not been corrupted. + * If the count is different, it is assumed the buffer is corrupted + * and reading must be tried again. + */ + again: + /* + * If for some reason, strncpy_from_user() always causes a context + * switch, this would then cause an infinite loop. + * If this task is preempted by another user space task, it + * will cause this task to try again. But just in case something + * changes where the copying from user space causes another task + * to run, prevent this from going into an infinite loop. + * 10 tries should be plenty. + */ + if (trys++ > 10) { + static bool once; + /* + * Only print a message instead of a WARN_ON() as this could + * theoretically trigger under real load. + */ + if (!once) + pr_warn("Error: Too many tries to read syscall %s\n", sys_data->name); + once =3D true; + return buf; + } + + /* Read the current CPU context switch counter */ + cnt =3D nr_context_switches_cpu(cpu); + + /* + * Preemption is going to be enabled, but this task must + * remain on this CPU. + */ + migrate_disable(); + + /* + * Now preemption is being enabled and another task can come in + * and use the same buffer and corrupt our data. + */ + preempt_enable_notrace(); + + ret =3D strncpy_from_user(buf, ptr, size); + + preempt_disable_notrace(); + migrate_enable(); + + /* If it faulted, no use to try again */ + if (ret < 0) + return buf; + + /* + * Preemption is disabled again, now check the per CPU context + * switch counter. If it doesn't match, then another user space + * process may have scheduled in and corrupted our buffer. In that + * case the copying must be retried. + */ + if (nr_context_switches_cpu(cpu) !=3D cnt) + goto again; + + /* Replace any non-printable characters with '.' */ + for (int i =3D 0; i < ret; i++) { + if (!isprint(buf[i])) + buf[i] =3D '.'; + } + + /* + * If the text was truncated due to our max limit, add "..." 
to + * the string. + */ + if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) { + strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA), + EXTRA, sizeof(EXTRA)); + ret =3D SYSCALL_FAULT_BUF_SZ; + } else { + buf[ret++] =3D '\0'; + } + + *data_size =3D ret; + return buf; +} + static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id) { struct trace_array *tr =3D data; @@ -302,15 +560,17 @@ static void ftrace_syscall_enter(void *data, struct p= t_regs *regs, long id) struct syscall_metadata *sys_data; struct trace_event_buffer fbuffer; unsigned long args[6]; + char *user_ptr; + int user_size =3D 0; int syscall_nr; - int size; + int size =3D 0; + bool mayfault; =20 /* * Syscall probe called with preemption enabled, but the ring * buffer and per-cpu data require preemption to be disabled. */ might_fault(); - guard(preempt_notrace)(); =20 syscall_nr =3D trace_get_syscall_nr(current, regs); if (syscall_nr < 0 || syscall_nr >=3D NR_syscalls) @@ -327,7 +587,32 @@ static void ftrace_syscall_enter(void *data, struct pt= _regs *regs, long id) if (!sys_data) return; =20 - size =3D sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args; + /* Check if this syscall event faults in user space memory */ + mayfault =3D sys_data->user_mask !=3D 0; + + guard(preempt_notrace)(); + + syscall_get_arguments(current, regs, args); + + if (mayfault) { + struct syscall_buf_info *sinfo; + + /* If the syscall_buffer is NULL, tracing is being shutdown */ + sinfo =3D READ_ONCE(syscall_buffer); + if (!sinfo) + return; + + user_ptr =3D sys_fault_user(sys_data, sinfo, args, &user_size); + /* + * user_size is the amount of data to append. + * Need to add 4 for the meta field that points to + * the user memory at the end of the event and also + * stores its size. 
+ */ + size =3D 4 + user_size; + } + + size +=3D sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args; =20 entry =3D trace_event_buffer_reserve(&fbuffer, trace_file, size); if (!entry) @@ -335,9 +620,36 @@ static void ftrace_syscall_enter(void *data, struct pt= _regs *regs, long id) =20 entry =3D ring_buffer_event_data(fbuffer.event); entry->nr =3D syscall_nr; - syscall_get_arguments(current, regs, args); + memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args); =20 + if (mayfault) { + void *ptr; + int val; + + /* + * Set the pointer to point to the meta data of the event + * that has information about the stored user space memory. + */ + ptr =3D (void *)entry->args + sizeof(unsigned long) * sys_data->nb_args; + + /* + * The meta data will store the offset of the user data from + * the beginning of the event. + */ + val =3D (ptr - (void *)entry) + 4; + + /* Store the offset and the size into the meta data */ + *(int *)ptr =3D val | (user_size << 16); + + /* Nothing to do if the user space was empty or faulted */ + if (user_size) { + /* Now store the user space data into the event */ + ptr +=3D 4; + memcpy(ptr, user_ptr, user_size); + } + } + trace_event_buffer_commit(&fbuffer); } =20 @@ -386,39 +698,50 @@ static void ftrace_syscall_exit(void *data, struct pt= _regs *regs, long ret) static int reg_event_syscall_enter(struct trace_event_file *file, struct trace_event_call *call) { + struct syscall_metadata *sys_data =3D call->data; struct trace_array *tr =3D file->tr; int ret =3D 0; int num; =20 - num =3D ((struct syscall_metadata *)call->data)->syscall_nr; + num =3D sys_data->syscall_nr; if (WARN_ON_ONCE(num < 0 || num >=3D NR_syscalls)) return -ENOSYS; - mutex_lock(&syscall_trace_lock); - if (!tr->sys_refcount_enter) + guard(mutex)(&syscall_trace_lock); + if (sys_data->user_mask) { + ret =3D syscall_fault_buffer_enable(); + if (ret) + return ret; + } + if (!tr->sys_refcount_enter) { ret =3D register_trace_sys_enter(ftrace_syscall_enter, tr); - 
if (!ret) { - WRITE_ONCE(tr->enter_syscall_files[num], file); - tr->sys_refcount_enter++; + if (ret < 0) { + if (sys_data->user_mask) + syscall_fault_buffer_disable(); + return ret; + } } - mutex_unlock(&syscall_trace_lock); - return ret; + WRITE_ONCE(tr->enter_syscall_files[num], file); + tr->sys_refcount_enter++; + return 0; } =20 static void unreg_event_syscall_enter(struct trace_event_file *file, struct trace_event_call *call) { + struct syscall_metadata *sys_data =3D call->data; struct trace_array *tr =3D file->tr; int num; =20 - num =3D ((struct syscall_metadata *)call->data)->syscall_nr; + num =3D sys_data->syscall_nr; if (WARN_ON_ONCE(num < 0 || num >=3D NR_syscalls)) return; - mutex_lock(&syscall_trace_lock); + guard(mutex)(&syscall_trace_lock); tr->sys_refcount_enter--; WRITE_ONCE(tr->enter_syscall_files[num], NULL); if (!tr->sys_refcount_enter) unregister_trace_sys_enter(ftrace_syscall_enter, tr); - mutex_unlock(&syscall_trace_lock); + if (sys_data->user_mask) + syscall_fault_buffer_disable(); } =20 static int reg_event_syscall_exit(struct trace_event_file *file, @@ -459,6 +782,123 @@ static void unreg_event_syscall_exit(struct trace_eve= nt_file *file, mutex_unlock(&syscall_trace_lock); } =20 +/* + * For system calls that reference user space memory that can + * be recorded into the event, set the system call meta data's user_mask + * to the "args" index that points to the user space memory to retrieve. + */ +static void check_faultable_syscall(struct trace_event_call *call, int nr) +{ + struct syscall_metadata *sys_data =3D call->data; + + /* Only work on entry */ + if (sys_data->enter_event !=3D call) + return; + + switch (nr) { + /* user arg at position 0 */ + case __NR_access: + case __NR_acct: + case __NR_add_key: /* Just _type. 
TODO add _description */ + case __NR_chdir: + case __NR_chown: + case __NR_chmod: + case __NR_chroot: + case __NR_creat: + case __NR_delete_module: + case __NR_execve: + case __NR_fsopen: + case __NR_getxattr: /* Just pathname, TODO add name */ + case __NR_lchown: + case __NR_lgetxattr: /* Just pathname, TODO add name */ + case __NR_lremovexattr: /* Just pathname, TODO add name */ + case __NR_link: /* Just oldname. TODO add newname */ + case __NR_listxattr: /* Just pathname, TODO add list */ + case __NR_llistxattr: /* Just pathname, TODO add list */ + case __NR_lsetxattr: /* Just pathname, TODO add list */ + case __NR_open: + case __NR_memfd_create: + case __NR_mount: /* Just dev_name, TODO add dir_name and type */ + case __NR_mkdir: + case __NR_mknod: + case __NR_mq_open: + case __NR_mq_unlink: + case __NR_pivot_root: /* Just new_root, TODO add old_root */ + case __NR_readlink: + case __NR_removexattr: /* Just pathname, TODO add name */ + case __NR_rename: /* Just oldname. TODO add newname */ + case __NR_request_key: /* Just _type. TODO add _description */ + case __NR_rmdir: + case __NR_setxattr: /* Just pathname, TODO add list */ + case __NR_shmdt: + case __NR_statfs: + case __NR_swapon: + case __NR_swapoff: + case __NR_symlink: /* Just oldname. TODO add newname */ + case __NR_truncate: + case __NR_unlink: + case __NR_umount2: + case __NR_utime: + case __NR_utimes: + sys_data->user_mask =3D BIT(0); + break; + /* user arg at position 1 */ + case __NR_execveat: + case __NR_faccessat: + case __NR_faccessat2: + case __NR_finit_module: + case __NR_fchmodat: + case __NR_fchmodat2: + case __NR_fchownat: + case __NR_fgetxattr: + case __NR_flistxattr: + case __NR_fsetxattr: + case __NR_fspick: + case __NR_fremovexattr: + case __NR_futimesat: + case __NR_getxattrat: /* Just pathname, TODO add name */ + case __NR_inotify_add_watch: + case __NR_linkat: /* Just oldname. 
TODO add newname */ + case __NR_listxattrat: /* Just pathname, TODO add list */ + case __NR_mkdirat: + case __NR_mknodat: + case __NR_mount_setattr: + case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */ + case __NR_name_to_handle_at: +#if defined(__ARCH_WANT_NEW_STAT) || defined(__ARCH_WANT_STAT64) + case __NR_newfstatat: +#endif + case __NR_openat: + case __NR_openat2: + case __NR_open_tree: + case __NR_open_tree_attr: + case __NR_readlinkat: + case __NR_renameat: /* Just oldname. TODO add newname */ + case __NR_renameat2: /* Just oldname. TODO add newname */ + case __NR_removexattrat: /* Just pathname, TODO add name */ + case __NR_quotactl: + case __NR_setxattrat: /* Just pathname, TODO add list */ + case __NR_syslog: + case __NR_symlinkat: /* Just oldname. TODO add newname */ + case __NR_statx: + case __NR_unlinkat: + case __NR_utimensat: + sys_data->user_mask =3D BIT(1); + break; + /* user arg at position 2 */ + case __NR_init_module: + case __NR_fsconfig: + sys_data->user_mask =3D BIT(2); + break; + /* user arg at position 4 */ + case __NR_fanotify_mark: + sys_data->user_mask =3D BIT(4); + break; + default: + sys_data->user_mask =3D 0; + } +} + static int __init init_syscall_trace(struct trace_event_call *call) { int id; @@ -471,6 +911,8 @@ static int __init init_syscall_trace(struct trace_event= _call *call) return -ENOSYS; } =20 + check_faultable_syscall(call, num); + if (set_syscall_print_fmt(call) < 0) return -ENOMEM; =20 --=20 2.50.1 From nobody Thu Oct 2 03:28:38 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C53D7324B06; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632754; 
cv=none; b=S8fxYD2CWLCIt4qekP41ndoBpdteUPRUytAlB0nbuNO1vMugGAB1r5EBK1OOKqMinLTHg5jesegp6AuDw262MABOKH4q8ys3vG/AKfamGGWSSkzPnIPF/Wojfw/vMVxaebjeZrkA7kyCpwnZHO6qW2z4SpXqF0tQvMXZOLBdGAE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632754; c=relaxed/simple; bh=VKA2yJYweWElPnAZodk8sAgNvirHzx1QCF7wG5bI2Zg=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=dlQ5DZ5tiwJirHTZw5jgcOF3tCJZ/s9yc+psZHZZPUtG7kHZBiwOcD0r1+zIZAO1YssAAD9BSP8JD6cVoX5Kbnd5L66ZEK1nw1eTaKhQuh+JCCsD8d9/GzpU6JTtW7Y/PK+EcveHdo6uL7fSD2JxirfNtl8Pwk1Og8Bo+FJfbyA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=bwwiMQvu; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="bwwiMQvu" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 6B448C19422; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1758632754; bh=VKA2yJYweWElPnAZodk8sAgNvirHzx1QCF7wG5bI2Zg=; h=Date:From:To:Cc:Subject:References:From; b=bwwiMQvuREY68BR2iJn9BeuDEub6TKUks4tCZWUyXcbinByckr1DX8iJ5b09YyktF okK/mlYgsa4hJtrRflGjcyC8XYeyOWe9BI+pTAZB25P/67QtWGNCvhPc123Yr71adI pqei9oUTCzIXMENlaQNm3TkbHKwSmFYq9s6bRJ9kefKO1kMFdEs16aHxlckxWHGP+c rW3f4xJ8Qy7p+a+38aFz1MCFPsxQa+r8K13U0epyvIPArhncZVaIOnddG/63sUYHq9 x+3KCu7Njb6DdP315BVtddBb/zosKai2FgOKIQacfRbgQK2+PpsZorc4H899hsJQ0B Yz0pWmDGjhK9Q== Received: from rostedt by gandalf with local (Exim 4.98.2) (envelope-from ) id 1v12ji-0000000Coqd-11hK; Tue, 23 Sep 2025 09:07:14 -0400 Message-ID: <20250923130714.101202613@kernel.org> User-Agent: quilt/0.68 Date: Tue, 23 Sep 2025 09:05:01 -0400 From: Steven Rostedt To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Cc: Masami Hiramatsu , Mark Rutland , Mathieu Desnoyers , Andrew Morton , Peter Zijlstra , Namhyung Kim 
, Takaya Saeki , Tom Zanussi , Thomas Gleixner , Ian Rogers , Douglas Raillard Subject: [PATCH v2 4/8] tracing: Have system call events record user array data References: <20250923130457.901085554@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Steven Rostedt For system call events that have a length field, add a "user_arg_size" parameter to the system call meta data that denotes the index of the args array that holds the size of arg that the user_mask field has a bit set for. The "user_mask" has a bit set that denotes the arg that points to an array in the user space address space and if a system call event has the user_mask field set and the user_arg_size set, it will then record the content of that address into the trace event, up to the size defined by SYSCALL_FAULT_BUF_SZ - 1. This allows the output to look like: sys_write(fd: 0xa, buf: 0x5646978d13c0 (01:00:05:00:00:00:00:00:01:87:55:= 89:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00), count: 0x2= 0) Signed-off-by: Steven Rostedt (Google) --- include/trace/syscall.h | 4 +- kernel/trace/trace_syscalls.c | 111 +++++++++++++++++++++++++--------- 2 files changed, 86 insertions(+), 29 deletions(-) diff --git a/include/trace/syscall.h b/include/trace/syscall.h index 85f21ca15a41..9413c139da66 100644 --- a/include/trace/syscall.h +++ b/include/trace/syscall.h @@ -16,6 +16,7 @@ * @name: name of the syscall * @syscall_nr: number of the syscall * @nb_args: number of parameters it takes + * @user_arg_size: holds @arg that has size of the user space to read * @user_mask: mask of @args that will read user space * @types: list of types as strings * @args: list of args as strings (args[i] matches types[i]) @@ -26,7 +27,8 @@ struct syscall_metadata { const char *name; int syscall_nr; - short nb_args; + u8 nb_args; + s8 user_arg_size; 
short user_mask; const char **types; const char **args; diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 7ea763c07bb7..7658b592c55f 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -124,7 +124,7 @@ const char *get_syscall_name(int syscall) return entry->name; } =20 -/* Added to user strings when max limit is reached */ +/* Added to user strings or arrays when max limit is reached */ #define EXTRA "..." =20 static enum print_line_t @@ -136,9 +136,8 @@ print_syscall_enter(struct trace_iterator *iter, int fl= ags, struct trace_entry *ent =3D iter->ent; struct syscall_trace_enter *trace; struct syscall_metadata *entry; - int i, syscall, val; + int i, syscall, val, len; unsigned char *ptr; - int len; =20 trace =3D (typeof(trace))ent; syscall =3D trace->nr; @@ -185,7 +184,23 @@ print_syscall_enter(struct trace_iterator *iter, int f= lags, ptr =3D (void *)ent + (val & 0xffff); len =3D val >> 16; =20 - trace_seq_printf(s, " \"%.*s\"", len, ptr); + if (entry->user_arg_size < 0) { + trace_seq_printf(s, " \"%.*s\"", len, ptr); + continue; + } + + val =3D trace->args[entry->user_arg_size]; + + trace_seq_puts(s, " ("); + for (int x =3D 0; x < len; x++, ptr++) { + if (x) + trace_seq_putc(s, ':'); + trace_seq_printf(s, "%02x", *ptr); + } + if (len < val) + trace_seq_printf(s, ", %s", EXTRA); + + trace_seq_putc(s, ')'); } =20 trace_seq_putc(s, ')'); @@ -250,8 +265,11 @@ __set_enter_print_fmt(struct syscall_metadata *entry, = char *buf, int len) if (!(BIT(i) & entry->user_mask)) continue; =20 - /* Add the format for the user space string */ - pos +=3D snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\""); + /* Add the format for the user space string or array */ + if (entry->user_arg_size < 0) + pos +=3D snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\""); + else + pos +=3D snprintf(buf + pos, LEN_OR_ZERO, " (%%s)"); } pos +=3D snprintf(buf + pos, LEN_OR_ZERO, "\""); =20 @@ -260,9 +278,14 @@ __set_enter_print_fmt(struct 
syscall_metadata *entry, = char *buf, int len) ", ((unsigned long)(REC->%s))", entry->args[i]); if (!(BIT(i) & entry->user_mask)) continue; - /* The user space string for arg has name __<arg>_val */ - pos +=3D snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)", - entry->args[i]); + /* The user space data for arg has name __<arg>_val */ + if (entry->user_arg_size < 0) { + pos +=3D snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)", + entry->args[i]); + } else { + pos +=3D snprintf(buf + pos, LEN_OR_ZERO, ", __print_dynamic_array(__%s= _val, 1)", + entry->args[i]); + } } =20 #undef LEN_OR_ZERO @@ -333,9 +356,9 @@ static int __init syscall_enter_define_fields(struct tr= ace_event_call *call) idx =3D ffs(mask) - 1; =20 /* - * User space strings are faulted into a temporary buffer and then - * added as a dynamic string to the end of the event. - * The user space string name for the arg pointer is "__<arg>_val". + * User space data is faulted into a temporary buffer and then + * added as a dynamic string or array to the end of the event. + * The user space data name for the arg pointer is "__<arg>_val". */ len =3D strlen(meta->args[idx]) + sizeof("___val"); arg =3D kmalloc(len, GFP_KERNEL); @@ -458,6 +481,7 @@ static char *sys_fault_user(struct syscall_metadata *sy= s_data, struct syscall_bu unsigned long mask =3D sys_data->user_mask; unsigned int cnt; int idx =3D ffs(mask) - 1; + bool array =3D false; char *ptr; int trys =3D 0; int ret; @@ -500,6 +524,18 @@ static char *sys_fault_user(struct syscall_metadata *s= ys_data, struct syscall_bu /* Read the current CPU context switch counter */ cnt =3D nr_context_switches_cpu(cpu); =20 + /* + * If this system call event has a size argument, use + * it to define how much of user space memory to read, + * and read it as an array and not a string.
+ */ + if (sys_data->user_arg_size >=3D 0) { + array =3D true; + size =3D args[sys_data->user_arg_size]; + if (size > SYSCALL_FAULT_BUF_SZ - 1) + size =3D SYSCALL_FAULT_BUF_SZ - 1; + } + /* * Preemption is going to be enabled, but this task must * remain on this CPU. @@ -512,7 +548,12 @@ static char *sys_fault_user(struct syscall_metadata *s= ys_data, struct syscall_bu */ preempt_enable_notrace(); =20 - ret =3D strncpy_from_user(buf, ptr, size); + if (array) { + ret =3D __copy_from_user(buf, ptr, size); + ret =3D ret ? -1 : size; + } else { + ret =3D strncpy_from_user(buf, ptr, size); + } =20 preempt_disable_notrace(); migrate_enable(); @@ -530,22 +571,24 @@ static char *sys_fault_user(struct syscall_metadata *= sys_data, struct syscall_bu if (nr_context_switches_cpu(cpu) !=3D cnt) goto again; =20 - /* Replace any non-printable characters with '.' */ - for (int i =3D 0; i < ret; i++) { - if (!isprint(buf[i])) - buf[i] =3D '.'; - } + /* For strings, replace any non-printable characters with '.' */ + if (!array) { + for (int i =3D 0; i < ret; i++) { + if (!isprint(buf[i])) + buf[i] =3D '.'; + } =20 - /* - * If the text was truncated due to our max limit, add "..." to - * the string. - */ - if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) { - strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA), - EXTRA, sizeof(EXTRA)); - ret =3D SYSCALL_FAULT_BUF_SZ; - } else { - buf[ret++] =3D '\0'; + /* + * If the text was truncated due to our max limit, add "..." to + * the string. 
+ */ + if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) { + strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA), + EXTRA, sizeof(EXTRA)); + ret =3D SYSCALL_FAULT_BUF_SZ; + } else { + buf[ret++] =3D '\0'; + } } =20 *data_size =3D ret; @@ -642,6 +685,9 @@ static void ftrace_syscall_enter(void *data, struct pt_= regs *regs, long id) /* Store the offset and the size into the meta data */ *(int *)ptr =3D val | (user_size << 16); =20 + if (WARN_ON_ONCE((ptr - (void *)entry + user_size) > size)) + user_size =3D 0; + /* Nothing to do if the user space was empty or faulted */ if (user_size) { /* Now store the user space data into the event */ @@ -795,7 +841,16 @@ static void check_faultable_syscall(struct trace_event= _call *call, int nr) if (sys_data->enter_event !=3D call) return; =20 + sys_data->user_arg_size =3D -1; + switch (nr) { + /* user arg 1 with size arg at 2 */ + case __NR_write: + case __NR_mq_timedsend: + case __NR_pwrite64: + sys_data->user_mask =3D BIT(1); + sys_data->user_arg_size =3D 2; + break; /* user arg at position 0 */ case __NR_access: case __NR_acct: --=20 2.50.1 From nobody Thu Oct 2 03:28:38 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 048AF324B20; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632755; cv=none; b=ZYnbJ6yJGv+shKFaOdtFwv5+2JeV7cl1Hu0GYJ/ueR+BDBu2x4n2IN+qrZkZhGqw4ZhIds7pSL0ORpVHz6d39UmMQCSlGh2uxzV1bdinl47hDlj1rxbr8Viyb318VJjydzoVnVGhrovimkweniObfky0IWDsEEmC3PuASlEMBBE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632755; c=relaxed/simple; bh=RYaT0rLWqlos3+8ZsCiH/EtkBxlq5yK6xADm0AiDDkA=; 
h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=p16nUBI2TWXC2foNouVnemw1nBH1hioeu14c5BM7TE2VAnlEwvSfT6X124P5c9QRHPhaoTGRSG0cEUl9U1BfDvNhr8wpae6k0VQBus83408HRHm0TkDTDz01nWn9u5VPswd3RED6UBSefLHzt5YUOdGxKHAvadhcPEA2HVdqTSw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Swed5O5o; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Swed5O5o" Received: by smtp.kernel.org (Postfix) with ESMTPSA id A5384C19424; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1758632754; bh=RYaT0rLWqlos3+8ZsCiH/EtkBxlq5yK6xADm0AiDDkA=; h=Date:From:To:Cc:Subject:References:From; b=Swed5O5o+NQiXwmws1QwYO32bREsbCoFhwwljzcNaHBw+X910M+h22I018olPDveN aE/xJWNPWauKtKl6aDhPhhqJfoCqVeyBPqEAYk70N2MgZUtY/P05ocWNC6IJnXYf8t LaKW3e5QEa/K+kUr74MElLgzCevr+FXG6iw5vBI+ZXpC3heJwmfSUC8AhhZdjK1H2r 0zu80dafgznEEArsePHiuAirYgAw+CMduOZu+CorzLzHD+bvSWgDfRTg1bw3PGl8Gs b6MsPcSNsJl/rGzv9w3c98/sEke2wZdaxT0bagwNvRiLZFNRFiqcts1r5U1XcpKgX2 Tc0pCVj6pVi7A== Received: from rostedt by gandalf with local (Exim 4.98.2) (envelope-from ) id 1v12ji-0000000Cor7-1j0i; Tue, 23 Sep 2025 09:07:14 -0400 Message-ID: <20250923130714.265621062@kernel.org> User-Agent: quilt/0.68 Date: Tue, 23 Sep 2025 09:05:02 -0400 From: Steven Rostedt To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Cc: Masami Hiramatsu , Mark Rutland , Mathieu Desnoyers , Andrew Morton , Peter Zijlstra , Namhyung Kim , Takaya Saeki , Tom Zanussi , Thomas Gleixner , Ian Rogers , Douglas Raillard Subject: [PATCH v2 5/8] tracing: Display some syscall arrays as strings References: <20250923130457.901085554@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 
Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Steven Rostedt Some of the system calls that read a fixed length of memory from the user space address are not arrays but strings. Take a bit away from the nb_args field in the syscall meta data to use as a flag to denote that the system call's user_arg_size is being used as a string. The nb_args should never be more than 6, so 7 bits is plenty to hold that number. Add the user_arg_is_str flag that, when set, will display the data array from the user space address as a string and not an array. This will allow the output to look like this: sys_sethostname(name: 0x5584310eb2a0 "debian", len: 6) Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/20250805193235.416382557@kernel.o= rg - Hide kexec_file_load around #if defined(__ARCH_WANT_TIME32_SYSCALLS) || __BITS_PER_LONG !=3D 32 to not break the i386 build. include/trace/syscall.h | 4 +++- kernel/trace/trace_syscalls.c | 22 +++++++++++++++++++--- 2 files changed, 22 insertions(+), 4 deletions(-) diff --git a/include/trace/syscall.h b/include/trace/syscall.h index 9413c139da66..0dd7f2b33431 100644 --- a/include/trace/syscall.h +++ b/include/trace/syscall.h @@ -16,6 +16,7 @@ * @name: name of the syscall * @syscall_nr: number of the syscall * @nb_args: number of parameters it takes + * @user_arg_is_str: set if the arg for @user_arg_size is a string * @user_arg_size: holds @arg that has size of the user space to read * @user_mask: mask of @args that will read user space * @types: list of types as strings @@ -27,7 +28,8 @@ struct syscall_metadata { const char *name; int syscall_nr; - u8 nb_args; + u8 nb_args:7; + u8 user_arg_is_str:1; s8 user_arg_size; short user_mask; const char **types; diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 7658b592c55f..64be38cf790d 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -184,7 +184,7 @@
print_syscall_enter(struct trace_iterator *iter, int fl= ags, ptr =3D (void *)ent + (val & 0xffff); len =3D val >> 16; =20 - if (entry->user_arg_size < 0) { + if (entry->user_arg_size < 0 || entry->user_arg_is_str) { trace_seq_printf(s, " \"%.*s\"", len, ptr); continue; } @@ -249,6 +249,7 @@ print_syscall_exit(struct trace_iterator *iter, int fla= gs, static int __init __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len) { + bool is_string =3D entry->user_arg_is_str; int i; int pos =3D 0; =20 @@ -266,7 +267,7 @@ __set_enter_print_fmt(struct syscall_metadata *entry, c= har *buf, int len) continue; =20 /* Add the format for the user space string or array */ - if (entry->user_arg_size < 0) + if (entry->user_arg_size < 0 || is_string) pos +=3D snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\""); else pos +=3D snprintf(buf + pos, LEN_OR_ZERO, " (%%s)"); @@ -279,7 +280,7 @@ __set_enter_print_fmt(struct syscall_metadata *entry, c= har *buf, int len) if (!(BIT(i) & entry->user_mask)) continue; /* The user space data for arg has name __<arg>_val */ - if (entry->user_arg_size < 0) { + if (entry->user_arg_size < 0 || is_string) { pos +=3D snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)", entry->args[i]); } else { @@ -851,6 +852,21 @@ static void check_faultable_syscall(struct trace_event= _call *call, int nr) sys_data->user_mask =3D BIT(1); sys_data->user_arg_size =3D 2; break; + /* user arg 0 with size arg at 1 as string */ + case __NR_setdomainname: + case __NR_sethostname: + sys_data->user_mask =3D BIT(0); + sys_data->user_arg_size =3D 1; + sys_data->user_arg_is_str =3D 1; + break; +#if defined(__ARCH_WANT_TIME32_SYSCALLS) || __BITS_PER_LONG !=3D 32 + /* user arg 4 with size arg at 3 as string */ + case __NR_kexec_file_load: + sys_data->user_mask =3D BIT(4); + sys_data->user_arg_size =3D 3; + sys_data->user_arg_is_str =3D 1; + break; +#endif /* user arg at position 0 */ case __NR_access: case __NR_acct: --=20 2.50.1 From nobody Thu Oct 2 03:28:38 2025
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 07CE9324B22; Tue, 23 Sep 2025 13:05:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632755; cv=none; b=Cml2U/HbGUqy3LRY4F44Rh+ojszlur6zbcW3B946i0ACnyiJackDyU5o8BNatTl5Ss5ckrppXjGVnT8Bha15gZfjFTmDUIaUcKE1W13eq2+bR8rBLwJwElLyEnWM7AAf0861sBLAZswrjIUHH3UWQ4NGEENrq/DOfdKNNPgrlYM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632755; c=relaxed/simple; bh=DJm2HIBad4aWCjZmSDMKK47+J8YGcYBSbQsdAmEAIoI=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=bD3m3XM4mXP1Vx58NjCKoUK3xQFZrAGphbCEtINdER7c9r1cBS2VE96+IIbYXjzeO6yjAUjwxWtHEKOiXhf8I+DWP4rO1fV0YCucI/vlhL69ovFtXL9oLkercz8idWyn9qshPlIGzsPvSaJP+uV+n7qtS0svW7wDqsVCzOdI6m8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Ed8YQ4CN; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Ed8YQ4CN" Received: by smtp.kernel.org (Postfix) with ESMTPSA id B76B3C116C6; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1758632754; bh=DJm2HIBad4aWCjZmSDMKK47+J8YGcYBSbQsdAmEAIoI=; h=Date:From:To:Cc:Subject:References:From; b=Ed8YQ4CNUn62EmuxxezWIpJkaTZ5UV0Wq/AqG39tpXdyCH0atRylaQQGHsgQZQdf9 URbQKgjRjgomT2VtXyaix4Qf+D2OXqMhs/xxU/9Akly/sKZIpWXdeDXJzzlzM9OElw 6uByFSTHZSodkCB2hVK7QuWaI3m59Qjo50cdIk7jkC/bsJVgQHXM9cELSMT7i7wDxh Zp2vcx9S1mArAOPth7zKUKKAu2n5JZmVTj4HaHOtzrlPO9z/As9xyVtVi2ecscM6WW 
756lYlmGvVjKKFXG6avrT6Sl+uqbT63iOgCAZlhqJ4QfAaeKw1s3zeNpWwS0QXY0NG RpYze6UDGxboA== Received: from rostedt by gandalf with local (Exim 4.98.2) (envelope-from ) id 1v12ji-0000000Corb-2RCo; Tue, 23 Sep 2025 09:07:14 -0400 Message-ID: <20250923130714.432331909@kernel.org> User-Agent: quilt/0.68 Date: Tue, 23 Sep 2025 09:05:03 -0400 From: Steven Rostedt To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Cc: Masami Hiramatsu , Mark Rutland , Mathieu Desnoyers , Andrew Morton , Peter Zijlstra , Namhyung Kim , Takaya Saeki , Tom Zanussi , Thomas Gleixner , Ian Rogers , Douglas Raillard Subject: [PATCH v2 6/8] tracing: Allow syscall trace events to read more than one user parameter References: <20250923130457.901085554@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Steven Rostedt Allow more than one field of a syscall trace event to read user space. Build on top of the user_mask by allowing more than one bit to be set that corresponds to the @args array of the syscall metadata. For each argument in the @args array that is to be read, it will have a dynamic array/string field associated with it. Note that reading multiple fields from user space is not supported if the user_arg_size field is set in the syscall metadata. That field can only be used if only one field is being read from user space as that field is a number representing the size field of the syscall event that holds the size of the data to read from user space. It becomes ambiguous if the system call reads more than one field. Currently this is not an issue. If a syscall event happens to enable two fields to read user space and sets the user_arg_size field, it will trigger a warning at boot and the user_arg_size field will be cleared.
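As an illustration only (this is not code from the patch), the user space sketch below shows one way to walk the set bits of a user_mask with ffs() and to clear the size index when more than one bit is set. The names meta_sketch, MAX_USER_ARGS, collect_user_args() and sanitize_meta() are hypothetical stand-ins; the kernel code operates on struct syscall_metadata and would warn with the kernel's own helpers rather than fprintf().

```c
#include <assert.h>
#include <stdio.h>
#include <strings.h>	/* ffs() */

/* Hypothetical cap for this sketch; the patch uses SYSCALL_FAULT_MAX_CNT (3) */
#define MAX_USER_ARGS 3

/* Hypothetical stand-in for the two syscall_metadata fields discussed above */
struct meta_sketch {
	short user_mask;		/* bit N set: args[N] points to user space */
	signed char user_arg_size;	/* index of the size arg, or -1 if unused */
};

/*
 * Collect the args[] indexes selected by mask, lowest bit first.
 * Returns how many indexes were stored in idx[] (at most MAX_USER_ARGS).
 */
static int collect_user_args(unsigned long mask, int idx[MAX_USER_ARGS])
{
	int cnt = 0;

	assert(idx != NULL);

	while (mask && cnt < MAX_USER_ARGS) {
		int bit = ffs((int)mask) - 1;	/* lowest set bit, 0 based */

		mask &= ~(1UL << bit);		/* consume that bit */
		idx[cnt++] = bit;
	}
	return cnt;
}

/*
 * A size argument is only unambiguous with exactly one user pointer.
 * Returns 1 if user_arg_size had to be cleared, 0 otherwise.
 */
static int sanitize_meta(struct meta_sketch *m)
{
	unsigned long mask = (unsigned short)m->user_mask;

	/* mask & (mask - 1) is non-zero when more than one bit is set */
	if (m->user_arg_size >= 0 && (mask & (mask - 1))) {
		fprintf(stderr, "user_arg_size set with multiple user args\n");
		m->user_arg_size = -1;
		return 1;
	}
	return 0;
}
```

With a mask that has bits 1 and 3 set, collect_user_args() stores the indexes 1 and 3 in that order, and sanitize_meta() resets a user_arg_size of 2 back to -1 because the size could belong to either pointer.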
The per CPU buffer that is used to read the user space addresses is now broken up into 3 sections, each of 168 bytes. The reason for 168 is that it is the biggest portion of 512 bytes divided by 3 that is 8 byte aligned. The max amount copied into the ring buffer from user space is now only 128 bytes, which is plenty. When reading user space, it still reads 167 (168-1) bytes and uses the remaining to know if it should append the extra "..." to the end or not. This will allow the event to look like this: sys_renameat2(olddfd: 0xffffff9c, oldname: 0x7ffe02facdff "/tmp/x", newdf= d: 0xffffff9c, newname: 0x7ffe02face06 "/tmp/y", flags: 1) Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/20250805193235.582013098@kernel.o= rg - Added __user annotation to variable copying from user (kernel test robot) kernel/trace/trace_syscalls.c | 312 ++++++++++++++++++++++------------ 1 file changed, 207 insertions(+), 105 deletions(-) diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 64be38cf790d..b602c9a7dbd8 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -138,6 +138,7 @@ print_syscall_enter(struct trace_iterator *iter, int fl= ags, struct syscall_metadata *entry; int i, syscall, val, len; unsigned char *ptr; + int offset =3D 0; =20 trace =3D (typeof(trace))ent; syscall =3D trace->nr; @@ -177,12 +178,13 @@ print_syscall_enter(struct trace_iterator *iter, int = flags, continue; =20 /* This arg points to a user space string */ - ptr =3D (void *)trace->args + sizeof(long) * entry->nb_args; + ptr =3D (void *)trace->args + sizeof(long) * entry->nb_args + offset; val =3D *(int *)ptr; =20 /* The value is a dynamic string (len << 16 | offset) */ ptr =3D (void *)ent + (val & 0xffff); len =3D val >> 16; + offset +=3D 4; =20 if (entry->user_arg_size < 0 || entry->user_arg_is_str) { trace_seq_printf(s, " \"%.*s\"", len, ptr); @@ -335,7 +337,6 @@ static int __init 
syscall_enter_define_fields(struct tr= ace_event_call *call) unsigned long mask; char *arg; int offset =3D offsetof(typeof(trace), args); - int idx; int ret =3D 0; int len; int i; @@ -354,27 +355,35 @@ static int __init syscall_enter_define_fields(struct = trace_event_call *call) return ret; =20 mask =3D meta->user_mask; - idx =3D ffs(mask) - 1; =20 - /* - * User space data is faulted into a temporary buffer and then - * added as a dynamic string or array to the end of the event. - * The user space data name for the arg pointer is "__<arg>_val". - */ - len =3D strlen(meta->args[idx]) + sizeof("___val"); - arg =3D kmalloc(len, GFP_KERNEL); - if (WARN_ON_ONCE(!arg)) { - meta->user_mask =3D 0; - return -ENOMEM; - } + while (mask) { + int idx =3D ffs(mask) - 1; + mask &=3D ~BIT(idx); + + /* + * User space data is faulted into a temporary buffer and then + * added as a dynamic string or array to the end of the event. + * The user space data name for the arg pointer is + * "__<arg>_val". + */ + len =3D strlen(meta->args[idx]) + sizeof("___val"); + arg =3D kmalloc(len, GFP_KERNEL); + if (WARN_ON_ONCE(!arg)) { + meta->user_mask =3D 0; + return -ENOMEM; + } =20 - snprintf(arg, len, "__%s_val", meta->args[idx]); + snprintf(arg, len, "__%s_val", meta->args[idx]); =20 - ret =3D trace_define_field(call, "__data_loc char[]", - arg, offset, sizeof(int), 0, - FILTER_OTHER); - if (ret) - kfree(arg); + ret =3D trace_define_field(call, "__data_loc char[]", + arg, offset, sizeof(int), 0, + FILTER_OTHER); + if (ret) { + kfree(arg); + break; + } + offset +=3D 4; + } return ret; } =20 @@ -387,8 +396,25 @@ struct syscall_buf_info { struct syscall_buf __percpu *sbuf; }; =20 -/* Create a per CPU temporary buffer to copy user space pointers into */ +/* + * Create a per CPU temporary buffer to copy user space pointers into. + * + * SYSCALL_FAULT_BUF_SZ holds the size of the per CPU buffer to use + * to copy memory from user space addresses into.
+ * + * SYSCALL_FAULT_ARG_SZ is the amount to copy from user space. + * + * SYSCALL_FAULT_USER_MAX is the amount to copy into the ring buffer. + * It's slightly smaller than SYSCALL_FAULT_ARG_SZ to know if it + * needs to append the EXTRA or not. + * + * This only allows up to 3 args from system calls. + */ #define SYSCALL_FAULT_BUF_SZ 512 +#define SYSCALL_FAULT_ARG_SZ 168 +#define SYSCALL_FAULT_USER_MAX 128 +#define SYSCALL_FAULT_MAX_CNT 3 + static struct syscall_buf_info *syscall_buffer; =20 static int syscall_fault_buffer_cnt; @@ -473,23 +499,58 @@ static void syscall_fault_buffer_disable(void) call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer); } =20 -static char *sys_fault_user(struct syscall_metadata *sys_data, struct sysc= all_buf_info *sinfo, - unsigned long *args, unsigned int *data_size) +static char *sys_fault_user(struct syscall_metadata *sys_data, + struct syscall_buf_info *sinfo, + unsigned long *args, + unsigned int data_size[SYSCALL_FAULT_MAX_CNT]) { int cpu =3D smp_processor_id(); - char *buf =3D per_cpu_ptr(sinfo->sbuf, cpu)->buf; - unsigned long size =3D SYSCALL_FAULT_BUF_SZ - 1; + char *buffer =3D per_cpu_ptr(sinfo->sbuf, cpu)->buf; unsigned long mask =3D sys_data->user_mask; + unsigned long size =3D SYSCALL_FAULT_ARG_SZ - 1; unsigned int cnt; - int idx =3D ffs(mask) - 1; bool array =3D false; - char *ptr; + char *ptr_array[SYSCALL_FAULT_MAX_CNT]; + char *buf; + int read[SYSCALL_FAULT_MAX_CNT]; int trys =3D 0; + int uargs; int ret; + int i =3D 0; + + /* The extra is appended to the user data in the buffer */ + BUILD_BUG_ON(SYSCALL_FAULT_USER_MAX + sizeof(EXTRA) >=3D + SYSCALL_FAULT_ARG_SZ); + + /* + * If this system call event has a size argument, use + * it to define how much of user space memory to read, + * and read it as an array and not a string. 
+ */ + if (sys_data->user_arg_size >=3D 0) { + array =3D true; + size =3D args[sys_data->user_arg_size]; + if (size > SYSCALL_FAULT_ARG_SZ - 1) + size =3D SYSCALL_FAULT_ARG_SZ - 1; + } + + while (mask) { + int idx =3D ffs(mask) - 1; + mask &=3D ~BIT(idx); + + if (WARN_ON_ONCE(i =3D=3D SYSCALL_FAULT_MAX_CNT)) + break; + + /* Get the pointer to user space memory to read */ + ptr_array[i++] =3D (char *)args[idx]; + } =20 - /* Get the pointer to user space memory to read */ - ptr =3D (char *)args[idx]; - *data_size =3D 0; + uargs =3D i; + + /* Clear the values that are not used */ + for (; i < SYSCALL_FAULT_MAX_CNT; i++) { + data_size[i] =3D -1; /* Denotes no pointer */ + } =20 /* * This acts similar to a seqcount. The per CPU context switches are @@ -519,24 +580,12 @@ static char *sys_fault_user(struct syscall_metadata *= sys_data, struct syscall_bu if (!once) pr_warn("Error: Too many tries to read syscall %s\n", sys_data->name); once =3D true; - return buf; + return buffer; } =20 /* Read the current CPU context switch counter */ cnt =3D nr_context_switches_cpu(cpu); =20 - /* - * If this system call event has a size argument, use - * it to define how much of user space memory to read, - * and read it as an array and not a string. - */ - if (sys_data->user_arg_size >=3D 0) { - array =3D true; - size =3D args[sys_data->user_arg_size]; - if (size > SYSCALL_FAULT_BUF_SZ - 1) - size =3D SYSCALL_FAULT_BUF_SZ - 1; - } - /* * Preemption is going to be enabled, but this task must * remain on this CPU. @@ -549,20 +598,23 @@ static char *sys_fault_user(struct syscall_metadata *= sys_data, struct syscall_bu */ preempt_enable_notrace(); =20 - if (array) { - ret =3D __copy_from_user(buf, ptr, size); - ret =3D ret ? 
-1 : size; - } else { - ret =3D strncpy_from_user(buf, ptr, size); + buf =3D buffer; + + for (i =3D 0; i < uargs; i++, buf +=3D SYSCALL_FAULT_ARG_SZ) { + char __user *ptr =3D (char __user *)ptr_array[i]; + + if (array) { + ret =3D __copy_from_user(buf, ptr, size); + ret =3D ret ? -1 : size; + } else { + ret =3D strncpy_from_user(buf, ptr, size); + } + read[i] =3D ret; } =20 preempt_disable_notrace(); migrate_enable(); =20 - /* If it faulted, no use to try again */ - if (ret < 0) - return buf; - /* * Preemption is disabled again, now check the per CPU context * switch counter. If it doesn't match, then another user space @@ -572,28 +624,39 @@ static char *sys_fault_user(struct syscall_metadata *= sys_data, struct syscall_bu if (nr_context_switches_cpu(cpu) !=3D cnt) goto again; =20 - /* For strings, replace any non-printable characters with '.' */ - if (!array) { - for (int i =3D 0; i < ret; i++) { - if (!isprint(buf[i])) - buf[i] =3D '.'; - } + buf =3D buffer; + for (i =3D 0; i < uargs; i++, buf +=3D SYSCALL_FAULT_ARG_SZ) { =20 - /* - * If the text was truncated due to our max limit, add "..." to - * the string. - */ - if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) { - strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA), - EXTRA, sizeof(EXTRA)); - ret =3D SYSCALL_FAULT_BUF_SZ; + ret =3D read[i]; + if (ret < 0) + continue; + buf[ret] =3D '\0'; + + /* For strings, replace any non-printable characters with '.' */ + if (!array) { + for (int x =3D 0; x < ret; x++) { + if (!isprint(buf[x])) + buf[x] =3D '.'; + } + + /* + * If the text was truncated due to our max limit, + * add "..." to the string. 
+ */ + if (ret > SYSCALL_FAULT_USER_MAX) { + strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA, + sizeof(EXTRA)); + ret =3D SYSCALL_FAULT_USER_MAX + sizeof(EXTRA); + } else { + buf[ret++] =3D '\0'; + } } else { - buf[ret++] =3D '\0'; + ret =3D min(ret, SYSCALL_FAULT_USER_MAX); } + data_size[i] =3D ret; } =20 - *data_size =3D ret; - return buf; + return buffer; } =20 static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id) @@ -605,9 +668,10 @@ static void ftrace_syscall_enter(void *data, struct pt= _regs *regs, long id) struct trace_event_buffer fbuffer; unsigned long args[6]; char *user_ptr; - int user_size =3D 0; + int user_sizes[SYSCALL_FAULT_MAX_CNT] =3D {}; int syscall_nr; int size =3D 0; + int uargs =3D 0; bool mayfault; =20 /* @@ -640,20 +704,27 @@ static void ftrace_syscall_enter(void *data, struct p= t_regs *regs, long id) =20 if (mayfault) { struct syscall_buf_info *sinfo; + int i; =20 /* If the syscall_buffer is NULL, tracing is being shutdown */ sinfo =3D READ_ONCE(syscall_buffer); if (!sinfo) return; =20 - user_ptr =3D sys_fault_user(sys_data, sinfo, args, &user_size); + user_ptr =3D sys_fault_user(sys_data, sinfo, args, user_sizes); /* * user_size is the amount of data to append. * Need to add 4 for the meta field that points to * the user memory at the end of the event and also * stores its size. 
*/ - size =3D 4 + user_size; + for (i =3D 0; i < SYSCALL_FAULT_MAX_CNT; i++) { + if (user_sizes[i] < 0) + break; + size +=3D user_sizes[i] + 4; + } + /* Save the number of user read arguments of this syscall */ + uargs =3D i; } =20 size +=3D sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args; @@ -668,6 +739,7 @@ static void ftrace_syscall_enter(void *data, struct pt_= regs *regs, long id) memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args); =20 if (mayfault) { + char *buf =3D user_ptr; void *ptr; int val; =20 @@ -679,21 +751,30 @@ static void ftrace_syscall_enter(void *data, struct p= t_regs *regs, long id) =20 /* * The meta data will store the offset of the user data from - * the beginning of the event. + * the beginning of the event. That is after the static arguments + * and the meta data fields. */ - val =3D (ptr - (void *)entry) + 4; + val =3D (ptr - (void *)entry) + 4 * uargs; + + for (int i =3D 0; i < uargs; i++) { =20 - /* Store the offset and the size into the meta data */ - *(int *)ptr =3D val | (user_size << 16); + if (i) + val +=3D user_sizes[i - 1]; =20 - if (WARN_ON_ONCE((ptr - (void *)entry + user_size) > size)) - user_size =3D 0; + /* Store the offset and the size into the meta data */ + *(int *)ptr =3D val | (user_sizes[i] << 16); =20 - /* Nothing to do if the user space was empty or faulted */ - if (user_size) { - /* Now store the user space data into the event */ + /* Skip the meta data */ ptr +=3D 4; - memcpy(ptr, user_ptr, user_size); + } + + for (int i =3D 0; i < uargs; i++, buf +=3D SYSCALL_FAULT_ARG_SZ) { + /* Nothing to do if the user space was empty or faulted */ + if (!user_sizes[i]) + continue; + + memcpy(ptr, buf, user_sizes[i]); + ptr +=3D user_sizes[i]; } } =20 @@ -837,6 +918,7 @@ static void unreg_event_syscall_exit(struct trace_event= _file *file, static void check_faultable_syscall(struct trace_event_call *call, int nr) { struct syscall_metadata *sys_data =3D call->data; + unsigned long mask; =20 /* Only 
work on entry */ if (sys_data->enter_event !=3D call) @@ -870,7 +952,6 @@ static void check_faultable_syscall(struct trace_event_= call *call, int nr) /* user arg at position 0 */ case __NR_access: case __NR_acct: - case __NR_add_key: /* Just _type. TODO add _description */ case __NR_chdir: case __NR_chown: case __NR_chmod: @@ -879,28 +960,15 @@ static void check_faultable_syscall(struct trace_even= t_call *call, int nr) case __NR_delete_module: case __NR_execve: case __NR_fsopen: - case __NR_getxattr: /* Just pathname, TODO add name */ case __NR_lchown: - case __NR_lgetxattr: /* Just pathname, TODO add name */ - case __NR_lremovexattr: /* Just pathname, TODO add name */ - case __NR_link: /* Just oldname. TODO add newname */ - case __NR_listxattr: /* Just pathname, TODO add list */ - case __NR_llistxattr: /* Just pathname, TODO add list */ - case __NR_lsetxattr: /* Just pathname, TODO add list */ case __NR_open: case __NR_memfd_create: - case __NR_mount: /* Just dev_name, TODO add dir_name and type */ case __NR_mkdir: case __NR_mknod: case __NR_mq_open: case __NR_mq_unlink: - case __NR_pivot_root: /* Just new_root, TODO add old_root */ case __NR_readlink: - case __NR_removexattr: /* Just pathname, TODO add name */ - case __NR_rename: /* Just oldname. TODO add newname */ - case __NR_request_key: /* Just _type. TODO add _description */ case __NR_rmdir: - case __NR_setxattr: /* Just pathname, TODO add list */ case __NR_shmdt: case __NR_statfs: case __NR_swapon: @@ -927,14 +995,10 @@ static void check_faultable_syscall(struct trace_even= t_call *call, int nr) case __NR_fspick: case __NR_fremovexattr: case __NR_futimesat: - case __NR_getxattrat: /* Just pathname, TODO add name */ case __NR_inotify_add_watch: - case __NR_linkat: /* Just oldname. 
TODO add newname */ - case __NR_listxattrat: /* Just pathname, TODO add list */ case __NR_mkdirat: case __NR_mknodat: case __NR_mount_setattr: - case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */ case __NR_name_to_handle_at: #if defined(__ARCH_WANT_NEW_STAT) || defined(__ARCH_WANT_STAT64) case __NR_newfstatat: @@ -944,13 +1008,8 @@ static void check_faultable_syscall(struct trace_even= t_call *call, int nr) case __NR_open_tree: case __NR_open_tree_attr: case __NR_readlinkat: - case __NR_renameat: /* Just oldname. TODO add newname */ - case __NR_renameat2: /* Just oldname. TODO add newname */ - case __NR_removexattrat: /* Just pathname, TODO add name */ case __NR_quotactl: - case __NR_setxattrat: /* Just pathname, TODO add list */ case __NR_syslog: - case __NR_symlinkat: /* Just oldname. TODO add newname */ case __NR_statx: case __NR_unlinkat: case __NR_utimensat: @@ -965,9 +1024,52 @@ static void check_faultable_syscall(struct trace_even= t_call *call, int nr) case __NR_fanotify_mark: sys_data->user_mask =3D BIT(4); break; + /* 2 user args, 0 and 1 */ + case __NR_add_key: + case __NR_getxattr: + case __NR_lgetxattr: + case __NR_lremovexattr: + case __NR_link: + case __NR_listxattr: + case __NR_llistxattr: + case __NR_lsetxattr: + case __NR_pivot_root: + case __NR_removexattr: + case __NR_rename: + case __NR_request_key: + case __NR_setxattr: + case __NR_symlinkat: + sys_data->user_mask =3D BIT(0) | BIT(1); + break; + /* 2 user args, 1 and 3 */ + case __NR_getxattrat: + case __NR_linkat: + case __NR_listxattrat: + case __NR_move_mount: + case __NR_renameat: + case __NR_renameat2: + case __NR_removexattrat: + case __NR_setxattrat: + sys_data->user_mask =3D BIT(1) | BIT(3); + break; + case __NR_mount: /* Just dev_name and dir_name, TODO add type */ + sys_data->user_mask =3D BIT(0) | BIT(1) | BIT(2); + break; default: sys_data->user_mask =3D 0; + return; } + + if (sys_data->user_arg_size < 0) + return; + + /* + * The user_arg_size can only be used 
when the system call + * is reading only a single address from user space. + */ + mask =3D sys_data->user_mask; + if (WARN_ON(mask & (mask - 1))) + sys_data->user_arg_size =3D -1; } =20 static int __init init_syscall_trace(struct trace_event_call *call) --=20 2.50.1 From nobody Thu Oct 2 03:28:38 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 709F7324B3F; Tue, 23 Sep 2025 13:05:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632755; cv=none; b=SUWERN99TB4FlTbl/HMPNpf64McosKWK+7cezuZqSkZ5REEcSxD1aTTW1TwYOnATFnpHKLQMLEWWrgJTRMrIFIsY6vUouQZl8QByyP83oFsDII9CF8kV2lTyGPG+BuZqzO5epX09Vfc6yKGKdjSzrv3Nm/47DVIJULexA0KW2bQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632755; c=relaxed/simple; bh=5YviO26TYWuozo0wcWDzyxzCROCodBzEAXVmxAHlTPo=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=J8Eaqg9NdvlrXbBEE6Twm9ArwCn8RtLmBFhKXJ7IiwfX6qbYHhyl6hZ4U4dn9EgQ1sduD/2soXc/ctmECjILD3HBMLi/JgZ1mhB2EckQjcDP+VSOhJGMRy+FNFRDPIST5wQiOVjm5u0hjlaouR8i5Tc+GbhFZu5SRM9IJ3nw2O4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=ipk7/Flc; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="ipk7/Flc" Received: by smtp.kernel.org (Postfix) with ESMTPSA id DB423C16AAE; Tue, 23 Sep 2025 13:05:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1758632755; bh=5YviO26TYWuozo0wcWDzyxzCROCodBzEAXVmxAHlTPo=; h=Date:From:To:Cc:Subject:References:From; 
b=ipk7/Flc66LZV83dX/o1AHFjf2VGtNkzNuuzKBG/sIZd1IFyIjzdX5gSqGZ5o2CbK AUBnoxiPHRAGTIMyvFN+MaQnxWzkNI1EqQ2wsoiDNN1RvtCOsLlPYYw/TGIZ3CIRoN 3XdNLiCZAQiXrOKo5nRApG+YOdCpkTnOy8gdNzQMFs7h8apgHv7J5dG+MyjxVM2aUK PoP6i61J2Bv76MXzgmHecDq+zAvrPvkIfOruYGgC9K77+D+fuCuIBKi6QD3DISkvKq 5R+ONxxEmnbUQunF4q0VGCwKCmW+4eBEN7fa/J3GrsR8B3V0AlwKPf7m1Em9SwpCWE tlXYScGKLBO2g== Received: from rostedt by gandalf with local (Exim 4.98.2) (envelope-from ) id 1v12ji-0000000Cos5-38NT; Tue, 23 Sep 2025 09:07:14 -0400 Message-ID: <20250923130714.603760198@kernel.org> User-Agent: quilt/0.68 Date: Tue, 23 Sep 2025 09:05:04 -0400 From: Steven Rostedt To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Cc: Masami Hiramatsu , Mark Rutland , Mathieu Desnoyers , Andrew Morton , Peter Zijlstra , Namhyung Kim , Takaya Saeki , Tom Zanussi , Thomas Gleixner , Ian Rogers , Douglas Raillard Subject: [PATCH v2 7/8] tracing: Add syscall_user_buf_size to limit amount written References: <20250923130457.901085554@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Steven Rostedt When a system call event reads data from a user space address and copies it into the ring buffer, it can copy up to 511 bytes of data. This can waste precious ring buffer space if the user isn't interested in the output. Add a new file "syscall_user_buf_size" that gets initialized to a new config CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT that defaults to 63. Also lower the max down to 165, as this isn't meant to record everything that a system call may be passing through to the kernel. 165 is more than enough. The reason for 165 is that adding one byte for the nul terminator, plus four bytes for possibly appending the "..." string, turns it into 170 bytes. This needs to save up to 3 arguments, and 3 * 170 is 510, which fits nicely in 512 bytes (a power of 2).
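The size arithmetic above can be checked at compile time. The following is only a user-space sketch, not part of the patch: the macro names and values mirror the ones in the diff, EXTRA is assumed to be the "..." string, and syscall_fault_slot_offset() and clamp_user_buf_size() are hypothetical helpers added purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Values as used in this patch. */
#define SYSCALL_FAULT_USER_MAX	165
#define SYSCALL_FAULT_MAX_CNT	3

/* Assumption: EXTRA is the "..." marker, so sizeof(EXTRA) is 4 with the nul. */
#define EXTRA "..."

/* User data + nul terminating byte + room for EXTRA (4 bytes). */
#define SYSCALL_FAULT_ARG_SZ	(SYSCALL_FAULT_USER_MAX + 1 + 4)
/* One slot per user argument that may be copied. */
#define SYSCALL_FAULT_BUF_SZ	(SYSCALL_FAULT_ARG_SZ * SYSCALL_FAULT_MAX_CNT)

static_assert(sizeof(EXTRA) == 4, "EXTRA is three dots plus the nul byte");
static_assert(SYSCALL_FAULT_ARG_SZ == 170, "165 + 1 + 4");
static_assert(SYSCALL_FAULT_BUF_SZ == 510, "3 * 170");
static_assert(SYSCALL_FAULT_BUF_SZ <= 512, "fits in a 512 byte allocation");

/* Hypothetical helper: byte offset of the slot used for user argument 'arg'. */
static inline size_t syscall_fault_slot_offset(unsigned int arg)
{
	return (size_t)arg * SYSCALL_FAULT_ARG_SZ;
}

/* Hypothetical helper: the clamp applied when writing syscall_user_buf_size. */
static inline unsigned long clamp_user_buf_size(unsigned long val)
{
	return val > SYSCALL_FAULT_USER_MAX ? SYSCALL_FAULT_USER_MAX : val;
}
```

With these values the third argument's slot starts at byte 340, and any value larger than 165 written to the file is stored as 165.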
Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/20250805193235.747004484@kernel.o= rg - Change default to 63 (127 seemed too much) - Change the max to 165 to fill in the extra data. - Use the size macros of the max size and max args to calculate the size of the buffer to save the values in. Documentation/trace/ftrace.rst | 8 ++++++ kernel/trace/Kconfig | 13 +++++++++ kernel/trace/trace.c | 52 ++++++++++++++++++++++++++++++++++ kernel/trace/trace.h | 3 ++ kernel/trace/trace_syscalls.c | 42 ++++++++++++++------------- 5 files changed, 98 insertions(+), 20 deletions(-) diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst index af66a05e18cc..87fd3ed1301f 100644 --- a/Documentation/trace/ftrace.rst +++ b/Documentation/trace/ftrace.rst @@ -366,6 +366,14 @@ of ftrace. Here is a list of some of the key files: for each function. The displayed address is the patch-site address and can differ from /proc/kallsyms address. =20 + syscall_user_buf_size: + + Some system call trace events will record the data from a user + space address that one of the parameters points to. The amount of + data per event is limited. This file holds the max number of bytes + that will be recorded into the ring buffer to hold this data. + The max value is currently 165. + dyn_ftrace_total_info: =20 This file is for debugging purposes. The number of functions that diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index d2c79da81e4f..a055ca174da5 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -575,6 +575,19 @@ config FTRACE_SYSCALLS help Basic tracer to catch the syscall entry and exit events. =20 +config TRACE_SYSCALL_BUF_SIZE_DEFAULT + int "System call user read max size" + range 0 165 + default 63 + depends on FTRACE_SYSCALLS + help + Some system call trace events will record the data from a user + space address that one of the parameters points to. The amount of + data per event is limited.
It may be further limited by this + config and later changed by writing an ASCII number into: + + /sys/kernel/tracing/syscall_user_buf_size + config TRACER_SNAPSHOT bool "Create a snapshot trace buffer" select TRACER_MAX_TRACE diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 1b7db732c0b1..a3d2e7d1c664 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -6913,6 +6913,43 @@ static ssize_t tracing_splice_read_pipe(struct file = *filp, goto out; } =20 +static ssize_t +tracing_syscall_buf_read(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct inode *inode =3D file_inode(filp); + struct trace_array *tr =3D inode->i_private; + char buf[64]; + int r; + + r =3D snprintf(buf, 64, "%d\n", tr->syscall_buf_sz); + + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); +} + +static ssize_t +tracing_syscall_buf_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct inode *inode =3D file_inode(filp); + struct trace_array *tr =3D inode->i_private; + unsigned long val; + int ret; + + ret =3D kstrtoul_from_user(ubuf, cnt, 10, &val); + if (ret) + return ret; + + if (val > SYSCALL_FAULT_USER_MAX) + val =3D SYSCALL_FAULT_USER_MAX; + + tr->syscall_buf_sz =3D val; + + *ppos +=3D cnt; + + return cnt; +} + static ssize_t tracing_entries_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) @@ -7737,6 +7774,14 @@ static const struct file_operations tracing_entries_= fops =3D { .release =3D tracing_release_generic_tr, }; =20 +static const struct file_operations tracing_syscall_buf_fops =3D { + .open =3D tracing_open_generic_tr, + .read =3D tracing_syscall_buf_read, + .write =3D tracing_syscall_buf_write, + .llseek =3D generic_file_llseek, + .release =3D tracing_release_generic_tr, +}; + static const struct file_operations tracing_buffer_meta_fops =3D { .open =3D tracing_buffer_meta_open, .read =3D seq_read, @@ -9839,6 +9884,8 @@ trace_array_create_systems(const char *name, const ch= ar 
*systems, =20 raw_spin_lock_init(&tr->start_lock); =20 + tr->syscall_buf_sz =3D global_trace.syscall_buf_sz; + tr->max_lock =3D (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; #ifdef CONFIG_TRACER_MAX_TRACE spin_lock_init(&tr->snapshot_trigger_lock); @@ -10155,6 +10202,9 @@ init_tracer_tracefs(struct trace_array *tr, struct = dentry *d_tracer) trace_create_file("buffer_subbuf_size_kb", TRACE_MODE_WRITE, d_tracer, tr, &buffer_subbuf_size_fops); =20 + trace_create_file("syscall_user_buf_size", TRACE_MODE_WRITE, d_tracer, + tr, &tracing_syscall_buf_fops); + create_trace_options_dir(tr); =20 #ifdef CONFIG_TRACER_MAX_TRACE @@ -11081,6 +11131,8 @@ __init static int tracer_alloc_buffers(void) =20 global_trace.flags =3D TRACE_ARRAY_FL_GLOBAL; =20 + global_trace.syscall_buf_sz =3D CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT; + INIT_LIST_HEAD(&global_trace.systems); INIT_LIST_HEAD(&global_trace.events); INIT_LIST_HEAD(&global_trace.hist_vars); diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 85eabb454bee..0499e6dd51fa 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -131,6 +131,8 @@ enum trace_type { #define HIST_STACKTRACE_SIZE (HIST_STACKTRACE_DEPTH * sizeof(unsigned long= )) #define HIST_STACKTRACE_SKIP 5 =20 +#define SYSCALL_FAULT_USER_MAX 165 + /* * syscalls are special, and need special handling, this is why * they are not included in trace_entries.h @@ -430,6 +432,7 @@ struct trace_array { int function_enabled; #endif int no_filter_buffering_ref; + unsigned int syscall_buf_sz; struct list_head hist_vars; #ifdef CONFIG_TRACER_SNAPSHOT struct cond_snapshot *cond_snapshot; diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index b602c9a7dbd8..367e10096c6f 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -399,24 +399,21 @@ struct syscall_buf_info { /* * Create a per CPU temporary buffer to copy user space pointers into. 
* - * SYSCALL_FAULT_BUF_SZ holds the size of the per CPU buffer to use - * to copy memory from user space addresses into. - * - * SYSCALL_FAULT_ARG_SZ is the amount to copy from user space. - * - * SYSCALL_FAULT_USER_MAX is the amount to copy into the ring buffer. - * It's slightly smaller than SYSCALL_FAULT_ARG_SZ to know if it - * needs to append the EXTRA or not. + * SYSCALL_FAULT_USER_MAX is the amount to copy from user space. + * (defined in kernel/trace/trace.h) + + * SYSCALL_FAULT_ARG_SZ is the amount to copy from user space plus the + * nul terminating byte and possibly appended EXTRA (4 bytes). * - * This only allows up to 3 args from system calls. + * SYSCALL_FAULT_BUF_SZ holds the size of the per CPU buffer to use + * to copy memory from user space addresses into that will hold + * 3 args as only 3 args are allowed to be copied from system calls. */ -#define SYSCALL_FAULT_BUF_SZ 512 -#define SYSCALL_FAULT_ARG_SZ 168 -#define SYSCALL_FAULT_USER_MAX 128 +#define SYSCALL_FAULT_ARG_SZ (SYSCALL_FAULT_USER_MAX + 1 + 4) #define SYSCALL_FAULT_MAX_CNT 3 +#define SYSCALL_FAULT_BUF_SZ (SYSCALL_FAULT_ARG_SZ * SYSCALL_FAULT_MAX_CNT) =20 static struct syscall_buf_info *syscall_buffer; - static int syscall_fault_buffer_cnt; =20 static void syscall_fault_buffer_free(struct syscall_buf_info *sinfo) @@ -499,7 +496,7 @@ static void syscall_fault_buffer_disable(void) call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer); } =20 -static char *sys_fault_user(struct syscall_metadata *sys_data, +static char *sys_fault_user(struct trace_array *tr, struct syscall_metadat= a *sys_data, struct syscall_buf_info *sinfo, unsigned long *args, unsigned int data_size[SYSCALL_FAULT_MAX_CNT]) @@ -552,6 +549,10 @@ static char *sys_fault_user(struct syscall_metadata *s= ys_data, data_size[i] =3D -1; /* Denotes no pointer */ } =20 + /* A zero size means do not even try */ + if (!tr->syscall_buf_sz) + return buffer; + /* * This acts similar to a seqcount. 
The per CPU context switches are * recorded, migration is disabled and preemption is enabled. The @@ -639,19 +640,20 @@ static char *sys_fault_user(struct syscall_metadata *= sys_data, buf[x] =3D '.'; } =20 + size =3D min(tr->syscall_buf_sz, SYSCALL_FAULT_USER_MAX); + /* * If the text was truncated due to our max limit, * add "..." to the string. */ - if (ret > SYSCALL_FAULT_USER_MAX) { - strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA, - sizeof(EXTRA)); - ret =3D SYSCALL_FAULT_USER_MAX + sizeof(EXTRA); + if (ret > size) { + strscpy(buf + size, EXTRA, sizeof(EXTRA)); + ret =3D size + sizeof(EXTRA); } else { buf[ret++] =3D '\0'; } } else { - ret =3D min(ret, SYSCALL_FAULT_USER_MAX); + ret =3D min((unsigned int)ret, tr->syscall_buf_sz); } data_size[i] =3D ret; } @@ -711,7 +713,7 @@ static void ftrace_syscall_enter(void *data, struct pt_= regs *regs, long id) if (!sinfo) return; =20 - user_ptr =3D sys_fault_user(sys_data, sinfo, args, user_sizes); + user_ptr =3D sys_fault_user(tr, sys_data, sinfo, args, user_sizes); /* * user_size is the amount of data to append. 
* Need to add 4 for the meta field that points to --=20 2.50.1 From nobody Thu Oct 2 03:28:38 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 42454324B35; Tue, 23 Sep 2025 13:05:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632755; cv=none; b=LUMB4RYkrA1D/uxffyus2rW2DfkU+UD0hJDwC0Pkin+Pke4FR5qXvX8MLWhmRAJNG+Jc3AsT15+D8Csv3OADm3PfBNDBHD4AeOcOfFej3FHYC4VEC1hw3WH6WKCmdtN/masyZmHPV+PyKVOgBwubslqVm9A4OA3GC+qDmJM74TA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758632755; c=relaxed/simple; bh=oQMySvgGa4+mOAny3q24Ce7lMpaNWT+0gsm+tLY4VAM=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=FsskX1GaWP09e6+LLQWb/lfg1MYswxFVGh+lXyyUSJISBm9zcXKlJnBYjkUOkqIyotZSSIarLuZgruNFhBbOnKNC31nJqroevDJlEvZoT4fAzniBTqoYNYTM8YYF1rx79l7AoIzTEZaLJJnVjwtnLyRLZvjP8bK4ahVEXDhS+0A= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=luTJk2sN; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="luTJk2sN" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1D115C4CEF5; Tue, 23 Sep 2025 13:05:55 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1758632755; bh=oQMySvgGa4+mOAny3q24Ce7lMpaNWT+0gsm+tLY4VAM=; h=Date:From:To:Cc:Subject:References:From; b=luTJk2sNjpTifGvz8vUq+lURwsLbIQZ7y9NznP4dt+s3oGocAJKm6fPIPL7/O8BuD C+znQV2Dw9rFnk4O5J5uafpuu0PInnh9da2WosDyYB8fv47n2TRnEnUKjope7cFhbp 0h64pTyYifPid9L1XmXvcUDDxyTp5lE268SwaleHRpndBZUdsd3NScXZkd3rmOnr04 
sGwMdVy3P1YZIoofVOwdXzYPpIucnzhaU6zyrzvZP/+GqPq9hbsDcBkq40GDjc6Hsc BGq09fAqDHAw3sjWFhXO1h4HBV0ql2YVnKqpzLYcbdy8WLnw3CtIGdio0sGfvw5WKD pJIOEzvZgH3Bg== Received: from rostedt by gandalf with local (Exim 4.98.2) (envelope-from ) id 1v12ji-0000000Cosb-3qR3; Tue, 23 Sep 2025 09:07:14 -0400 Message-ID: <20250923130714.766397031@kernel.org> User-Agent: quilt/0.68 Date: Tue, 23 Sep 2025 09:05:05 -0400 From: Steven Rostedt To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Cc: Masami Hiramatsu , Mark Rutland , Mathieu Desnoyers , Andrew Morton , Peter Zijlstra , Namhyung Kim , Takaya Saeki , Tom Zanussi , Thomas Gleixner , Ian Rogers , Douglas Raillard Subject: [PATCH v2 8/8] tracing: Show printable characters in syscall arrays References: <20250923130457.901085554@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Steven Rostedt When displaying the contents of the user space data passed to the kernel, instead of just showing the array values, also print any printable content. Instead of just: bash-1113 [003] ..... 3433.290654: sys_write(fd: 2, buf: 0x555a8deedd= b0 (72:6f:6f:74:40:64:65:62:69:61:6e:2d:78:38:36:2d:36:34:3a:7e:23:20), cou= nt: 0x16) Display: bash-1113 [003] ..... 
3433.290654: sys_write(fd: 2, buf: 0x555a8deedd= b0 (72:6f:6f:74:40:64:65:62:69:61:6e:2d:78:38:36:2d:36:34:3a:7e:23:20) "roo= t@debian-x86-64:~# ", count: 0x16) Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_syscalls.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 367e10096c6f..0625a32f01dd 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -155,6 +155,8 @@ print_syscall_enter(struct trace_iterator *iter, int fl= ags, trace_seq_printf(s, "%s(", entry->name); =20 for (i =3D 0; i < entry->nb_args; i++) { + bool printable =3D false; + char *str; =20 if (trace_seq_has_overflowed(s)) goto end; @@ -193,8 +195,11 @@ print_syscall_enter(struct trace_iterator *iter, int f= lags, =20 val =3D trace->args[entry->user_arg_size]; =20 + str =3D ptr; trace_seq_puts(s, " ("); for (int x =3D 0; x < len; x++, ptr++) { + if (isascii(*ptr) && isprint(*ptr)) + printable =3D true; if (x) trace_seq_putc(s, ':'); trace_seq_printf(s, "%02x", *ptr); @@ -203,6 +208,22 @@ print_syscall_enter(struct trace_iterator *iter, int f= lags, trace_seq_printf(s, ", %s", EXTRA); =20 trace_seq_putc(s, ')'); + + /* If nothing is printable, don't bother printing anything */ + if (!printable) + continue; + + trace_seq_puts(s, " \""); + for (int x =3D 0; x < len; x++) { + if (isascii(str[x]) && isprint(str[x])) + trace_seq_putc(s, str[x]); + else + trace_seq_putc(s, '.'); + } + if (len < val) + trace_seq_printf(s, "\"%s", EXTRA); + else + trace_seq_putc(s, '"'); } =20 trace_seq_putc(s, ')'); --=20 2.50.1
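The output format that patch 8/8 adds to print_syscall_enter() can be tried out in user space. This is only a rough sketch under stated assumptions, not the kernel code: struct out_buf, out_putc(), out_puts() and render_user_array() are hypothetical stand-ins for the trace_seq helpers, the "byte < 0x80" test stands in for isascii(), and the EXTRA marker for truncated data is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a trace_seq: a fixed buffer that drops overflow. */
struct out_buf {
	char buf[256];
	size_t len;
};

static void out_putc(struct out_buf *o, char c)
{
	/* Always leave room for the nul terminating byte. */
	if (o->len + 1 < sizeof(o->buf)) {
		o->buf[o->len++] = c;
		o->buf[o->len] = '\0';
	}
}

static void out_puts(struct out_buf *o, const char *s)
{
	while (*s)
		out_putc(o, *s++);
}

/* ASCII and printable; "c < 0x80" is used here in place of isascii(). */
static int is_printable(unsigned char c)
{
	return c < 0x80 && isprint(c);
}

/*
 * Emit " (xx:xx:...)" followed by the quoted text, where every
 * non-printable byte is shown as '.'. The quoted part is skipped
 * entirely when no byte is printable.
 */
static void render_user_array(struct out_buf *o,
			      const unsigned char *data, size_t len)
{
	static const char hex[] = "0123456789abcdef";
	int printable = 0;

	assert(o && data);

	out_puts(o, " (");
	for (size_t x = 0; x < len; x++) {
		if (is_printable(data[x]))
			printable = 1;
		if (x)
			out_putc(o, ':');
		out_putc(o, hex[data[x] >> 4]);
		out_putc(o, hex[data[x] & 0xf]);
	}
	out_putc(o, ')');

	/* If nothing is printable, don't bother printing anything */
	if (!printable)
		return;

	out_puts(o, " \"");
	for (size_t x = 0; x < len; x++)
		out_putc(o, is_printable(data[x]) ? (char)data[x] : '.');
	out_putc(o, '"');
}
```

A caller starts from a zeroed struct out_buf and passes the raw bytes with their length; the result accumulates in the buf member.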