[PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()

Steven Rostedt posted 7 patches 2 months ago
There is a newer version of this series
[PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
Posted by Steven Rostedt 2 months ago
From: Steven Rostedt <rostedt@goodmis.org>

The syscall events are pseudo events that hook to the raw syscalls. The
ftrace_syscall_enter/exit() callback is called by the raw_syscall
enter/exit tracepoints respectively whenever any of the syscall events are
enabled.

The trace_array has an array of syscall "files" that correspond to the
system calls based on their __NR_SYSCALL number. The array is read and if
there's a pointer to a trace_event_file then it is considered enabled and
if it is NULL that syscall event is considered disabled.

Currently it uses an rcu_dereference_sched() to get this pointer and a
rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary
as the file pointer will not go away outside the synchronization of the
tracepoint logic itself. And this code adds no extra RCU synchronization
that uses this.

Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which
is all they need. This will also allow this code to not depend on
preemption being disabled as system call tracepoints are now allowed to
fault.

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace_syscalls.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 46aab0ab9350..3a0b65f89130 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -310,8 +310,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
 
-	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
-	trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
+	trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
 	if (!trace_file)
 		return;
 
@@ -356,8 +355,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
 
-	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
-	trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
+	trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
 	if (!trace_file)
 		return;
 
@@ -393,7 +391,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
 	if (!tr->sys_refcount_enter)
 		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
 	if (!ret) {
-		rcu_assign_pointer(tr->enter_syscall_files[num], file);
+		WRITE_ONCE(tr->enter_syscall_files[num], file);
 		tr->sys_refcount_enter++;
 	}
 	mutex_unlock(&syscall_trace_lock);
@@ -411,7 +409,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
 		return;
 	mutex_lock(&syscall_trace_lock);
 	tr->sys_refcount_enter--;
-	RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL);
+	WRITE_ONCE(tr->enter_syscall_files[num], NULL);
 	if (!tr->sys_refcount_enter)
 		unregister_trace_sys_enter(ftrace_syscall_enter, tr);
 	mutex_unlock(&syscall_trace_lock);
@@ -431,7 +429,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
 	if (!tr->sys_refcount_exit)
 		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
 	if (!ret) {
-		rcu_assign_pointer(tr->exit_syscall_files[num], file);
+		WRITE_ONCE(tr->exit_syscall_files[num], file);
 		tr->sys_refcount_exit++;
 	}
 	mutex_unlock(&syscall_trace_lock);
@@ -449,7 +447,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
 		return;
 	mutex_lock(&syscall_trace_lock);
 	tr->sys_refcount_exit--;
-	RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL);
+	WRITE_ONCE(tr->exit_syscall_files[num], NULL);
 	if (!tr->sys_refcount_exit)
 		unregister_trace_sys_exit(ftrace_syscall_exit, tr);
 	mutex_unlock(&syscall_trace_lock);
-- 
2.47.2
Re: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
Posted by kernel test robot 1 month, 4 weeks ago
Hi Steven,

kernel test robot noticed the following build warnings:

[auto build test WARNING on trace/for-next]
[also build test WARNING on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/20250805193234.745705874%40kernel.org
patch subject: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
config: x86_64-randconfig-123-20250806 (https://download.01.org/0day-ci/archive/20250807/202508070546.KmEoxWkg-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250807/202508070546.KmEoxWkg-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508070546.KmEoxWkg-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/trace/trace_syscalls.c:313:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file *trace_file @@     got struct trace_event_file [noderef] __rcu * @@
   kernel/trace/trace_syscalls.c:313:20: sparse:     expected struct trace_event_file *trace_file
   kernel/trace/trace_syscalls.c:313:20: sparse:     got struct trace_event_file [noderef] __rcu *
   kernel/trace/trace_syscalls.c:358:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file *trace_file @@     got struct trace_event_file [noderef] __rcu * @@
   kernel/trace/trace_syscalls.c:358:20: sparse:     expected struct trace_event_file *trace_file
   kernel/trace/trace_syscalls.c:358:20: sparse:     got struct trace_event_file [noderef] __rcu *
>> kernel/trace/trace_syscalls.c:394:17: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file [noderef] __rcu *volatile @@     got struct trace_event_file *file @@
   kernel/trace/trace_syscalls.c:394:17: sparse:     expected struct trace_event_file [noderef] __rcu *volatile
   kernel/trace/trace_syscalls.c:394:17: sparse:     got struct trace_event_file *file
   kernel/trace/trace_syscalls.c:432:17: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file [noderef] __rcu *volatile @@     got struct trace_event_file *file @@
   kernel/trace/trace_syscalls.c:432:17: sparse:     expected struct trace_event_file [noderef] __rcu *volatile
   kernel/trace/trace_syscalls.c:432:17: sparse:     got struct trace_event_file *file

vim +313 kernel/trace/trace_syscalls.c

   290	
   291	static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
   292	{
   293		struct trace_array *tr = data;
   294		struct trace_event_file *trace_file;
   295		struct syscall_trace_enter *entry;
   296		struct syscall_metadata *sys_data;
   297		struct trace_event_buffer fbuffer;
   298		unsigned long args[6];
   299		int syscall_nr;
   300		int size;
   301	
   302		/*
   303		 * Syscall probe called with preemption enabled, but the ring
   304		 * buffer and per-cpu data require preemption to be disabled.
   305		 */
   306		might_fault();
   307		guard(preempt_notrace)();
   308	
   309		syscall_nr = trace_get_syscall_nr(current, regs);
   310		if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
   311			return;
   312	
 > 313		trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
   314		if (!trace_file)
   315			return;
   316	
   317		if (trace_trigger_soft_disabled(trace_file))
   318			return;
   319	
   320		sys_data = syscall_nr_to_meta(syscall_nr);
   321		if (!sys_data)
   322			return;
   323	
   324		size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
   325	
   326		entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
   327		if (!entry)
   328			return;
   329	
   330		entry = ring_buffer_event_data(fbuffer.event);
   331		entry->nr = syscall_nr;
   332		syscall_get_arguments(current, regs, args);
   333		memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
   334	
   335		trace_event_buffer_commit(&fbuffer);
   336	}
   337	
   338	static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
   339	{
   340		struct trace_array *tr = data;
   341		struct trace_event_file *trace_file;
   342		struct syscall_trace_exit *entry;
   343		struct syscall_metadata *sys_data;
   344		struct trace_event_buffer fbuffer;
   345		int syscall_nr;
   346	
   347		/*
   348		 * Syscall probe called with preemption enabled, but the ring
   349		 * buffer and per-cpu data require preemption to be disabled.
   350		 */
   351		might_fault();
   352		guard(preempt_notrace)();
   353	
   354		syscall_nr = trace_get_syscall_nr(current, regs);
   355		if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
   356			return;
   357	
   358		trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
   359		if (!trace_file)
   360			return;
   361	
   362		if (trace_trigger_soft_disabled(trace_file))
   363			return;
   364	
   365		sys_data = syscall_nr_to_meta(syscall_nr);
   366		if (!sys_data)
   367			return;
   368	
   369		entry = trace_event_buffer_reserve(&fbuffer, trace_file, sizeof(*entry));
   370		if (!entry)
   371			return;
   372	
   373		entry = ring_buffer_event_data(fbuffer.event);
   374		entry->nr = syscall_nr;
   375		entry->ret = syscall_get_return_value(current, regs);
   376	
   377		trace_event_buffer_commit(&fbuffer);
   378	}
   379	
   380	static int reg_event_syscall_enter(struct trace_event_file *file,
   381					   struct trace_event_call *call)
   382	{
   383		struct trace_array *tr = file->tr;
   384		int ret = 0;
   385		int num;
   386	
   387		num = ((struct syscall_metadata *)call->data)->syscall_nr;
   388		if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
   389			return -ENOSYS;
   390		mutex_lock(&syscall_trace_lock);
   391		if (!tr->sys_refcount_enter)
   392			ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
   393		if (!ret) {
 > 394			WRITE_ONCE(tr->enter_syscall_files[num], file);
   395			tr->sys_refcount_enter++;
   396		}
   397		mutex_unlock(&syscall_trace_lock);
   398		return ret;
   399	}
   400	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
Posted by Paul E. McKenney 1 month, 4 weeks ago
On Tue, Aug 05, 2025 at 03:26:47PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> The syscall events are pseudo events that hook to the raw syscalls. The
> ftrace_syscall_enter/exit() callback is called by the raw_syscall
> enter/exit tracepoints respectively whenever any of the syscall events are
> enabled.
> 
> The trace_array has an array of syscall "files" that correspond to the
> system calls based on their __NR_SYSCALL number. The array is read and if
> there's a pointer to a trace_event_file then it is considered enabled and
> if it is NULL that syscall event is considered disabled.
> 
> Currently it uses an rcu_dereference_sched() to get this pointer and a
> rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary
> as the file pointer will not go away outside the synchronization of the
> tracepoint logic itself. And this code adds no extra RCU synchronization
> that uses this.
> 
> Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which
> is all they need. This will also allow this code to not depend on
> preemption being disabled as system call tracepoints are now allowed to
> fault.
> 
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

From an RCU-removal viewpoint:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

But is it possible to give some sort of warning just in case some creative
future developer figures out how to make the file pointer go away outside
of the synchronization of the tracepoint logic itself?

							Thanx, Paul

> ---
>  kernel/trace/trace_syscalls.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 46aab0ab9350..3a0b65f89130 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -310,8 +310,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
>  	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
>  		return;
>  
> -	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
> -	trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
> +	trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
>  	if (!trace_file)
>  		return;
>  
> @@ -356,8 +355,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
>  	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
>  		return;
>  
> -	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
> -	trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
> +	trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
>  	if (!trace_file)
>  		return;
>  
> @@ -393,7 +391,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
>  	if (!tr->sys_refcount_enter)
>  		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
>  	if (!ret) {
> -		rcu_assign_pointer(tr->enter_syscall_files[num], file);
> +		WRITE_ONCE(tr->enter_syscall_files[num], file);
>  		tr->sys_refcount_enter++;
>  	}
>  	mutex_unlock(&syscall_trace_lock);
> @@ -411,7 +409,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
>  		return;
>  	mutex_lock(&syscall_trace_lock);
>  	tr->sys_refcount_enter--;
> -	RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL);
> +	WRITE_ONCE(tr->enter_syscall_files[num], NULL);
>  	if (!tr->sys_refcount_enter)
>  		unregister_trace_sys_enter(ftrace_syscall_enter, tr);
>  	mutex_unlock(&syscall_trace_lock);
> @@ -431,7 +429,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
>  	if (!tr->sys_refcount_exit)
>  		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
>  	if (!ret) {
> -		rcu_assign_pointer(tr->exit_syscall_files[num], file);
> +		WRITE_ONCE(tr->exit_syscall_files[num], file);
>  		tr->sys_refcount_exit++;
>  	}
>  	mutex_unlock(&syscall_trace_lock);
> @@ -449,7 +447,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
>  		return;
>  	mutex_lock(&syscall_trace_lock);
>  	tr->sys_refcount_exit--;
> -	RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL);
> +	WRITE_ONCE(tr->exit_syscall_files[num], NULL);
>  	if (!tr->sys_refcount_exit)
>  		unregister_trace_sys_exit(ftrace_syscall_exit, tr);
>  	mutex_unlock(&syscall_trace_lock);
> -- 
> 2.47.2
> 
>
Re: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
Posted by Steven Rostedt 3 weeks, 2 days ago
Finally have time to get back to these patches.

On Wed, 6 Aug 2025 11:39:57 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Tue, Aug 05, 2025 at 03:26:47PM -0400, Steven Rostedt wrote:
> > From: Steven Rostedt <rostedt@goodmis.org>
> > 
> > The syscall events are pseudo events that hook to the raw syscalls. The
> > ftrace_syscall_enter/exit() callback is called by the raw_syscall
> > enter/exit tracepoints respectively whenever any of the syscall events are
> > enabled.
> > 
> > The trace_array has an array of syscall "files" that correspond to the
> > system calls based on their __NR_SYSCALL number. The array is read and if
> > there's a pointer to a trace_event_file then it is considered enabled and
> > if it is NULL that syscall event is considered disabled.
> > 
> > Currently it uses an rcu_dereference_sched() to get this pointer and a
> > rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary
> > as the file pointer will not go away outside the synchronization of the
> > tracepoint logic itself. And this code adds no extra RCU synchronization
> > that uses this.
> > 
> > Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which
> > is all they need. This will also allow this code to not depend on
> > preemption being disabled as system call tracepoints are now allowed to
> > fault.
> > 
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>  
> 
> >From an RCU-removal viewpoint:  
> 
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

Thanks for the review. I'm also removing the __rcu that triggered the bot.

> 
> But is it possible to give some sort of warning just in case some creative
> future developer figures out how to make the file pointer go away outside
> of the synchronization of the tracepoint logic itself?

That would be quite a big change, and since this is the core code to it,
that future change should fix up this code as well. All the modification
happens in this file so nothing should be hidden.

If they do get it wrong, it should crash pretty amazingly if there's any
testing ;-)

-- Steve