From: Steven Rostedt <rostedt@goodmis.org>
The syscall events are pseudo events that hook to the raw syscalls. The
ftrace_syscall_enter/exit() callback is called by the raw_syscall
enter/exit tracepoints respectively whenever any of the syscall events are
enabled.
The trace_array has an array of syscall "files" that correspond to the
system calls based on their __NR_SYSCALL number. The array is read and if
there's a pointer to a trace_event_file then it is considered enabled and
if it is NULL that syscall event is considered disabled.
Currently it uses an rcu_dereference_sched() to get this pointer and a
rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary
as the file pointer will not go away outside the synchronization of the
tracepoint logic itself. And this code adds no extra RCU synchronization
that uses this.
Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which
is all they need. This will also allow this code to not depend on
preemption being disabled as system call tracepoints are now allowed to
fault.
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace_syscalls.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 46aab0ab9350..3a0b65f89130 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -310,8 +310,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
- /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
- trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
+ trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
if (!trace_file)
return;
@@ -356,8 +355,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
- /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
- trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
+ trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
if (!trace_file)
return;
@@ -393,7 +391,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
if (!tr->sys_refcount_enter)
ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
if (!ret) {
- rcu_assign_pointer(tr->enter_syscall_files[num], file);
+ WRITE_ONCE(tr->enter_syscall_files[num], file);
tr->sys_refcount_enter++;
}
mutex_unlock(&syscall_trace_lock);
@@ -411,7 +409,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
return;
mutex_lock(&syscall_trace_lock);
tr->sys_refcount_enter--;
- RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL);
+ WRITE_ONCE(tr->enter_syscall_files[num], NULL);
if (!tr->sys_refcount_enter)
unregister_trace_sys_enter(ftrace_syscall_enter, tr);
mutex_unlock(&syscall_trace_lock);
@@ -431,7 +429,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
if (!tr->sys_refcount_exit)
ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
if (!ret) {
- rcu_assign_pointer(tr->exit_syscall_files[num], file);
+ WRITE_ONCE(tr->exit_syscall_files[num], file);
tr->sys_refcount_exit++;
}
mutex_unlock(&syscall_trace_lock);
@@ -449,7 +447,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
return;
mutex_lock(&syscall_trace_lock);
tr->sys_refcount_exit--;
- RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL);
+ WRITE_ONCE(tr->exit_syscall_files[num], NULL);
if (!tr->sys_refcount_exit)
unregister_trace_sys_exit(ftrace_syscall_exit, tr);
mutex_unlock(&syscall_trace_lock);
--
2.47.2
Hi Steven, kernel test robot noticed the following build warnings: [auto build test WARNING on trace/for-next] [also build test WARNING on linus/master v6.16 next-20250806] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312 base: https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next patch link: https://lore.kernel.org/r/20250805193234.745705874%40kernel.org patch subject: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() config: x86_64-randconfig-123-20250806 (https://download.01.org/0day-ci/archive/20250807/202508070546.KmEoxWkg-lkp@intel.com/config) compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250807/202508070546.KmEoxWkg-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202508070546.KmEoxWkg-lkp@intel.com/ sparse warnings: (new ones prefixed by >>) >> kernel/trace/trace_syscalls.c:313:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file *trace_file @@ got struct trace_event_file [noderef] __rcu * @@ kernel/trace/trace_syscalls.c:313:20: sparse: expected struct trace_event_file *trace_file kernel/trace/trace_syscalls.c:313:20: sparse: got struct trace_event_file [noderef] __rcu * kernel/trace/trace_syscalls.c:358:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file *trace_file @@ got struct trace_event_file [noderef] __rcu * @@ kernel/trace/trace_syscalls.c:358:20: sparse: expected struct trace_event_file *trace_file kernel/trace/trace_syscalls.c:358:20: sparse: got struct trace_event_file [noderef] __rcu * >> kernel/trace/trace_syscalls.c:394:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file [noderef] __rcu *volatile @@ got struct trace_event_file *file @@ kernel/trace/trace_syscalls.c:394:17: sparse: expected struct trace_event_file [noderef] __rcu *volatile kernel/trace/trace_syscalls.c:394:17: sparse: got struct trace_event_file *file kernel/trace/trace_syscalls.c:432:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file [noderef] __rcu *volatile @@ got struct trace_event_file *file @@ kernel/trace/trace_syscalls.c:432:17: sparse: expected struct trace_event_file [noderef] __rcu *volatile kernel/trace/trace_syscalls.c:432:17: sparse: got struct trace_event_file *file vim +313 kernel/trace/trace_syscalls.c 290 291 static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id) 292 { 293 struct trace_array *tr = data; 294 struct trace_event_file *trace_file; 295 struct syscall_trace_enter *entry; 296 struct syscall_metadata *sys_data; 297 struct trace_event_buffer fbuffer; 298 unsigned long args[6]; 299 int syscall_nr; 300 int size; 301 302 /* 303 * Syscall probe called with preemption enabled, but the ring 304 * buffer and per-cpu data require preemption to be disabled. 305 */ 306 might_fault(); 307 guard(preempt_notrace)(); 308 309 syscall_nr = trace_get_syscall_nr(current, regs); 310 if (syscall_nr < 0 || syscall_nr >= NR_syscalls) 311 return; 312 > 313 trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]); 314 if (!trace_file) 315 return; 316 317 if (trace_trigger_soft_disabled(trace_file)) 318 return; 319 320 sys_data = syscall_nr_to_meta(syscall_nr); 321 if (!sys_data) 322 return; 323 324 size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args; 325 326 entry = trace_event_buffer_reserve(&fbuffer, trace_file, size); 327 if (!entry) 328 return; 329 330 entry = ring_buffer_event_data(fbuffer.event); 331 entry->nr = syscall_nr; 332 syscall_get_arguments(current, regs, args); 333 memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args); 334 335 trace_event_buffer_commit(&fbuffer); 336 } 337 338 static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret) 339 { 340 struct trace_array *tr = data; 341 struct trace_event_file *trace_file; 342 struct syscall_trace_exit *entry; 343 struct syscall_metadata *sys_data; 344 struct trace_event_buffer fbuffer; 345 int syscall_nr; 346 347 /* 348 * Syscall probe called with preemption enabled, but the ring 349 * buffer and per-cpu data require preemption to be disabled. 350 */ 351 might_fault(); 352 guard(preempt_notrace)(); 353 354 syscall_nr = trace_get_syscall_nr(current, regs); 355 if (syscall_nr < 0 || syscall_nr >= NR_syscalls) 356 return; 357 358 trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]); 359 if (!trace_file) 360 return; 361 362 if (trace_trigger_soft_disabled(trace_file)) 363 return; 364 365 sys_data = syscall_nr_to_meta(syscall_nr); 366 if (!sys_data) 367 return; 368 369 entry = trace_event_buffer_reserve(&fbuffer, trace_file, sizeof(*entry)); 370 if (!entry) 371 return; 372 373 entry = ring_buffer_event_data(fbuffer.event); 374 entry->nr = syscall_nr; 375 entry->ret = syscall_get_return_value(current, regs); 376 377 trace_event_buffer_commit(&fbuffer); 378 } 379 380 static int reg_event_syscall_enter(struct trace_event_file *file, 381 struct trace_event_call *call) 382 { 383 struct trace_array *tr = file->tr; 384 int ret = 0; 385 int num; 386 387 num = ((struct syscall_metadata *)call->data)->syscall_nr; 388 if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls)) 389 return -ENOSYS; 390 mutex_lock(&syscall_trace_lock); 391 if (!tr->sys_refcount_enter) 392 ret = register_trace_sys_enter(ftrace_syscall_enter, tr); 393 if (!ret) { > 394 WRITE_ONCE(tr->enter_syscall_files[num], file); 395 tr->sys_refcount_enter++; 396 } 397 mutex_unlock(&syscall_trace_lock); 398 return ret; 399 } 400 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
On Tue, Aug 05, 2025 at 03:26:47PM -0400, Steven Rostedt wrote: > From: Steven Rostedt <rostedt@goodmis.org> > > The syscall events are pseudo events that hook to the raw syscalls. The > ftrace_syscall_enter/exit() callback is called by the raw_syscall > enter/exit tracepoints respectively whenever any of the syscall events are > enabled. > > The trace_array has an array of syscall "files" that correspond to the > system calls based on their __NR_SYSCALL number. The array is read and if > there's a pointer to a trace_event_file then it is considered enabled and > if it is NULL that syscall event is considered disabled. > > Currently it uses an rcu_dereference_sched() to get this pointer and a > rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary > as the file pointer will not go away outside the synchronization of the > tracepoint logic itself. And this code adds no extra RCU synchronization > that uses this. > > Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which > is all they need. This will also allow this code to not depend on > preemption being disabled as system call tracepoints are now allowed to > fault. > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> From an RCU-removal viewpoint: Reviewed-by: Paul E. McKenney <paulmck@kernel.org> But is it possible to give some sort of warning just in case some creative future developer figures out how to make the file pointer go away outside of the synchronization of the tracepoint logic itself? Thanx, Paul > --- > kernel/trace/trace_syscalls.c | 14 ++++++-------- > 1 file changed, 6 insertions(+), 8 deletions(-) > > diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c > index 46aab0ab9350..3a0b65f89130 100644 > --- a/kernel/trace/trace_syscalls.c > +++ b/kernel/trace/trace_syscalls.c > @@ -310,8 +310,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id) > if (syscall_nr < 0 || syscall_nr >= NR_syscalls) > return; > > - /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */ > - trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]); > + trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]); > if (!trace_file) > return; > > @@ -356,8 +355,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret) > if (syscall_nr < 0 || syscall_nr >= NR_syscalls) > return; > > - /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */ > - trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]); > + trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]); > if (!trace_file) > return; > > @@ -393,7 +391,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file, > if (!tr->sys_refcount_enter) > ret = register_trace_sys_enter(ftrace_syscall_enter, tr); > if (!ret) { > - rcu_assign_pointer(tr->enter_syscall_files[num], file); > + WRITE_ONCE(tr->enter_syscall_files[num], file); > tr->sys_refcount_enter++; > } > mutex_unlock(&syscall_trace_lock); > @@ -411,7 +409,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file, > return; > mutex_lock(&syscall_trace_lock); > tr->sys_refcount_enter--; > - RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL); > + WRITE_ONCE(tr->enter_syscall_files[num], NULL); > if (!tr->sys_refcount_enter) > unregister_trace_sys_enter(ftrace_syscall_enter, tr); > mutex_unlock(&syscall_trace_lock); > @@ -431,7 +429,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file, > if (!tr->sys_refcount_exit) > ret = register_trace_sys_exit(ftrace_syscall_exit, tr); > if (!ret) { > - rcu_assign_pointer(tr->exit_syscall_files[num], file); > + WRITE_ONCE(tr->exit_syscall_files[num], file); > tr->sys_refcount_exit++; > } > mutex_unlock(&syscall_trace_lock); > @@ -449,7 +447,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file, > return; > mutex_lock(&syscall_trace_lock); > tr->sys_refcount_exit--; > - RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL); > + WRITE_ONCE(tr->exit_syscall_files[num], NULL); > if (!tr->sys_refcount_exit) > unregister_trace_sys_exit(ftrace_syscall_exit, tr); > mutex_unlock(&syscall_trace_lock); > -- > 2.47.2 > >
Finally have time to get back to these patches. On Wed, 6 Aug 2025 11:39:57 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: > On Tue, Aug 05, 2025 at 03:26:47PM -0400, Steven Rostedt wrote: > > From: Steven Rostedt <rostedt@goodmis.org> > > > > The syscall events are pseudo events that hook to the raw syscalls. The > > ftrace_syscall_enter/exit() callback is called by the raw_syscall > > enter/exit tracepoints respectively whenever any of the syscall events are > > enabled. > > > > The trace_array has an array of syscall "files" that correspond to the > > system calls based on their __NR_SYSCALL number. The array is read and if > > there's a pointer to a trace_event_file then it is considered enabled and > > if it is NULL that syscall event is considered disabled. > > > > Currently it uses an rcu_dereference_sched() to get this pointer and a > > rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary > > as the file pointer will not go away outside the synchronization of the > > tracepoint logic itself. And this code adds no extra RCU synchronization > > that uses this. > > > > Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which > > is all they need. This will also allow this code to not depend on > > preemption being disabled as system call tracepoints are now allowed to > > fault. > > > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> > > >From an RCU-removal viewpoint: > > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Thanks for the review. I'm also removing the __rcu that triggered the bot. > > But is it possible to give some sort of warning just in case some creative > future developer figures out how to make the file pointer go away outside > of the synchronization of the tracepoint logic itself? That would be quite a big change, and since this is the core code to it, that future change should fix up this code as well. All the modification happens in this file so nothing should be hidden. If they do get it wrong, it should crash pretty amazingly if there's any testing ;-) -- Steve
© 2016 - 2025 Red Hat, Inc.