Hi all,
Today's linux-next merge of the rcu tree got a conflict in:
kernel/trace/trace_syscalls.c
between commit:
a544d9a66bdf ("tracing: Have syscall trace events read user space string")
from the ftrace tree and commit:
35587dbc58dd ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
from the rcu tree.
I fixed it up (Maybe - see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging. You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.
--
Cheers,
Stephen Rothwell
diff --cc kernel/trace/trace_syscalls.c
index e96d0063cbcf,3f699b198c56..000000000000
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
* buffer and per-cpu data require preemption to be disabled.
*/
might_fault();
+ preempt_rt_guard();
- guard(preempt_notrace)();
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
On Fri, 14 Nov 2025 13:52:26 +1100
Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> Hi all,
>
> Today's linux-next merge of the rcu tree got a conflict in:
>
> kernel/trace/trace_syscalls.c
>
> between commit:
>
> a544d9a66bdf ("tracing: Have syscall trace events read user space string")
>
> from the ftrace tree and commit:
>
> 35587dbc58dd ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
>
> from the rcu tree.
>
> I fixed it up (Maybe - see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging. You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
Thanks for the update.
>
> diff --cc kernel/trace/trace_syscalls.c
> index e96d0063cbcf,3f699b198c56..000000000000
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> * buffer and per-cpu data require preemption to be disabled.
> */
> might_fault();
> + preempt_rt_guard();
> - guard(preempt_notrace)();
My code made it so that preemption is not needed here but is moved later
down for the logic that does the reading of user space data.
Note, it must have preemption disabled for all configs (including RT).
Otherwise, the data it has can get corrupted.
Paul, can you change it so that you *do not* touch this file?
Thanks,
-- Steve
>
> syscall_nr = trace_get_syscall_nr(current, regs);
> if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
On Fri, Nov 14, 2025 at 07:42:55AM -0500, Steven Rostedt wrote:
> On Fri, 14 Nov 2025 13:52:26 +1100
> Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> > Hi all,
> >
> > Today's linux-next merge of the rcu tree got a conflict in:
> >
> > kernel/trace/trace_syscalls.c
> >
> > between commit:
> >
> > a544d9a66bdf ("tracing: Have syscall trace events read user space string")
> >
> > from the ftrace tree and commit:
> >
> > 35587dbc58dd ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
> >
> > from the rcu tree.
> >
> > I fixed it up (Maybe - see below) and can carry the fix as necessary. This
> > is now fixed as far as linux-next is concerned, but any non trivial
> > conflicts should be mentioned to your upstream maintainer when your tree
> > is submitted for merging. You may also want to consider cooperating
> > with the maintainer of the conflicting tree to minimise any particularly
> > complex conflicts.
>
> Thanks for the update.
>
> >
>
> > diff --cc kernel/trace/trace_syscalls.c
> > index e96d0063cbcf,3f699b198c56..000000000000
> > --- a/kernel/trace/trace_syscalls.c
> > +++ b/kernel/trace/trace_syscalls.c
> > @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> > * buffer and per-cpu data require preemption to be disabled.
> > */
> > might_fault();
> > + preempt_rt_guard();
> > - guard(preempt_notrace)();
>
> My code made it so that preemption is not needed here but is moved later
> down for the logic that does the reading of user space data.
>
> Note, it must have preemption disabled for all configs (including RT).
> Otherwise, the data it has can get corrupted.
>
> Paul, can you change it so that you *do not* touch this file?
I could, but I believe that this would re-introduce the migration failure.
Maybe we should just defer this until both your patch and the RCU
stack hit mainline, and port on top of those? Perhaps later in the
merge window?
I believe that migration needs to be disabled at this point, but I am
again adding Yonghong on CC for his perspective.
Thanx, Paul
> Thanks,
>
> -- Steve
>
>
> >
> > syscall_nr = trace_get_syscall_nr(current, regs);
> > if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
On Fri, Nov 14, 2025 at 09:05:49AM -0800, Paul E. McKenney wrote:
> On Fri, Nov 14, 2025 at 07:42:55AM -0500, Steven Rostedt wrote:
> > On Fri, 14 Nov 2025 13:52:26 +1100
> > Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> >
> > > Hi all,
> > >
> > > Today's linux-next merge of the rcu tree got a conflict in:
> > >
> > > kernel/trace/trace_syscalls.c
> > >
> > > between commit:
> > >
> > > a544d9a66bdf ("tracing: Have syscall trace events read user space string")
> > >
> > > from the ftrace tree and commit:
> > >
> > > 35587dbc58dd ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
> > >
> > > from the rcu tree.
> > >
> > > I fixed it up (Maybe - see below) and can carry the fix as necessary. This
> > > is now fixed as far as linux-next is concerned, but any non trivial
> > > conflicts should be mentioned to your upstream maintainer when your tree
> > > is submitted for merging. You may also want to consider cooperating
> > > with the maintainer of the conflicting tree to minimise any particularly
> > > complex conflicts.
> >
> > Thanks for the update.
> >
> > >
> >
> > > diff --cc kernel/trace/trace_syscalls.c
> > > index e96d0063cbcf,3f699b198c56..000000000000
> > > --- a/kernel/trace/trace_syscalls.c
> > > +++ b/kernel/trace/trace_syscalls.c
> > > @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> > > * buffer and per-cpu data require preemption to be disabled.
> > > */
> > > might_fault();
> > > + preempt_rt_guard();
> > > - guard(preempt_notrace)();
> >
> > My code made it so that preemption is not needed here but is moved later
> > down for the logic that does the reading of user space data.
> >
> > Note, it must have preemption disabled for all configs (including RT).
> > Otherwise, the data it has can get corrupted.
> >
> > Paul, can you change it so that you *do not* touch this file?
>
> I could, but I believe that this would re-introduce the migration failure.
>
> Maybe we should just defer this until both your patch and the RCU
> stack hit mainline, and port on top of those? Perhaps later in the
> merge window?
>
> I believe that migration needs to be disabled at this point, but I am
> again adding Yonghong on CC for his perspective.
In any case, here is what I end up with. I will expose this to kernel
test robot, but keep it out of -next for the time being.
Thoughts?
Thanx, Paul
------------------------------------------------------------------------
commit fca6fa23c5a597e9a775babaadb8bed0c1e76010
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Wed Jul 16 12:34:26 2025 -0700
tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
The current use of guard(preempt_notrace)() within __DECLARE_TRACE()
to protect invocation of __DO_TRACE_CALL() means that BPF programs
attached to tracepoints are non-preemptible. This is unhelpful in
real-time systems, whose users apparently wish to use BPF while also
achieving low latencies. (Who knew?)
One option would be to use preemptible RCU, but this introduces
many opportunities for infinite recursion, which many consider to
be counterproductive, especially given the relatively small stacks
provided by the Linux kernel. These opportunities could be shut down
by sufficiently energetic duplication of code, but this sort of thing
is considered impolite in some circles.
Therefore, use the shiny new SRCU-fast API, which provides somewhat faster
readers than those of preemptible RCU, at least on Paul E. McKenney's
laptop, where task_struct access is more expensive than access to per-CPU
variables. And SRCU-fast provides way faster readers than does SRCU,
courtesy of being able to avoid the read-side use of smp_mb(). Also,
it is quite straightforward to create srcu_read_{,un}lock_fast_notrace()
functions.
While in the area, SRCU now supports early boot call_srcu(). Therefore,
remove the checks that used to avoid such use from rcu_free_old_probes()
before this commit was applied:
e53244e2c893 ("tracepoint: Remove SRCU protection")
The current commit can be thought of as an approximate revert of that
commit, with some compensating additions of preemption disabling.
This preemption disabling uses guard(preempt_notrace)().
However, Yonghong Song points out that BPF assumes that non-sleepable
BPF programs will remain on the same CPU, which means that migration
must be disabled whenever preemption remains enabled. In addition,
non-RT kernels have performance expectations that would be violated by
allowing the BPF programs to be preempted.
Therefore, continue to disable preemption in non-RT kernels, and protect
the BPF program with both SRCU and migration disabling for RT kernels,
and even then only if preemption is not already disabled.
[ paulmck: Apply kernel test robot and Yonghong Song feedback. ]
[ paulmck: Remove trace_syscalls.h changes per Steven Rostedt. ]
Link: https://lore.kernel.org/all/20250613152218.1924093-1-bigeasy@linutronix.de/
Signed-off-by: Steve Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 04307a19cde30..0a276e51d8557 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -221,6 +221,26 @@ static inline unsigned int tracing_gen_ctx_dec(void)
return trace_ctx;
}
+/*
+ * When PREEMPT_RT is enabled, trace events are called with disabled
+ * migration. The trace events need to know if the tracepoint disabled
+ * migration or not so that what is recorded to the ring buffer shows
+ * the state of when the trace event triggered, and not the state caused
+ * by the trace event.
+ */
+#ifdef CONFIG_PREEMPT_RT
+static inline unsigned int tracing_gen_ctx_dec_cond(void)
+{
+ unsigned int trace_ctx;
+
+ trace_ctx = tracing_gen_ctx_dec();
+ /* The migration counter starts at bit 4 */
+ return trace_ctx - (1 << 4);
+}
+#else
+# define tracing_gen_ctx_dec_cond() tracing_gen_ctx_dec()
+#endif
+
struct trace_event_file;
struct ring_buffer_event *
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 826ce3f8e1f85..5294110c3e84a 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -100,6 +100,25 @@ void for_each_tracepoint_in_module(struct module *mod,
}
#endif /* CONFIG_MODULES */
+/*
+ * BPF programs can attach to the tracepoint callbacks. But if the
+ * callbacks are called with preemption disabled, the BPF programs
+ * can cause quite a bit of latency. When PREEMPT_RT is enabled,
+ * instead of disabling preemption, use srcu_fast_notrace() for
+ * synchronization. As BPF programs that are attached to tracepoints
+ * expect to stay on the same CPU, also disable migration.
+ */
+#ifdef CONFIG_PREEMPT_RT
+extern struct srcu_struct tracepoint_srcu;
+# define tracepoint_sync() synchronize_srcu(&tracepoint_srcu);
+# define tracepoint_guard() \
+ guard(srcu_fast_notrace)(&tracepoint_srcu); \
+ guard(migrate)()
+#else
+# define tracepoint_sync() synchronize_rcu();
+# define tracepoint_guard() guard(preempt_notrace)()
+#endif
+
/*
* tracepoint_synchronize_unregister must be called between the last tracepoint
* probe unregistration and the end of module exit to make sure there is no
@@ -115,7 +134,7 @@ void for_each_tracepoint_in_module(struct module *mod,
static inline void tracepoint_synchronize_unregister(void)
{
synchronize_rcu_tasks_trace();
- synchronize_rcu();
+ tracepoint_sync();
}
static inline bool tracepoint_is_faultable(struct tracepoint *tp)
{
@@ -266,12 +285,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
return static_branch_unlikely(&__tracepoint_##name.key);\
}
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto) \
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto) \
__DECLARE_TRACE_COMMON(name, PARAMS(proto), PARAMS(args), PARAMS(data_proto)) \
static inline void __do_trace_##name(proto) \
{ \
if (cond) { \
- guard(preempt_notrace)(); \
+ tracepoint_guard(); \
__DO_TRACE_CALL(name, TP_ARGS(args)); \
} \
} \
diff --git a/include/trace/perf.h b/include/trace/perf.h
index a1754b73a8f55..348ad1d9b5566 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -71,6 +71,7 @@ perf_trace_##call(void *__data, proto) \
u64 __count __attribute__((unused)); \
struct task_struct *__task __attribute__((unused)); \
\
+ guard(preempt_notrace)(); \
do_perf_trace_##call(__data, args); \
}
@@ -85,9 +86,8 @@ perf_trace_##call(void *__data, proto) \
struct task_struct *__task __attribute__((unused)); \
\
might_fault(); \
- preempt_disable_notrace(); \
+ guard(preempt_notrace)(); \
do_perf_trace_##call(__data, args); \
- preempt_enable_notrace(); \
}
/*
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 4f22136fd4656..6fb58387e9f15 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -429,6 +429,22 @@ do_trace_event_raw_event_##call(void *__data, proto) \
trace_event_buffer_commit(&fbuffer); \
}
+/*
+ * When PREEMPT_RT is enabled, the tracepoint does not disable preemption
+ * but instead disables migration. The callbacks for the trace events
+ * need to have a consistent state so that it can reflect the proper
+ * preempt_disabled counter.
+ */
+#ifdef CONFIG_PREEMPT_RT
+/* disable preemption for RT so that the counters still match */
+# define trace_event_guard() guard(preempt_notrace)()
+/* Have syscalls up the migrate disable counter to emulate non-syscalls */
+# define trace_syscall_event_guard() guard(migrate)()
+#else
+# define trace_event_guard()
+# define trace_syscall_event_guard()
+#endif
+
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
@@ -436,6 +452,7 @@ __DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
static notrace void \
trace_event_raw_event_##call(void *__data, proto) \
{ \
+ trace_event_guard(); \
do_trace_event_raw_event_##call(__data, args); \
}
@@ -447,9 +464,9 @@ static notrace void \
trace_event_raw_event_##call(void *__data, proto) \
{ \
might_fault(); \
- preempt_disable_notrace(); \
+ trace_syscall_event_guard(); \
+ guard(preempt_notrace)(); \
do_trace_event_raw_event_##call(__data, args); \
- preempt_enable_notrace(); \
}
/*
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index e00da4182deb7..000665649fcb1 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -659,13 +659,7 @@ void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
trace_event_ignore_this_pid(trace_file))
return NULL;
- /*
- * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
- * preemption (adding one to the preempt_count). Since we are
- * interested in the preempt_count at the time the tracepoint was
- * hit, we need to subtract one to offset the increment.
- */
- fbuffer->trace_ctx = tracing_gen_ctx_dec();
+ fbuffer->trace_ctx = tracing_gen_ctx_dec_cond();
fbuffer->trace_file = trace_file;
fbuffer->event =
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 62719d2941c90..6a6bcf86bfbed 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -25,6 +25,12 @@ enum tp_func_state {
extern tracepoint_ptr_t __start___tracepoints_ptrs[];
extern tracepoint_ptr_t __stop___tracepoints_ptrs[];
+/* In PREEMPT_RT, SRCU is used to protect the tracepoint callbacks */
+#ifdef CONFIG_PREEMPT_RT
+DEFINE_SRCU_FAST(tracepoint_srcu);
+EXPORT_SYMBOL_GPL(tracepoint_srcu);
+#endif
+
enum tp_transition_sync {
TP_TRANSITION_SYNC_1_0_1,
TP_TRANSITION_SYNC_N_2_1,
@@ -34,6 +40,7 @@ enum tp_transition_sync {
struct tp_transition_snapshot {
unsigned long rcu;
+ unsigned long srcu_gp;
bool ongoing;
};
@@ -46,6 +53,9 @@ static void tp_rcu_get_state(enum tp_transition_sync sync)
/* Keep the latest get_state snapshot. */
snapshot->rcu = get_state_synchronize_rcu();
+#ifdef CONFIG_PREEMPT_RT
+ snapshot->srcu_gp = start_poll_synchronize_srcu(&tracepoint_srcu);
+#endif
snapshot->ongoing = true;
}
@@ -56,6 +66,10 @@ static void tp_rcu_cond_sync(enum tp_transition_sync sync)
if (!snapshot->ongoing)
return;
cond_synchronize_rcu(snapshot->rcu);
+#ifdef CONFIG_PREEMPT_RT
+ if (!poll_state_synchronize_srcu(&tracepoint_srcu, snapshot->srcu_gp))
+ synchronize_srcu(&tracepoint_srcu);
+#endif
snapshot->ongoing = false;
}
@@ -101,10 +115,22 @@ static inline void *allocate_probes(int count)
return p == NULL ? NULL : p->probes;
}
+#ifdef CONFIG_PREEMPT_RT
+static void srcu_free_old_probes(struct rcu_head *head)
+{
+ kfree(container_of(head, struct tp_probes, rcu));
+}
+
+static void rcu_free_old_probes(struct rcu_head *head)
+{
+ call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
+}
+#else
static void rcu_free_old_probes(struct rcu_head *head)
{
kfree(container_of(head, struct tp_probes, rcu));
}
+#endif
static inline void release_probes(struct tracepoint *tp, struct tracepoint_func *old)
{
@@ -112,6 +138,13 @@ static inline void release_probes(struct tracepoint *tp, struct tracepoint_func
struct tp_probes *tp_probes = container_of(old,
struct tp_probes, probes[0]);
+ /*
+ * Tracepoint probes are protected by either RCU or
+ * Tasks Trace RCU and also by SRCU. By calling the SRCU
+ * callback in the [Tasks Trace] RCU callback we cover
+ * both cases. So let us chain the SRCU and [Tasks Trace]
+ * RCU callbacks to wait for both grace periods.
+ */
if (tracepoint_is_faultable(tp))
call_rcu_tasks_trace(&tp_probes->rcu, rcu_free_old_probes);
else
On 11/14/25 9:05 AM, Paul E. McKenney wrote:
> On Fri, Nov 14, 2025 at 07:42:55AM -0500, Steven Rostedt wrote:
>> On Fri, 14 Nov 2025 13:52:26 +1100
>> Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>>
>>> Hi all,
>>>
>>> Today's linux-next merge of the rcu tree got a conflict in:
>>>
>>> kernel/trace/trace_syscalls.c
>>>
>>> between commit:
>>>
>>> a544d9a66bdf ("tracing: Have syscall trace events read user space string")
>>>
>>> from the ftrace tree and commit:
>>>
>>> 35587dbc58dd ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
>>>
>>> from the rcu tree.
>>>
>>> I fixed it up (Maybe - see below) and can carry the fix as necessary. This
>>> is now fixed as far as linux-next is concerned, but any non trivial
>>> conflicts should be mentioned to your upstream maintainer when your tree
>>> is submitted for merging. You may also want to consider cooperating
>>> with the maintainer of the conflicting tree to minimise any particularly
>>> complex conflicts.
>> Thanks for the update.
>>
>>> diff --cc kernel/trace/trace_syscalls.c
>>> index e96d0063cbcf,3f699b198c56..000000000000
>>> --- a/kernel/trace/trace_syscalls.c
>>> +++ b/kernel/trace/trace_syscalls.c
>>> @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
>>> * buffer and per-cpu data require preemption to be disabled.
>>> */
>>> might_fault();
>>> + preempt_rt_guard();
>>> - guard(preempt_notrace)();
>> My code made it so that preemption is not needed here but is moved later
>> down for the logic that does the reading of user space data.
>>
>> Note, it must have preemption disabled for all configs (including RT).
>> Otherwise, the data it has can get corrupted.
>>
>> Paul, can you change it so that you *do not* touch this file?
> I could, but I believe that this would re-introduce the migration failure.
>
> Maybe we should just defer this until both your patch and the RCU
> stack hit mainline, and port on top of those? Perhaps later in the
> merge window?
>
> I believe that migration needs to be disabled at this point, but I am
> again adding Yonghong on CC for his perspective.
Yes, migration needs to be disabled for rt kernel in order to let
bpf program running properly.
Regarding the non-rt kernel, currently preempt disable is used.
Is preempt disable just for the bpf program or for something else
as well? Certainly preempt disable can help improve bpf prog
performance. For the bpf prog itself, typically we can make do with
migration disable, but in some places we may have to add
preempt disable, e.g.,
https://lore.kernel.org/bpf/20251114064922.11650-1-chandna.sahil@gmail.com/T/#u
https://lore.kernel.org/bpf/20251112163148.100949-1-chen.dylane@linux.dev/T/#m556837a5987bc048b8b9bbcc6b50728c441c139f
>
> Thanx, Paul
>
>> Thanks,
>>
>> -- Steve
>>
>>
>>>
>>> syscall_nr = trace_get_syscall_nr(current, regs);
>>> if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
On 2025-11-14 10:31:45 [-0800], Yonghong Song wrote:
> > I believe that migration needs to be disabled at this point, but I am
> > again adding Yonghong on CC for his perspective.
>
> Yes, migration needs to be disabled for rt kernel in order to let
> bpf program running properly.
Why is disabling migration special in regard to RT kernels vs !RT?
Why do we need to disable migration given that bpf_prog_run_array()
already does that? Is there a different entry point?
My point why is it required to disable migration on trace-point entry
for BPF given that the BPF-entry already does so.
Sebastian
On 11/17/25 11:35 PM, Sebastian Andrzej Siewior wrote:
> On 2025-11-14 10:31:45 [-0800], Yonghong Song wrote:
>>> I believe that migration needs to be disabled at this point, but I am
>>> again adding Yonghong on CC for his perspective.
>> Yes, migration needs to be disabled for rt kernel in order to let
>> bpf program running properly.
> Why is disabling migration special in regard to RT kernels vs !RT?
> Why do we need to disable migration given that bpf_prog_run_array()
> already does that? Is there a different entry point?
bpf_prog_run_array() has two callers. One is trace_call_bpf() in
kernel/trace/bpf_trace.c, and the other is lirc_bpf_run() in
drivers/media/rc/bpf-lirc.c. The migration disable/enable is
needed for lirc_bpf_run().
> My point why is it required to disable migration on trace-point entry
> for BPF given that the BPF-entry already does so.
In trace_call_bpf(), we have
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) { ... }
So migrate_disable() is necessary.
>
> Sebastian
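As a rough illustration of the recursion guard Yonghong refers to above (a
hypothetical sketch, not the actual kernel/trace/bpf_trace.c code;
my_prog_active and run_prog_pinned() are made-up names), a per-CPU counter
like bpf_prog_active only rejects re-entry correctly if the increment and
the decrement happen on the same CPU:

/* Hypothetical sketch of the per-CPU recursion-guard pattern. */
static DEFINE_PER_CPU(int, my_prog_active);

static void run_prog_pinned(void)
{
	migrate_disable();	/* keep the inc and the dec on one CPU */
	if (__this_cpu_inc_return(my_prog_active) != 1)
		goto out;	/* already active on this CPU: skip */
	/* ... invoke the program ... */
out:
	__this_cpu_dec(my_prog_active);
	migrate_enable();
}

If the task could migrate between the increment and the decrement, the
first CPU's counter would stay stuck at 1 and the second CPU's would go
negative, which is why either migration or preemption has to stay disabled
around the whole section.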
On Tue, Nov 18, 2025 at 08:35:08AM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-14 10:31:45 [-0800], Yonghong Song wrote:
> > > I believe that migration needs to be disabled at this point, but I am
> > > again adding Yonghong on CC for his perspective.
> >
> > Yes, migration needs to be disabled for rt kernel in order to let
> > bpf program running properly.
>
> Why is disabling migration special in regard to RT kernels vs !RT?
> Why do we need to disable migration given that bpf_prog_run_array()
> already does that? Is there a different entry point?
> My point why is it required to disable migration on trace-point entry
> for BPF given that the BPF-entry already does so.
When I tried doing without that disabling some weeks back, it broke.
Maybe things have changed since, but I must defer to Yonghong &c.
Thanx, Paul
Le Fri, Nov 14, 2025 at 07:42:55AM -0500, Steven Rostedt a écrit :
> On Fri, 14 Nov 2025 13:52:26 +1100
> Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> > Hi all,
> >
> > Today's linux-next merge of the rcu tree got a conflict in:
> >
> > kernel/trace/trace_syscalls.c
> >
> > between commit:
> >
> > a544d9a66bdf ("tracing: Have syscall trace events read user space string")
> >
> > from the ftrace tree and commit:
> >
> > 35587dbc58dd ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
> >
> > from the rcu tree.
> >
> > I fixed it up (Maybe - see below) and can carry the fix as necessary. This
> > is now fixed as far as linux-next is concerned, but any non trivial
> > conflicts should be mentioned to your upstream maintainer when your tree
> > is submitted for merging. You may also want to consider cooperating
> > with the maintainer of the conflicting tree to minimise any particularly
> > complex conflicts.
>
> Thanks for the update.
>
> >
>
> > diff --cc kernel/trace/trace_syscalls.c
> > index e96d0063cbcf,3f699b198c56..000000000000
> > --- a/kernel/trace/trace_syscalls.c
> > +++ b/kernel/trace/trace_syscalls.c
> > @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> > * buffer and per-cpu data require preemption to be disabled.
> > */
> > might_fault();
> > + preempt_rt_guard();
> > - guard(preempt_notrace)();
>
> My code made it so that preemption is not needed here but is moved later
> down for the logic that does the reading of user space data.
>
> Note, it must have preemption disabled for all configs (including RT).
> Otherwise, the data it has can get corrupted.
>
> Paul, can you change it so that you *do not* touch this file?
Ok, I've zapped the commit for now until we sort this out.
Thanks.
--
Frederic Weisbecker
SUSE Labs
On Fri, Nov 14, 2025 at 03:48:52PM +0100, Frederic Weisbecker wrote:
> Le Fri, Nov 14, 2025 at 07:42:55AM -0500, Steven Rostedt a écrit :
> > On Fri, 14 Nov 2025 13:52:26 +1100
> > Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> >
> > > Hi all,
> > >
> > > Today's linux-next merge of the rcu tree got a conflict in:
> > >
> > > kernel/trace/trace_syscalls.c
> > >
> > > between commit:
> > >
> > > a544d9a66bdf ("tracing: Have syscall trace events read user space string")
> > >
> > > from the ftrace tree and commit:
> > >
> > > 35587dbc58dd ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
> > >
> > > from the rcu tree.
> > >
> > > I fixed it up (Maybe - see below) and can carry the fix as necessary. This
> > > is now fixed as far as linux-next is concerned, but any non trivial
> > > conflicts should be mentioned to your upstream maintainer when your tree
> > > is submitted for merging. You may also want to consider cooperating
> > > with the maintainer of the conflicting tree to minimise any particularly
> > > complex conflicts.
> >
> > Thanks for the update.
> >
> > >
> >
> > > diff --cc kernel/trace/trace_syscalls.c
> > > index e96d0063cbcf,3f699b198c56..000000000000
> > > --- a/kernel/trace/trace_syscalls.c
> > > +++ b/kernel/trace/trace_syscalls.c
> > > @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> > > * buffer and per-cpu data require preemption to be disabled.
> > > */
> > > might_fault();
> > > + preempt_rt_guard();
> > > - guard(preempt_notrace)();
> >
> > My code made it so that preemption is not needed here but is moved later
> > down for the logic that does the reading of user space data.
> >
> > Note, it must have preemption disabled for all configs (including RT).
> > Otherwise, the data it has can get corrupted.
> >
> > Paul, can you change it so that you *do not* touch this file?
>
> Ok, I've zapped the commit for now until we sort this out.
Thank you, Frederic, and I guess putting this in -next did indeed find
some problems, so that is good? ;-)
Thanx, Paul
Le Fri, Nov 14, 2025 at 09:06:29AM -0800, Paul E. McKenney a écrit :
> On Fri, Nov 14, 2025 at 03:48:52PM +0100, Frederic Weisbecker wrote:
> > Le Fri, Nov 14, 2025 at 07:42:55AM -0500, Steven Rostedt a écrit :
> > > On Fri, 14 Nov 2025 13:52:26 +1100
> > > Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Today's linux-next merge of the rcu tree got a conflict in:
> > > >
> > > > kernel/trace/trace_syscalls.c
> > > >
> > > > between commit:
> > > >
> > > > a544d9a66bdf ("tracing: Have syscall trace events read user space string")
> > > >
> > > > from the ftrace tree and commit:
> > > >
> > > > 35587dbc58dd ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
> > > >
> > > > from the rcu tree.
> > > >
> > > > I fixed it up (Maybe - see below) and can carry the fix as necessary. This
> > > > is now fixed as far as linux-next is concerned, but any non trivial
> > > > conflicts should be mentioned to your upstream maintainer when your tree
> > > > is submitted for merging. You may also want to consider cooperating
> > > > with the maintainer of the conflicting tree to minimise any particularly
> > > > complex conflicts.
> > >
> > > Thanks for the update.
> > >
> > > >
> > >
> > > > diff --cc kernel/trace/trace_syscalls.c
> > > > index e96d0063cbcf,3f699b198c56..000000000000
> > > > --- a/kernel/trace/trace_syscalls.c
> > > > +++ b/kernel/trace/trace_syscalls.c
> > > > @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> > > > * buffer and per-cpu data require preemption to be disabled.
> > > > */
> > > > might_fault();
> > > > + preempt_rt_guard();
> > > > - guard(preempt_notrace)();
> > >
> > > My code made it so that preemption is not needed here but is moved later
> > > down for the logic that does the reading of user space data.
> > >
> > > Note, it must have preemption disabled for all configs (including RT).
> > > Otherwise, the data it has can get corrupted.
> > >
> > > Paul, can you change it so that you *do not* touch this file?
> >
> > Ok, I've zapped the commit for now until we sort this out.
>
> Thank you, Frederic, and I guess putting this in -next did indeed find
> some problems, so that is good? ;-)
Indeed, mission accomplished ;-)
Steve proposed here to actually restore the patch:
https://lore.kernel.org/lkml/20251114110136.3d36deca@gandalf.local.home/
But later said the reverse:
https://lore.kernel.org/lkml/20251114121141.5e40428d@gandalf.local.home/
So for now I'm still keeping it outside -next. I hope it is not a necessary
change in your srcu series?
Thanks.
--
Frederic Weisbecker
SUSE Labs
On Tue, Nov 18, 2025 at 02:05:03PM +0100, Frederic Weisbecker wrote:
> Le Fri, Nov 14, 2025 at 09:06:29AM -0800, Paul E. McKenney a écrit :
> > On Fri, Nov 14, 2025 at 03:48:52PM +0100, Frederic Weisbecker wrote:
> > > Le Fri, Nov 14, 2025 at 07:42:55AM -0500, Steven Rostedt a écrit :
> > > > On Fri, 14 Nov 2025 13:52:26 +1100
> > > > Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Today's linux-next merge of the rcu tree got a conflict in:
> > > > >
> > > > > kernel/trace/trace_syscalls.c
> > > > >
> > > > > between commit:
> > > > >
> > > > > a544d9a66bdf ("tracing: Have syscall trace events read user space string")
> > > > >
> > > > > from the ftrace tree and commit:
> > > > >
> > > > > 35587dbc58dd ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
> > > > >
> > > > > from the rcu tree.
> > > > >
> > > > > I fixed it up (Maybe - see below) and can carry the fix as necessary. This
> > > > > is now fixed as far as linux-next is concerned, but any non trivial
> > > > > conflicts should be mentioned to your upstream maintainer when your tree
> > > > > is submitted for merging. You may also want to consider cooperating
> > > > > with the maintainer of the conflicting tree to minimise any particularly
> > > > > complex conflicts.
> > > >
> > > > Thanks for the update.
> > > >
> > > > >
> > > >
> > > > > diff --cc kernel/trace/trace_syscalls.c
> > > > > index e96d0063cbcf,3f699b198c56..000000000000
> > > > > --- a/kernel/trace/trace_syscalls.c
> > > > > +++ b/kernel/trace/trace_syscalls.c
> > > > > @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> > > > > * buffer and per-cpu data require preemption to be disabled.
> > > > > */
> > > > > might_fault();
> > > > > + preempt_rt_guard();
> > > > > - guard(preempt_notrace)();
> > > >
> > > > My code made it so that preemption is not needed here but is moved later
> > > > down for the logic that does the reading of user space data.
> > > >
> > > > Note, it must have preemption disabled for all configs (including RT).
> > > > Otherwise, the data it has can get corrupted.
> > > >
> > > > Paul, can you change it so that you *do not* touch this file?
> > >
> > > Ok, I've zapped the commit for now until we sort this out.
> >
> > Thank you, Frederic, and I guess putting this in -next did indeed find
> > some problems, so that is good? ;-)
>
> Indeed, mission accomplished ;-)
>
> Steve proposed here to actually restore the patch:
>
> https://lore.kernel.org/lkml/20251114110136.3d36deca@gandalf.local.home/
>
> But later said the reverse:
>
> https://lore.kernel.org/lkml/20251114121141.5e40428d@gandalf.local.home/
>
> So for now I'm still keeping it outside -next. I hope it is not a necessary
> change in your srcu series?
My thought is to put the patch with Steven's suggested removal on my
-rcu stack and see what kernel test robot thinks of it. ;-)
Thanx, Paul
On Tue, Nov 18, 2025 at 07:04:07AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 18, 2025 at 02:05:03PM +0100, Frederic Weisbecker wrote:
> > Le Fri, Nov 14, 2025 at 09:06:29AM -0800, Paul E. McKenney a écrit :
[ . . . ]
> > > Thank you, Frederic, and I guess putting this in -next did indeed find
> > > some problems, so that is good? ;-)
> >
> > Indeed, mission accomplished ;-)
> >
> > Steve proposed here to actually restore the patch:
> >
> > https://lore.kernel.org/lkml/20251114110136.3d36deca@gandalf.local.home/
> >
> > But later said the reverse:
> >
> > https://lore.kernel.org/lkml/20251114121141.5e40428d@gandalf.local.home/
> >
> > So for now I'm still keeping it outside -next. I hope it is not a necessary
> > change in your srcu series?
>
> My thought is to put the patch with Steven's suggested removal on my
> -rcu stack and see what kernel test robot thinks of it. ;-)
Unless I hear otherwise, I will push this into -next after the RCU
patches land. If all goes well, I will send the pull request to Linus.
So please let me know if you would prefer some other course of action.
Thanx, Paul
On Mon, Dec 01, 2025 at 04:57:54PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 18, 2025 at 07:04:07AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 18, 2025 at 02:05:03PM +0100, Frederic Weisbecker wrote:
[ . . . ]
> > > So for now I'm still keeping it outside -next. I hope it is not a necessary
> > > change in your srcu series?
> >
> > My thought is to put the patch with Steven's suggested removal on my
> > -rcu stack and see what kernel test robot thinks of it. ;-)
>
> Unless I hear otherwise, I will push this into -next after the RCU
> patches land. If all goes well, I will send the pull request to Linus.
> So please let me know if you would prefer some other course of action.
If I continue to hear no objections in the next 20 hours or so, I will
push this into -next:
fca6fa23c5a5 ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
Thanx, Paul
On Sun, 7 Dec 2025 12:43:32 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:
> If I continue to hear no objections in the next 20 hours or so, I will
> push this into -next:
>
> fca6fa23c5a5 ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
Hi Paul,
Can you repost the patch as a normal patch (start its own thread) so
that I can look at it separately. I'm currently in Tokyo so I'll likely
get distracted a lot this week.
-- Steve
On Sun, Dec 07, 2025 at 07:17:56PM -0500, Steven Rostedt wrote:
> On Sun, 7 Dec 2025 12:43:32 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > If I continue to hear no objections in the next 20 hours or so, I will
> > push this into -next:
> >
> > fca6fa23c5a5 ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
>
> Hi Paul,
>
> Can you repost the patch as a normal patch (start its own thread) so
> that I can look at it separately. I'm currently in Tokyo so I'll likely
> get distracted a lot this week.
Understood. On its way to an inbox near you:
Message-ID: <e2fe3162-4b7b-44d6-91ff-f439b3dce706@paulmck-laptop>
Thanx, Paul
On Fri, Nov 14, 2025 at 09:06:29AM -0800, Paul E. McKenney wrote:
> On Fri, Nov 14, 2025 at 03:48:52PM +0100, Frederic Weisbecker wrote:
> > Le Fri, Nov 14, 2025 at 07:42:55AM -0500, Steven Rostedt a écrit :
> > > On Fri, 14 Nov 2025 13:52:26 +1100
> > > Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Today's linux-next merge of the rcu tree got a conflict in:
> > > >
> > > > kernel/trace/trace_syscalls.c
> > > >
> > > > between commit:
> > > >
> > > > a544d9a66bdf ("tracing: Have syscall trace events read user space string")
> > > >
> > > > from the ftrace tree and commit:
> > > >
> > > > 35587dbc58dd ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
> > > >
> > > > from the rcu tree.
> > > >
> > > > I fixed it up (Maybe - see below) and can carry the fix as necessary. This
> > > > is now fixed as far as linux-next is concerned, but any non trivial
> > > > conflicts should be mentioned to your upstream maintainer when your tree
> > > > is submitted for merging. You may also want to consider cooperating
> > > > with the maintainer of the conflicting tree to minimise any particularly
> > > > complex conflicts.
> > >
> > > Thanks for the update.
> > >
> > > >
> > >
> > > > diff --cc kernel/trace/trace_syscalls.c
> > > > index e96d0063cbcf,3f699b198c56..000000000000
> > > > --- a/kernel/trace/trace_syscalls.c
> > > > +++ b/kernel/trace/trace_syscalls.c
> > > > @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> > > > * buffer and per-cpu data require preemption to be disabled.
> > > > */
> > > > might_fault();
> > > > + preempt_rt_guard();
> > > > - guard(preempt_notrace)();
> > >
> > > My code made it so that preemption is not needed here but is moved later
> > > down for the logic that does the reading of user space data.
> > >
> > > Note, it must have preemption disabled for all configs (including RT).
> > > Otherwise, the data it has can get corrupted.
> > >
> > > Paul, can you change it so that you *do not* touch this file?
> >
> > Ok, I've zapped the commit for now until we sort this out.
>
> Thank you, Frederic, and I guess putting this in -next did indeed find
> some problems, so that is good? ;-)
And in other more hopeful news, your -next stack (including this patch)
passed torture.sh testing on both ARM and x86 and 12 hours of TREE10
15*CFLIST testing on x86.
Thanx, Paul
On Fri, 14 Nov 2025 15:48:52 +0100
Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > Paul, can you change it so that you *do not* touch this file?
>
> Ok, I've zapped the commit for now until we sort this out.
You can put it back. It's actually code I wrote to make sure this
doesn't conflict.
The "preempt_rt_guard()" was supposed to be about "preempt_rt" and I
took it now as being "preempt" :-p
I need a vacation.
-- Steve
On 2025-11-14 07:42:55 [-0500], Steven Rostedt wrote:
> > diff --cc kernel/trace/trace_syscalls.c
> > index e96d0063cbcf,3f699b198c56..000000000000
> > --- a/kernel/trace/trace_syscalls.c
> > +++ b/kernel/trace/trace_syscalls.c
> > @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> > * buffer and per-cpu data require preemption to be disabled.
> > */
> > might_fault();
> > + preempt_rt_guard();
> > - guard(preempt_notrace)();
>
> My code made it so that preemption is not needed here but is moved later
> down for the logic that does the reading of user space data.
>
> Note, it must have preemption disabled for all configs (including RT).
> Otherwise, the data it has can get corrupted.
>
> Paul, can you change it so that you *do not* touch this file?
Where is preempt_rt_guard() from?
> Thanks,
>
> -- Steve
Sebastian
On Fri, 14 Nov 2025 14:35:32 +0100
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> On 2025-11-14 07:42:55 [-0500], Steven Rostedt wrote:
> > > diff --cc kernel/trace/trace_syscalls.c
> > > index e96d0063cbcf,3f699b198c56..000000000000
> > > --- a/kernel/trace/trace_syscalls.c
> > > +++ b/kernel/trace/trace_syscalls.c
> > > @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> > > * buffer and per-cpu data require preemption to be disabled.
> > > */
> > > might_fault();
> > > + preempt_rt_guard();
> > > - guard(preempt_notrace)();
> >
> > My code made it so that preemption is not needed here but is moved later
> > down for the logic that does the reading of user space data.
> >
> > Note, it must have preemption disabled for all configs (including RT).
> > Otherwise, the data it has can get corrupted.
> >
> > Paul, can you change it so that you *do not* touch this file?
>
> Where is preempt_rt_guard() from?
Ah, it's from the patch I submitted that has this:
+/*
+ * When PREEMPT_RT is enabled, it disables migration instead
+ * of preemption. The pseudo syscall trace events need to match
+ * so that the counter logic recorded into the ring buffer by
+ * trace_event_buffer_reserve() still matches what it expects.
+ */
+#ifdef CONFIG_PREEMPT_RT
+# define preempt_rt_guard() guard(migrate)()
+#else
+# define preempt_rt_guard()
+#endif
+
I must be getting old, as I forgot I wrote this :-p
I only saw the update from Stephen and thought it was disabling preemption.
It doesn't disable preemption, but is here to keep the latency
preempt_count counting the same in both PREEMPT_RT and non PREEMPT_RT. You
know, the stuff that shows up in the trace:
"d..4."
Paul, never mind, this code will not affect the code I added.
-- Steve
On 2025-11-14 10:46:33 [-0500], Steven Rostedt wrote:
> On Fri, 14 Nov 2025 14:35:32 +0100
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
> > On 2025-11-14 07:42:55 [-0500], Steven Rostedt wrote:
> > > > diff --cc kernel/trace/trace_syscalls.c
> > > > index e96d0063cbcf,3f699b198c56..000000000000
> > > > --- a/kernel/trace/trace_syscalls.c
> > > > +++ b/kernel/trace/trace_syscalls.c
> > > > @@@ -878,6 -322,8 +890,7 @@@ static void ftrace_syscall_enter(void *
> > > > * buffer and per-cpu data require preemption to be disabled.
> > > > */
> > > > might_fault();
> > > > + preempt_rt_guard();
> > > > - guard(preempt_notrace)();
> > >
> > > My code made it so that preemption is not needed here but is moved later
> > > down for the logic that does the reading of user space data.
> > >
> > > Note, it must have preemption disabled for all configs (including RT).
> > > Otherwise, the data it has can get corrupted.
> > >
> > > Paul, can you change it so that you *do not* touch this file?
> >
> > Where is preempt_rt_guard() from?
>
> Ah, it's from the patch I submitted that has this:
>
> +/*
> + * When PREEMPT_RT is enabled, it disables migration instead
> + * of preemption. The pseudo syscall trace events need to match
> + * so that the counter logic recorded into the ring buffer by
> + * trace_event_buffer_reserve() still matches what it expects.
> + */
> +#ifdef CONFIG_PREEMPT_RT
> +# define preempt_rt_guard() guard(migrate)()
> +#else
> +# define preempt_rt_guard()
> +#endif
> +
>
> I must be getting old, as I forgot I wrote this :-p
>
> I only saw the update from Stephen and thought it was disabling preemption.
but having both is kind of gross. Also the mapping from
preempt_rt_guard() to guard(migrate)() only on RT is kind of far.
> It doesn't disable preemption, but is here to keep the latency
> preempt_count counting the same in both PREEMPT_RT and non PREEMPT_RT. You
> know, the stuff that shows up in the trace:
>
> "d..4."
urgh.
We did that to match the reality with the tracer. Since the tracer
disabled preemption we decremented the counter from preempt_count to
record what was there before the trace point started changing it.
That was tracing_gen_ctx_dec(). Now I see we have something similar in
tracing_gen_ctx_dec_cond().
But why do we need to disable migration here? Why isn't !RT affected by
this. I remember someone had a trace where the NMI was set and migrate
disable was at max which sounds like someone decremented the
migrate_disable counter while migration wasn't disabled…
> Paul, never mind, this code will not affect the code I added.
>
> -- Steve
Sebastian
On Fri, 14 Nov 2025 17:00:17 +0100
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > It doesn't disable preemption, but is here to keep the latency
> > preempt_count counting the same in both PREEMPT_RT and non PREEMPT_RT. You
> > know, the stuff that shows up in the trace:
> >
> > "d..4."
>
> urgh.
>
> We did that to match the reality with the tracer. Since the tracer
> disabled preemption we decremented the counter from preempt_count to
> record what was there before the trace point started changing it.
> That was tracing_gen_ctx_dec(). Now I see we have something similar in
> tracing_gen_ctx_dec_cond().
> But why do we need to disable migration here? Why isn't !RT affected by
> this. I remember someone had a trace where the NMI was set and migrate
> disable was at max which sounds like someone decremented the
> migrate_disable counter while migration wasn't disabled…
It's to match this code:
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -100,6 +100,25 @@ void for_each_tracepoint_in_module(struct module *mod,
}
#endif /* CONFIG_MODULES */
+/*
+ * BPF programs can attach to the tracepoint callbacks. But if the
+ * callbacks are called with preemption disabled, the BPF programs
+ * can cause quite a bit of latency. When PREEMPT_RT is enabled,
+ * instead of disabling preemption, use srcu_fast_notrace() for
+ * synchronization. As BPF programs that are attached to tracepoints
+ * expect to stay on the same CPU, also disable migration.
+ */
+#ifdef CONFIG_PREEMPT_RT
+extern struct srcu_struct tracepoint_srcu;
+# define tracepoint_sync() synchronize_srcu(&tracepoint_srcu);
+# define tracepoint_guard() \
+ guard(srcu_fast_notrace)(&tracepoint_srcu); \
+ guard(migrate)()
+#else
+# define tracepoint_sync() synchronize_rcu();
+# define tracepoint_guard() guard(preempt_notrace)()
+#endif
+
Where in PREEMPT_RT we do not disable preemption around the tracepoint
callback, but in non RT we do. Instead it uses a srcu and migrate disable.
The migrate_disable in the syscall tracepoint (which gets called by the
system call version that doesn't disable migration, even in RT), needs to
disable migration so that the accounting that happens in:
trace_event_buffer_reserve()
matches what happens when that function gets called by a normal tracepoint
callback.
-- Steve
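To put Steve's point next to the hunks earlier in this thread (a worked
sketch of the accounting, not additional code): on PREEMPT_RT a regular
trace event reaches trace_event_buffer_reserve() with the migrate-disable
depth raised by one, because tracepoint_guard() took guard(migrate)(), and
tracing_gen_ctx_dec_cond() subtracts (1 << 4) so the recorded context
reflects the depth from before the tracepoint fired. The syscall pseudo
events are not wrapped by tracepoint_guard(), so without the
preempt_rt_guard()/guard(migrate)() in ftrace_syscall_enter() that
subtraction would report a migrate-disable depth one lower than it
actually was when the event triggered.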
On 2025-11-14 11:22:02 [-0500], Steven Rostedt wrote:
> It's to match this code:
>
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -100,6 +100,25 @@ void for_each_tracepoint_in_module(struct module *mod,
> }
> #endif /* CONFIG_MODULES */
>
> +/*
> + * BPF programs can attach to the tracepoint callbacks. But if the
> + * callbacks are called with preemption disabled, the BPF programs
> + * can cause quite a bit of latency. When PREEMPT_RT is enabled,
> + * instead of disabling preemption, use srcu_fast_notrace() for
> + * synchronization. As BPF programs that are attached to tracepoints
> + * expect to stay on the same CPU, also disable migration.
> + */
> +#ifdef CONFIG_PREEMPT_RT
> +extern struct srcu_struct tracepoint_srcu;
> +# define tracepoint_sync() synchronize_srcu(&tracepoint_srcu);
> +# define tracepoint_guard() \
> + guard(srcu_fast_notrace)(&tracepoint_srcu); \
> + guard(migrate)()
> +#else
> +# define tracepoint_sync() synchronize_rcu();
> +# define tracepoint_guard() guard(preempt_notrace)()
> +#endif
> +
>
> Where in PREEMPT_RT we do not disable preemption around the tracepoint
> callback, but in non RT we do. Instead it uses a srcu and migrate disable.
I appreciate the effort. I really do. But why can't we have SRCU on both
configs?
Also why does tracepoint_guard() need to disable migration? The BPF
program already disables migrations (see for instance
bpf_prog_run_array()).
This is true for RT and !RT. So there is no need to do it here.
> The migrate_disable in the syscall tracepoint (which gets called by the
> system call version that doesn't disable migration, even in RT), needs to
> disable migration so that the accounting that happens in:
>
> trace_event_buffer_reserve()
>
> matches what happens when that function gets called by a normal tracepoint
> callback.
buh. But this is something. If we know that the call chain does not
disable migration, couldn't we just use a different function? I mean we
have tracing_gen_ctx_dec() and tracing_gen_ctx(). Wouldn't this work
for migrate_disable(), too?
Just in case we need it and can not avoid it, see above.
> -- Steve
Sebastian
On Fri, Nov 14, 2025 at 05:33:30PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-14 11:22:02 [-0500], Steven Rostedt wrote:
> > It's to match this code:
> >
> > --- a/include/linux/tracepoint.h
> > +++ b/include/linux/tracepoint.h
> > @@ -100,6 +100,25 @@ void for_each_tracepoint_in_module(struct module *mod,
> > }
> > #endif /* CONFIG_MODULES */
> >
> > +/*
> > + * BPF programs can attach to the tracepoint callbacks. But if the
> > + * callbacks are called with preemption disabled, the BPF programs
> > + * can cause quite a bit of latency. When PREEMPT_RT is enabled,
> > + * instead of disabling preemption, use srcu_fast_notrace() for
> > + * synchronization. As BPF programs that are attached to tracepoints
> > + * expect to stay on the same CPU, also disable migration.
> > + */
> > +#ifdef CONFIG_PREEMPT_RT
> > +extern struct srcu_struct tracepoint_srcu;
> > +# define tracepoint_sync() synchronize_srcu(&tracepoint_srcu);
> > +# define tracepoint_guard() \
> > + guard(srcu_fast_notrace)(&tracepoint_srcu); \
> > + guard(migrate)()
> > +#else
> > +# define tracepoint_sync() synchronize_rcu();
> > +# define tracepoint_guard() guard(preempt_notrace)()
> > +#endif
> > +
> >
> > Where in PREEMPT_RT we do not disable preemption around the tracepoint
> > callback, but in non RT we do. Instead it uses a srcu and migrate disable.
>
> I appreciate the effort. I really do. But why can't we have SRCU on both
> configs?
Due to performance concerns for non-RT kernels and workloads, where we
really need preemption disabled.
> Also why does tracepoint_guard() need to disable migration? The BPF
> program already disables migrations (see for instance
> bpf_prog_run_array()).
> This is true for RT and !RT. So there is no need to do it here.
The addition of migration disabling was in response to failures, which
this fixed. Or at least greatly reduced the probability of! Let's see...
That migrate_disable() has been there since 2022, so the failures were
happening despite it. Adding Yonghong on CC for his perspective.
> > The migrate_disable in the syscall tracepoint (which gets called by the
> > system call version that doesn't disable migration, even in RT), needs to
> > disable migration so that the accounting that happens in:
> >
> > trace_event_buffer_reserve()
> >
> > matches what happens when that function gets called by a normal tracepoint
> > callback.
>
> buh. But this is something. If we know that the call chain does not
> disable migration, couldn't we just use a different function? I mean we
> have tracing_gen_ctx_dec() and tracing_gen_ctx(). Wouldn't this work
> for migrate_disable(), too?
> Just in case we need it and can not avoid it, see above.
On this, I must defer to the tracing experts. ;-)
Thanx, Paul
On 2025-11-14 09:00:21 [-0800], Paul E. McKenney wrote:
> > > Where in PREEMPT_RT we do not disable preemption around the tracepoint
> > > callback, but in non RT we do. Instead it uses a srcu and migrate disable.
> >
> > I appreciate the effort. I really do. But why can't we have SRCU on both
> > configs?
>
> Due to performance concerns for non-RT kernels and workloads, where we
> really need preemption disabled.
This means srcu_read_lock_notrace() is much more overhead compared to
rcu_read_lock_sched_notrace()?
I am a bit afraid of different bugs here and there.
> > Also why does tracepoint_guard() need to disable migration? The BPF
> > program already disables migrations (see for instance
> > bpf_prog_run_array()).
> > This is true for RT and !RT. So there is no need to do it here.
>
> The addition of migration disabling was in response to failures, which
> this fixed. Or at least greatly reduced the probability of! Let's see...
> That migrate_disable() has been there since 2022, so the failures were
> happening despite it. Adding Yonghong on CC for his perspective.
Okay. In general I would prefer that we know why we do it. BPF had
preempt_disable() which was turned into migrate_disable() for RT reasons
since remaining on the same CPU was enough and preempt_disable() was the
only way to enforce it at the time.
And I think Linus requested migrate_disable() to work regardless of RT
which PeterZ made happen (for different reasons, not BPF related).
> Thanx, Paul
Sebastian
On Fri, Nov 14, 2025 at 06:10:52PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-14 09:00:21 [-0800], Paul E. McKenney wrote:
> > > > Where in PREEMPT_RT we do not disable preemption around the tracepoint
> > > > callback, but in non RT we do. Instead it uses a srcu and migrate disable.
> > >
> > > I appreciate the effort. I really do. But why can't we have SRCU on both
> > > configs?
> >
> > Due to performance concerns for non-RT kernels and workloads, where we
> > really need preemption disabled.
>
> This means srcu_read_lock_notrace() is much more overhead compared to
> rcu_read_lock_sched_notrace()?
> I am a bit afraid of different bugs here and there.
No, the concern is instead overhead due to any actual preemption. So the
goal is to actually disable preemption across the BPF program *except*
in PREEMPT_RT kernels.
> > > Also why does tracepoint_guard() need to disable migration? The BPF
> > > program already disables migrations (see for instance
> > > bpf_prog_run_array()).
> > > This is true for RT and !RT. So there is no need to do it here.
> >
> > The addition of migration disabling was in response to failures, which
> > this fixed. Or at least greatly reduced the probability of! Let's see...
> > That migrate_disable() has been there since 2022, so the failures were
> > happening despite it. Adding Yonghong on CC for his perspective.
>
> Okay. In general I would prefer that we know why we do it. BPF had
> preempt_disable() which was turned into migrate_disable() for RT reasons
> since remaining on the same CPU was enough and preempt_disable() was the
> only way to enforce it at the time.
> And I think Linus requested migrate_disable() to work regardless of RT
> which PeterZ made happen (for different reasons, not BPF related).
Yes, migrate_disable() prevents migration either way, but it does not
prevent preemption, which is what was needed in non-PREEMPT_RT kernels
last I checked.
Thanx, Paul
On 2025-11-14 09:25:06 [-0800], Paul E. McKenney wrote:
> On Fri, Nov 14, 2025 at 06:10:52PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-11-14 09:00:21 [-0800], Paul E. McKenney wrote:
> > > > > Where in PREEMPT_RT we do not disable preemption around the tracepoint
> > > > > callback, but in non RT we do. Instead it uses a srcu and migrate disable.
> > > >
> > > > I appreciate the effort. I really do. But why can't we have SRCU on both
> > > > configs?
> > >
> > > Due to performance concerns for non-RT kernels and workloads, where we
> > > really need preemption disabled.
> >
> > This means srcu_read_lock_notrace() is much more overhead compared to
> > rcu_read_lock_sched_notrace()?
> > I am a bit afraid of different bugs here and there.
>
> No, the concern is instead overhead due to any actual preemption. So the
> goal is to actually disable preemption across the BPF program *except*
> in PREEMPT_RT kernels.
Overhead of actual preemption while the BPF callback of the trace-event
is invoked?
So we get rid of the preempt_disable() in the trace-point which we had
due rcu_read_lock_sched_notrace() and we need to preserve it because
preemption while the BPF program is invoked?
This is also something we want for CONFIG_PREEMPT (LAZY)?
Sorry to be verbose but I try to catch up.
The BPF invocation does not disable preemption for a long time. It
disables migration since some code uses per-CPU variables here.
For XDP kind of BPF invocations, preemption is disabled (except for RT)
because those run in NAPI/ softirq context.
> > > > Also why does tracepoint_guard() need to disable migration? The BPF
> > > > program already disables migrations (see for instance
> > > > bpf_prog_run_array()).
> > > > This is true for RT and !RT. So there is no need to do it here.
> > >
> > > The addition of migration disabling was in response to failures, which
> > > this fixed. Or at least greatly reduced the probability of! Let's see...
> > > That migrate_disable() has been there since 2022, so the failures were
> > > happening despite it. Adding Yonghong on CC for his perspective.
> >
> > Okay. In general I would prefer that we know why we do it. BPF had
> > preempt_disable() which was turned into migrate_disable() for RT reasons
> > since remaining on the same CPU was enough and preempt_disable() was the
> > only way to enforce it at the time.
> > And I think Linus requested migrate_disable() to work regardless of RT
> > which PeterZ made happen (for different reasons, not BPF related).
>
> Yes, migrate_disable() prevents migration either way, but it does not
> prevent preemption, which is what was needed in non-PREEMPT_RT kernels
> last I checked.
BPF in general sometimes relies on per-CPU variables. Sometimes it is
needed to avoid reentrancy which is what preempt_disable() provides for
the same context. This is usually handled where it is required and when
is removed, it is added back shortly. See for instance
https://lore.kernel.org/all/20251114064922.11650-1-chandna.sahil@gmail.com/
:)
> Thanx, Paul
Sebastian
On Fri, Nov 14, 2025 at 06:41:59PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-14 09:25:06 [-0800], Paul E. McKenney wrote:
> > On Fri, Nov 14, 2025 at 06:10:52PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2025-11-14 09:00:21 [-0800], Paul E. McKenney wrote:
> > > > > > Where in PREEMPT_RT we do not disable preemption around the tracepoint
> > > > > > callback, but in non RT we do. Instead it uses a srcu and migrate disable.
> > > > >
> > > > > I appreciate the effort. I really do. But why can't we have SRCU on both
> > > > > configs?
> > > >
> > > > Due to performance concerns for non-RT kernels and workloads, where we
> > > > really need preemption disabled.
> > >
> > > This means srcu_read_lock_notrace() is much more overhead compared to
> > > rcu_read_lock_sched_notrace()?
> > > I am a bit afraid of different bugs here and there.
> >
> > No, the concern is instead overhead due to any actual preemption. So the
> > goal is to actually disable preemption across the BPF program *except*
> > in PREEMPT_RT kernels.
>
> Overhead of actual preemption while the BPF callback of the trace-event
> is invoked?
> So we get rid of the preempt_disable() in the trace-point which we had
> due to rcu_read_lock_sched_notrace() and we need to preserve it because
> of preemption while the BPF program is invoked?
> This is also something we want for CONFIG_PREEMPT (LAZY)?
>
> Sorry to be verbose but I try to catch up.

No need to apologize, given my tendency to be verbose. ;-)

> The BPF invocation does not disable preemption for a long time. It
> disables migration since some code uses per-CPU variables here.
>
> For XDP kind of BPF invocations, preemption is disabled (except for RT)
> because those run in NAPI/ softirq context.

Before Steven's pair of patches (one of which Frederic and I are handling
due to it depending on not-yet-mainline SRCU-fast commits), BPF programs
attached to tracepoints ran with preemption disabled. This behavior is
still in mainline. As you reported some time back, this caused problems
for PREEMPT_RT, hence Steven's pair of patches.

But although we do want to fix PREEMPT_RT, we don't want to break other
kernel configurations, hence keeping preemption disabled in
non-PREEMPT_RT kernels.

Now perhaps Yonghong will tell us that this has since been shown to not
be a problem for BPF programs attached to tracepoints in non-PREEMPT_RT
kernels. But he has not yet done so, which strongly suggests we keep the
known-to-work preemption-disabled status of BPF programs attached to
tracepoints.

>
> > > > > Also why does tracepoint_guard() need to disable migration? The BPF
> > > > > program already disables migrations (see for instance
> > > > > bpf_prog_run_array()).
> > > > > This is true for RT and !RT. So there is no need to do it here.
> > > >
> > > > The addition of migration disabling was in response to failures, which
> > > > this fixed. Or at least greatly reduced the probability of! Let's see...
> > > > That migrate_disable() has been there since 2022, so the failures were
> > > > happening despite it. Adding Yonghong on CC for his perspective.
> > >
> > > Okay. In general I would prefer that we know why we do it. BPF had
> > > preempt_disable() which was turned into migrate_disable() for RT reasons
> > > since remaining on the same CPU was enough and preempt_disable() was the
> > > only way to enforce it at the time.
> > > And I think Linus requested migrate_disable() to work regardless of RT
> > > which PeterZ made happen (for different reasons, not BPF related).
> >
> > Yes, migrate_disable() prevents migration either way, but it does not
> > prevent preemption, which is what was needed in non-PREEMPT_RT kernels
> > last I checked.
>
> BPF in general sometimes relies on per-CPU variables. Sometimes it is
> needed to avoid reentrancy which is what preempt_disable() provides for
> the same context. This is usually handled where it is required and when
> it is removed, it is added back shortly. See for instance
> https://lore.kernel.org/all/20251114064922.11650-1-chandna.sahil@gmail.com/
>
> :)

Agreed, and that was why I added the migrate_disable() calls earlier,
calls that in Steven's more recent version of this patch just now
conflicted with Steven's other patch in -next. ;-)

							Thanx, Paul
On Fri, 14 Nov 2025 09:25:06 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:
> > This means srcu_read_lock_notrace() is much more overhead compared to
> > rcu_read_lock_sched_notrace()?
> > I am a bit afraid of different bugs here and there.
>
> No, the concern is instead overhead due to any actual preemption. So the
> goal is to actually disable preemption across the BPF program *except*
> in PREEMPT_RT kernels.

If this is a BPF issue only, can we move this logic into the tracepoint
callbacks that BPF uses? Because, as we can see in this patch, this logic
has a ripple effect throughout the tracing code where it may not be needed.

I see that the callbacks seem to call the bpf_func directly. Could there
be some kind of wrapper around these?

-- Steve
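For what it's worth, the wrapper Steve is asking about could be as small as
the sketch below. The function names are hypothetical; it is only meant to
show where the !RT preempt-disable could live if it moved out of the generic
tracepoint path.

#include <linux/preempt.h>

/*
 * Hypothetical sketch of a BPF-only wrapper: the generic tracepoint code
 * would no longer disable preemption, and only the BPF attachment point
 * would preserve the historical no-preemption guarantee on !RT.
 * example_run_bpf_prog() stands in for whatever invokes the attached
 * program; it is not a real kernel function.
 */
static void example_bpf_tp_wrapper(void (*example_run_bpf_prog)(void *ctx),
				   void *ctx)
{
	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
		/* RT: stay preemptible; the BPF core disables migration itself */
		example_run_bpf_prog(ctx);
	} else {
		/* !RT: keep the known-to-work preemption-disabled behavior */
		preempt_disable_notrace();
		example_run_bpf_prog(ctx);
		preempt_enable_notrace();
	}
}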
On Fri, 14 Nov 2025 17:33:30 +0100
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > Where in PREEMPT_RT we do not disable preemption around the tracepoint
> > callback, but in non RT we do. Instead it uses a srcu and migrate disable.
>
> I appreciate the effort. I really do. But why can't we have SRCU on both
> configs?
I don't know. Is there more overhead with disabling migration than
disabling preemption?
>
> Also why does tracepoint_guard() need to disable migration? The BPF
> program already disables migrations (see for instance
> bpf_prog_run_array()).
We also would need to audit all tracepoint callbacks, as there may be some
assumptions about staying on the same CPU.
> This is true for RT and !RT. So there is no need to do it here.
>
> > The migrate_disable in the syscall tracepoint (which gets called by the
> > system call version that doesn't disable migration, even in RT), needs to
> > disable migration so that the accounting that happens in:
> >
> > trace_event_buffer_reserve()
> >
> > matches what happens when that function gets called by a normal tracepoint
> > callback.
>
> buh. But this is something. If we know that the call chain does not
> disable migration, couldn't we just use a different function? I mean we
> have tracing_gen_ctx_dec() and tracing_gen_ctx(). Wouldn't this work
> for migrate_disable(), too?
> Just in case we need it and can not avoid it, see above.
I thought about that too. It would then create two different
trace_event_buffer_reserve():
static __always_inline void *event_buffer_reserve(struct trace_event_buffer *fbuffer,
						  struct trace_event_file *trace_file,
						  unsigned long len, bool dec)
{
	struct trace_event_call *event_call = trace_file->event_call;

	if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
	    trace_event_ignore_this_pid(trace_file))
		return NULL;

	/*
	 * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
	 * preemption (adding one to the preempt_count). Since we are
	 * interested in the preempt_count at the time the tracepoint was
	 * hit, we need to subtract one to offset the increment.
	 */
	fbuffer->trace_ctx = dec ? tracing_gen_ctx_dec() : tracing_gen_ctx();
	fbuffer->trace_file = trace_file;

	fbuffer->event =
		trace_event_buffer_lock_reserve(&fbuffer->buffer, trace_file,
						event_call->event.type, len,
						fbuffer->trace_ctx);
	if (!fbuffer->event)
		return NULL;

	fbuffer->regs = NULL;
	fbuffer->entry = ring_buffer_event_data(fbuffer->event);
	return fbuffer->entry;
}

void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
				 struct trace_event_file *trace_file,
				 unsigned long len)
{
	return event_buffer_reserve(fbuffer, trace_file, len, true);
}

void *trace_syscall_event_buffer_reserve(struct trace_event_buffer *fbuffer,
					 struct trace_event_file *trace_file,
					 unsigned long len)
{
	return event_buffer_reserve(fbuffer, trace_file, len, false);
}
Hmm
-- Steve
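A rough usage sketch of the second variant proposed above, just to show
where the non-decrementing call would sit. The handler, entry layout and
argument handling here are invented for illustration; only the choice of
trace_syscall_event_buffer_reserve() over trace_event_buffer_reserve() is
the point.

#include <linux/string.h>
#include <linux/trace_events.h>

struct example_syscall_entry {
	unsigned long args[6];
};

/*
 * Invented caller: a syscall trace-event handler that is reached without
 * the tracepoint's preempt_disable(), so it uses the variant that does
 * not subtract one from the recorded preempt count.
 */
static void example_syscall_enter(struct trace_event_file *trace_file,
				  unsigned long *args, int nr_args)
{
	struct trace_event_buffer fbuffer;
	struct example_syscall_entry *entry;

	entry = trace_syscall_event_buffer_reserve(&fbuffer, trace_file,
						   sizeof(*entry));
	if (!entry)
		return;

	memcpy(entry->args, args, nr_args * sizeof(*args));
	trace_event_buffer_commit(&fbuffer);
}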
On 2025-11-14 11:48:28 [-0500], Steven Rostedt wrote:
> On Fri, 14 Nov 2025 17:33:30 +0100
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
> > > Where in PREEMPT_RT we do not disable preemption around the tracepoint
> > > callback, but in non RT we do. Instead it uses a srcu and migrate disable.
> >
> > I appreciate the effort. I really do. But why can't we have SRCU on both
> > configs?
>
> I don't know. Is there more overhead with disabling migration than
> disabling preemption?
On the first and last invocation, yes. But if disabling migration is
not required for SRCU then why do it?
We had preemption disabled due to rcu_read_lock_sched(), a tracepoint
requirement which was not spelled out. This appears to be replaced with
srcu_fast(). I just don't see why we need two flavours here (RT vs !RT)
and where the migrate_disable() requirement comes from.
> >
> > Also why does tracepoint_guard() need to disable migration? The BPF
> > program already disables migrations (see for instance
> > bpf_prog_run_array()).
>
> We also would need to audit all tracepoint callbacks, as there may be some
> assumptions about staying on the same CPU.
Sure. Okay. What would I need to grep for in order to audit it?
> > This is true for RT and !RT. So there is no need to do it here.
> >
> > > The migrate_disable in the syscall tracepoint (which gets called by the
> > > system call version that doesn't disable migration, even in RT), needs to
> > > disable migration so that the accounting that happens in:
> > >
> > > trace_event_buffer_reserve()
> > >
> > > matches what happens when that function gets called by a normal tracepoint
> > > callback.
> >
> > buh. But this is something. If we know that the call chain does not
> > disable migration, couldn't we just use a different function? I mean we
> > have tracing_gen_ctx_dec() and tracing_gen_ctx(). Wouldn't this work
> > for migrate_disable(), too?
> > Just in case we need it and can not avoid it, see above.
>
> I thought about that too. It would then create two different
> trace_event_buffer_reserve():
>
> static __always_inline void *event_buffer_reserve(struct trace_event_buffer *fbuffer,
> 						   struct trace_event_file *trace_file,
> 						   unsigned long len, bool dec)
> {
> 	struct trace_event_call *event_call = trace_file->event_call;
>
> 	if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
> 	    trace_event_ignore_this_pid(trace_file))
> 		return NULL;
>
> 	/*
> 	 * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
> 	 * preemption (adding one to the preempt_count). Since we are
> 	 * interested in the preempt_count at the time the tracepoint was
> 	 * hit, we need to subtract one to offset the increment.
> 	 */
> 	fbuffer->trace_ctx = dec ? tracing_gen_ctx_dec() : tracing_gen_ctx();
> 	fbuffer->trace_file = trace_file;
>
> 	fbuffer->event =
> 		trace_event_buffer_lock_reserve(&fbuffer->buffer, trace_file,
> 						event_call->event.type, len,
> 						fbuffer->trace_ctx);
> 	if (!fbuffer->event)
> 		return NULL;
>
> 	fbuffer->regs = NULL;
> 	fbuffer->entry = ring_buffer_event_data(fbuffer->event);
> 	return fbuffer->entry;
> }
>
> void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
> 				 struct trace_event_file *trace_file,
> 				 unsigned long len)
> {
> 	return event_buffer_reserve(fbuffer, trace_file, len, true);
> }
>
> void *trace_syscall_event_buffer_reserve(struct trace_event_buffer *fbuffer,
> 					 struct trace_event_file *trace_file,
> 					 unsigned long len)
> {
> 	return event_buffer_reserve(fbuffer, trace_file, len, false);
> }
>
> Hmm
Yeah. I *think* in the preempt case we always use the one or the other.
So I would prefer this instead of explicitly disabling migration so that
a function down in the stack can decrement the counter again.
Ideally, we don't disable migration to begin with.
_If_ the BPF program disables migrations before invocation of its
program then any trace recording that happens within this program
_should_ record the migration counter at that time. Which would be 1 at
the minimum.
> -- Steve
Sebastian
On Fri, 14 Nov 2025 18:02:32 +0100
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > I don't know. Is there more overhead with disabling migration than
> > disabling preemption?
>
> On the first and last invocation, yes. But if disabling migration is
> not required for SRCU then why do it?
I'll yield to the BPF experts here.
> >
> > We also would need to audit all tracepoint callbacks, as there may be some
> > assumptions about staying on the same CPU.
>
> Sure. Okay. What would I need to grep for in order to audit it?
Probably anything that uses per-cpu or smp_processor_id().
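The pattern such a grep would be hunting for is roughly the following.
This is a hypothetical callback, shown only to illustrate the kind of
hidden same-CPU assumption being discussed.

#include <linux/percpu.h>

/* Hypothetical tracepoint callback, not real kernel code. */
static DEFINE_PER_CPU(unsigned long, example_hit_count);

static void example_probe(void *data, unsigned long arg)
{
	unsigned long *cnt = this_cpu_ptr(&example_hit_count);

	/*
	 * Non-atomic read-modify-write of a per-CPU counter: only safe if
	 * the callback cannot migrate to another CPU (or be preempted by
	 * another user of the same counter) between the load and the store.
	 */
	*cnt = *cnt + 1;
}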
> > void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
> > struct trace_event_file *trace_file,
> > unsigned long len)
> > {
> > return event_buffer_reserve(fbuffer, trace_file, len, true);
> > }
> >
> > void *trace_syscall_event_buffer_reserve(struct trace_event_buffer *fbuffer,
> > struct trace_event_file *trace_file,
> > unsigned long len)
> > {
> > return event_buffer_reserve(fbuffer, trace_file, len, false);
> > }
> >
> > Hmm
>
> Yeah. I *think* in the preempt case we always use the one or the other.
OK, we can do this instead. Probably cleaner anyway.
>
> So I would prefer this instead of explicitly disabling migration so that
> a function down in the stack can decrement the counter again.
> Ideally, we don't disable migration to begin with.
>
> _If_ the BPF program disables migrations before invocation of its
> program then any trace recording that happens within this program
> _should_ record the migration counter at that time. Which would be 1 at
> the minimum.
Again, I yield to the BPF folks.
Frederic, it may be good to zap this patch from your repo. It looks like it
still needs more work.
Thanks,
-- Steve