[PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast

Steven Rostedt posted 1 patch 4 weeks, 1 day ago
There is a newer version of this series
[PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 4 weeks, 1 day ago
From: "Paul E. McKenney" <paulmck@kernel.org>

The current use of guard(preempt_notrace)() within __DECLARE_TRACE()
to protect invocation of __DO_TRACE_CALL() means that BPF programs
attached to tracepoints are non-preemptible.  This is unhelpful in
real-time systems, whose users apparently wish to use BPF while also
achieving low latencies.  (Who knew?)

One option would be to use preemptible RCU, but this introduces
many opportunities for infinite recursion, which many consider to
be counterproductive, especially given the relatively small stacks
provided by the Linux kernel.  These opportunities could be shut down
by sufficiently energetic duplication of code, but this sort of thing
is considered impolite in some circles.

Therefore, use the shiny new SRCU-fast API, which provides somewhat faster
readers than those of preemptible RCU, at least on Paul E. McKenney's
laptop, where task_struct access is more expensive than access to per-CPU
variables.  And SRCU-fast provides way faster readers than does SRCU,
courtesy of being able to avoid the read-side use of smp_mb().  Also,
it is quite straightforward to create srcu_read_{,un}lock_fast_notrace()
functions.
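
For reference, a minimal sketch of what such wrappers might look like,
assuming they simply mirror srcu_read_lock_fast() and
srcu_read_unlock_fast() without the tracing instrumentation (the bodies
below are an assumption, not the definitions from the SRCU series):

	/* Sketch only: assumed shape of the notrace read-side wrappers. */
	static inline struct srcu_ctr __percpu *notrace
	srcu_read_lock_fast_notrace(struct srcu_struct *ssp)
	{
		/* Assumed: same per-CPU counter update as srcu_read_lock_fast(). */
		return __srcu_read_lock_fast(ssp);
	}

	static inline void notrace
	srcu_read_unlock_fast_notrace(struct srcu_struct *ssp,
				      struct srcu_ctr __percpu *scp)
	{
		/* Assumed: pairs with the lock above, again without tracing. */
		__srcu_read_unlock_fast(ssp, scp);
	}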

While in the area, SRCU now supports early boot call_srcu().  Therefore,
remove the checks that used to avoid such use from rcu_free_old_probes()
before this commit was applied:

e53244e2c893 ("tracepoint: Remove SRCU protection")

The current commit can be thought of as an approximate revert of that
commit, with some compensating additions of preemption disabling.
This preemption disabling uses guard(preempt_notrace)().

However, Yonghong Song points out that BPF assumes that non-sleepable
BPF programs will remain on the same CPU, which means that migration
must be disabled whenever preemption remains enabled.  In addition,
non-RT kernels have performance expectations that would be violated by
allowing the BPF programs to be preempted.

Therefore, continue to disable preemption in non-RT kernels, and protect
the BPF program with both SRCU and migration disabling for RT kernels,
and even then only if preemption is not already disabled.
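
In code terms, the protection around __DO_TRACE_CALL() then reduces to
roughly the following (a simplified expansion for an imaginary
tracepoint "foo" with one int argument; the real code goes through the
tracepoint_guard() macro in the diff below):

	static inline void __do_trace_foo(int arg)
	{
		if (cond) {			/* the TP_CONDITION, if any */
	#ifdef CONFIG_PREEMPT_RT
			/* SRCU-fast read side plus migration disabling */
			guard(srcu_fast_notrace)(&tracepoint_srcu);
			guard(migrate)();
	#else
			/* preemption disabled, as before this patch */
			guard(preempt_notrace)();
	#endif
			__DO_TRACE_CALL(foo, arg);
		}
	}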

Link: https://lore.kernel.org/all/20250613152218.1924093-1-bigeasy@linutronix.de/

Co-developed-by: Steve Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v4: https://lore.kernel.org/all/20251216120819.3499e00e@gandalf.local.home/

- Added kerneldoc to trace_event_buffer_reserve{_syscall}

- Restructured the tracepoint.c code to not have #ifdef within functions

 include/linux/trace_events.h  | 24 ++++++++++++++
 include/linux/tracepoint.h    | 25 ++++++++++++--
 include/trace/perf.h          |  4 +--
 include/trace/trace_events.h  | 21 ++++++++++--
 kernel/trace/trace_events.c   | 61 +++++++++++++++++++++++++++++------
 kernel/trace/trace_syscalls.c |  4 +--
 kernel/tracepoint.c           | 44 +++++++++++++++++++++++++
 7 files changed, 164 insertions(+), 19 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 3690221ba3d8..a2704c35eda8 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -222,6 +222,26 @@ static inline unsigned int tracing_gen_ctx_dec(void)
 	return trace_ctx;
 }
 
+/*
+ * When PREEMPT_RT is enabled, trace events are called with disabled
+ * migration. The trace events need to know if the tracepoint disabled
+ * migration or not so that what is recorded to the ring buffer shows
+ * the state of when the trace event triggered, and not the state caused
+ * by the trace event.
+ */
+#ifdef CONFIG_PREEMPT_RT
+static inline unsigned int tracing_gen_ctx_dec_cond(void)
+{
+	unsigned int trace_ctx;
+
+	trace_ctx = tracing_gen_ctx_dec();
+	/* The migration counter starts at bit 4 */
+	return trace_ctx - (1 << 4);
+}
+#else
+# define tracing_gen_ctx_dec_cond() tracing_gen_ctx_dec()
+#endif
+
 struct trace_event_file;
 
 struct ring_buffer_event *
@@ -313,6 +333,10 @@ void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
 				  struct trace_event_file *trace_file,
 				  unsigned long len);
 
+void *trace_event_buffer_reserve_syscall(struct trace_event_buffer *fbuffer,
+					 struct trace_event_file *trace_file,
+					 unsigned long len);
+
 void trace_event_buffer_commit(struct trace_event_buffer *fbuffer);
 
 enum {
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 8a56f3278b1b..0563c7d9fcb2 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -100,6 +100,25 @@ void for_each_tracepoint_in_module(struct module *mod,
 }
 #endif /* CONFIG_MODULES */
 
+/*
+ * BPF programs can attach to the tracepoint callbacks. But if the
+ * callbacks are called with preemption disabled, the BPF programs
+ * can cause quite a bit of latency. When PREEMPT_RT is enabled,
+ * instead of disabling preemption, use srcu_fast_notrace() for
+ * synchronization. As BPF programs that are attached to tracepoints
+ * expect to stay on the same CPU, also disable migration.
+ */
+#ifdef CONFIG_PREEMPT_RT
+extern struct srcu_struct tracepoint_srcu;
+# define tracepoint_sync() synchronize_srcu(&tracepoint_srcu);
+# define tracepoint_guard()				\
+	guard(srcu_fast_notrace)(&tracepoint_srcu);	\
+	guard(migrate)()
+#else
+# define tracepoint_sync() synchronize_rcu();
+# define tracepoint_guard() guard(preempt_notrace)()
+#endif
+
 /*
  * tracepoint_synchronize_unregister must be called between the last tracepoint
  * probe unregistration and the end of module exit to make sure there is no
@@ -115,7 +134,7 @@ void for_each_tracepoint_in_module(struct module *mod,
 static inline void tracepoint_synchronize_unregister(void)
 {
 	synchronize_rcu_tasks_trace();
-	synchronize_rcu();
+	tracepoint_sync();
 }
 static inline bool tracepoint_is_faultable(struct tracepoint *tp)
 {
@@ -275,13 +294,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		return static_branch_unlikely(&__tracepoint_##name.key);\
 	}
 
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto)		\
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto)			\
 	__DECLARE_TRACE_COMMON(name, PARAMS(proto), PARAMS(args), PARAMS(data_proto)) \
 	static inline void __do_trace_##name(proto)			\
 	{								\
 		TRACEPOINT_CHECK(name)					\
 		if (cond) {						\
-			guard(preempt_notrace)();			\
+			tracepoint_guard();				\
 			__DO_TRACE_CALL(name, TP_ARGS(args));		\
 		}							\
 	}								\
diff --git a/include/trace/perf.h b/include/trace/perf.h
index a1754b73a8f5..348ad1d9b556 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -71,6 +71,7 @@ perf_trace_##call(void *__data, proto)					\
 	u64 __count __attribute__((unused));				\
 	struct task_struct *__task __attribute__((unused));		\
 									\
+	guard(preempt_notrace)();					\
 	do_perf_trace_##call(__data, args);				\
 }
 
@@ -85,9 +86,8 @@ perf_trace_##call(void *__data, proto)					\
 	struct task_struct *__task __attribute__((unused));		\
 									\
 	might_fault();							\
-	preempt_disable_notrace();					\
+	guard(preempt_notrace)();					\
 	do_perf_trace_##call(__data, args);				\
-	preempt_enable_notrace();					\
 }
 
 /*
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 4f22136fd465..6fb58387e9f1 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -429,6 +429,22 @@ do_trace_event_raw_event_##call(void *__data, proto)			\
 	trace_event_buffer_commit(&fbuffer);				\
 }
 
+/*
+ * When PREEMPT_RT is enabled, the tracepoint does not disable preemption
+ * but instead disables migration. The callbacks for the trace events
+ * need to have a consistent state so that it can reflect the proper
+ * preempt_disabled counter.
+ */
+#ifdef CONFIG_PREEMPT_RT
+/* disable preemption for RT so that the counters still match */
+# define trace_event_guard() guard(preempt_notrace)()
+/* Have syscalls up the migrate disable counter to emulate non-syscalls */
+# define trace_syscall_event_guard() guard(migrate)()
+#else
+# define trace_event_guard()
+# define trace_syscall_event_guard()
+#endif
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 __DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
@@ -436,6 +452,7 @@ __DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
 static notrace void							\
 trace_event_raw_event_##call(void *__data, proto)			\
 {									\
+	trace_event_guard();						\
 	do_trace_event_raw_event_##call(__data, args);			\
 }
 
@@ -447,9 +464,9 @@ static notrace void							\
 trace_event_raw_event_##call(void *__data, proto)			\
 {									\
 	might_fault();							\
-	preempt_disable_notrace();					\
+	trace_syscall_event_guard();					\
+	guard(preempt_notrace)();					\
 	do_trace_event_raw_event_##call(__data, args);			\
-	preempt_enable_notrace();					\
 }
 
 /*
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 137b4d9bb116..758c8a4ec7c2 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -649,9 +649,9 @@ bool trace_event_ignore_this_pid(struct trace_event_file *trace_file)
 }
 EXPORT_SYMBOL_GPL(trace_event_ignore_this_pid);
 
-void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
-				 struct trace_event_file *trace_file,
-				 unsigned long len)
+static __always_inline void *buffer_reserve(struct trace_event_buffer *fbuffer,
+					    struct trace_event_file *trace_file,
+					    unsigned long len)
 {
 	struct trace_event_call *event_call = trace_file->event_call;
 
@@ -659,13 +659,6 @@ void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
 	    trace_event_ignore_this_pid(trace_file))
 		return NULL;
 
-	/*
-	 * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
-	 * preemption (adding one to the preempt_count). Since we are
-	 * interested in the preempt_count at the time the tracepoint was
-	 * hit, we need to subtract one to offset the increment.
-	 */
-	fbuffer->trace_ctx = tracing_gen_ctx_dec();
 	fbuffer->trace_file = trace_file;
 
 	fbuffer->event =
@@ -679,8 +672,56 @@ void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
 	fbuffer->entry = ring_buffer_event_data(fbuffer->event);
 	return fbuffer->entry;
 }
+
+/**
+ * trace_event_buffer_reserve - reserve space on the ring buffer for an event
+ * @fbuffer: information about how to save the event
+ * @trace_file: the instance file descriptor for the event
+ * @len: The length of the event
+ *
+ * The @fbuffer has information about the ring buffer and data will
+ * be added to it to be used by the call to trace_event_buffer_commit().
+ * The @trace_file is the desrciptor with information about the status
+ * of the given event for a specific trace_array instance.
+ * The @len is the length of data to save for the event.
+ *
+ * Returns a pointer to the data on the ring buffer or NULL if the
+ *   event was not reserved (event was filtered, too big, or the buffer
+ *   simply was disabled for write).
+ */
+void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
+				 struct trace_event_file *trace_file,
+				 unsigned long len)
+{
+	fbuffer->trace_ctx = tracing_gen_ctx_dec_cond();
+	return buffer_reserve(fbuffer, trace_file, len);
+}
 EXPORT_SYMBOL_GPL(trace_event_buffer_reserve);
 
+/**
+ * trace_event_buffer_reserve_syscall - reserve space for a syscall event
+ * @fbuffer: information about how to save the event
+ * @trace_file: the instance file descriptor for the event
+ * @len: The length of the event
+ *
+ * This behaves exactly the same as trace_event_buffer_reserve(), but
+ * needs to act slightly different for system call events when PREEMPT_RT
+ * is enabled. The way the preempt count and migration disabled are
+ * set is different than trace_event_buffer_reserve(), as normal events
+ * have migration disabled instead of preemption in PREEMPT_RT, system
+ * call events do not disable migrating but still disable preemption.
+ *
+ * Returns the same as trace_event_buffer_reserve().
+ */
+void *trace_event_buffer_reserve_syscall(struct trace_event_buffer *fbuffer,
+					 struct trace_event_file *trace_file,
+					 unsigned long len)
+{
+	fbuffer->trace_ctx = tracing_gen_ctx_dec();
+	return buffer_reserve(fbuffer, trace_file, len);
+}
+
+
 int trace_event_reg(struct trace_event_call *call,
 		    enum trace_reg type, void *data)
 {
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index e96d0063cbcf..f330fd22ea78 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -909,7 +909,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 
 	size += sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
 
-	entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
+	entry = trace_event_buffer_reserve_syscall(&fbuffer, trace_file, size);
 	if (!entry)
 		return;
 
@@ -955,7 +955,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	if (!sys_data)
 		return;
 
-	entry = trace_event_buffer_reserve(&fbuffer, trace_file, sizeof(*entry));
+	entry = trace_event_buffer_reserve_syscall(&fbuffer, trace_file, sizeof(*entry));
 	if (!entry)
 		return;
 
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 62719d2941c9..17e11d91d029 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -34,9 +34,32 @@ enum tp_transition_sync {
 
 struct tp_transition_snapshot {
 	unsigned long rcu;
+	unsigned long srcu_gp;
 	bool ongoing;
 };
 
+/* In PREEMPT_RT, SRCU is used to protect the tracepoint callbacks */
+#ifdef CONFIG_PREEMPT_RT
+DEFINE_SRCU_FAST(tracepoint_srcu);
+EXPORT_SYMBOL_GPL(tracepoint_srcu);
+static inline void start_rt_synchronize(struct tp_transition_snapshot *snapshot)
+{
+	snapshot->srcu_gp = start_poll_synchronize_srcu(&tracepoint_srcu);
+}
+static inline void poll_rt_synchronize(struct tp_transition_snapshot *snapshot)
+{
+	if (!poll_state_synchronize_srcu(&tracepoint_srcu, snapshot->srcu_gp))
+		synchronize_srcu(&tracepoint_srcu);
+}
+#else
+static inline void start_rt_synchronize(struct tp_transition_snapshot *snapshot)
+{
+}
+static inline void poll_rt_synchronize(struct tp_transition_snapshot *snapshot)
+{
+}
+#endif
+
 /* Protected by tracepoints_mutex */
 static struct tp_transition_snapshot tp_transition_snapshot[_NR_TP_TRANSITION_SYNC];
 
@@ -46,6 +69,7 @@ static void tp_rcu_get_state(enum tp_transition_sync sync)
 
 	/* Keep the latest get_state snapshot. */
 	snapshot->rcu = get_state_synchronize_rcu();
+	start_rt_synchronize(snapshot);
 	snapshot->ongoing = true;
 }
 
@@ -56,6 +80,7 @@ static void tp_rcu_cond_sync(enum tp_transition_sync sync)
 	if (!snapshot->ongoing)
 		return;
 	cond_synchronize_rcu(snapshot->rcu);
+	poll_rt_synchronize(snapshot);
 	snapshot->ongoing = false;
 }
 
@@ -101,10 +126,22 @@ static inline void *allocate_probes(int count)
 	return p == NULL ? NULL : p->probes;
 }
 
+#ifdef CONFIG_PREEMPT_RT
+static void srcu_free_old_probes(struct rcu_head *head)
+{
+	kfree(container_of(head, struct tp_probes, rcu));
+}
+
+static void rcu_free_old_probes(struct rcu_head *head)
+{
+	call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
+}
+#else
 static void rcu_free_old_probes(struct rcu_head *head)
 {
 	kfree(container_of(head, struct tp_probes, rcu));
 }
+#endif
 
 static inline void release_probes(struct tracepoint *tp, struct tracepoint_func *old)
 {
@@ -112,6 +149,13 @@ static inline void release_probes(struct tracepoint *tp, struct tracepoint_func
 		struct tp_probes *tp_probes = container_of(old,
 			struct tp_probes, probes[0]);
 
+		/*
+		 * Tracepoint probes are protected by either RCU or
+		 * Tasks Trace RCU and also by SRCU.  By calling the SRCU
+		 * callback in the [Tasks Trace] RCU callback we cover
+		 * both cases. So let us chain the SRCU and [Tasks Trace]
+		 * RCU callbacks to wait for both grace periods.
+		 */
 		if (tracepoint_is_faultable(tp))
 			call_rcu_tasks_trace(&tp_probes->rcu, rcu_free_old_probes);
 		else
-- 
2.51.0
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Mathieu Desnoyers 4 weeks, 1 day ago
On 2026-01-08 22:05, Steven Rostedt wrote:
> From: "Paul E. McKenney" <paulmck@kernel.org>
[...]

I disagree with many elements of the proposed approach.

On one end we have BPF wanting to hook on arbitrary tracepoints without
adding significant latency to PREEMPT RT kernels.

On the other hand, we have high-speed tracers which execute very short
critical sections to serialize trace data into ring buffers.

All of those users register to the tracepoint API.

We also have to consider that migrate disable is *not* cheap at all
compared to preempt disable.

So I'm wondering why all tracepoint users need to pay the migrate
disable runtime overhead on preempt RT kernels for the sake of BPF ?

Using SRCU-fast to protect tracepoint callback iteration makes sense
for preempt-rt, but I'd recommend moving the migrate disable guard
within the bpf callback code rather than slowing down other tracers
which execute within a short amount of time. Other tracers can then
choose to disable preemption rather than migration if that's a better
fit for their needs.

> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 3690221ba3d8..a2704c35eda8 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -222,6 +222,26 @@ static inline unsigned int tracing_gen_ctx_dec(void)
>   	return trace_ctx;
>   }
>   
> +/*
> + * When PREEMPT_RT is enabled, trace events are called with disabled
> + * migration. The trace events need to know if the tracepoint disabled
> + * migration or not so that what is recorded to the ring buffer shows
> + * the state of when the trace event triggered, and not the state caused
> + * by the trace event.
> + */
> +#ifdef CONFIG_PREEMPT_RT
> +static inline unsigned int tracing_gen_ctx_dec_cond(void)
> +{
> +	unsigned int trace_ctx;
> +
> +	trace_ctx = tracing_gen_ctx_dec();
> +	/* The migration counter starts at bit 4 */
> +	return trace_ctx - (1 << 4);

We should turn this hardcoded "4" value into an enum label or a
define. That define should be exposed by tracepoint.h. We should
not hardcode expectations about the implementation of distinct APIs
across the tracing subsystem.
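
For illustration, something along these lines (the names below are
hypothetical, not proposed API, and the CONFIG_PREEMPT_RT conditional
is omitted for brevity):

	/* include/linux/tracepoint.h: where the migrate-disable count
	 * sits in the trace_ctx word, instead of a bare "4" at the
	 * user site. */
	#define TRACE_CTX_MIGRATE_SHIFT		4

	/* include/linux/trace_events.h would then use the symbol: */
	static inline unsigned int tracing_gen_ctx_dec_cond(void)
	{
		return tracing_gen_ctx_dec() - (1 << TRACE_CTX_MIGRATE_SHIFT);
	}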

[...]

> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -100,6 +100,25 @@ void for_each_tracepoint_in_module(struct module *mod,
>   }
>   #endif /* CONFIG_MODULES */
>   
> +/*
> + * BPF programs can attach to the tracepoint callbacks. But if the
> + * callbacks are called with preemption disabled, the BPF programs
> + * can cause quite a bit of latency. When PREEMPT_RT is enabled,
> + * instead of disabling preemption, use srcu_fast_notrace() for
> + * synchronization. As BPF programs that are attached to tracepoints
> + * expect to stay on the same CPU, also disable migration.
> + */
> +#ifdef CONFIG_PREEMPT_RT
> +extern struct srcu_struct tracepoint_srcu;
> +# define tracepoint_sync() synchronize_srcu(&tracepoint_srcu);
> +# define tracepoint_guard()				\
> +	guard(srcu_fast_notrace)(&tracepoint_srcu);	\
> +	guard(migrate)()
> +#else
> +# define tracepoint_sync() synchronize_rcu();
> +# define tracepoint_guard() guard(preempt_notrace)()
> +#endif

Doing migrate disable on PREEMPT RT for BPF vs preempt disable in other
tracers should come in a separate preparation patch. It belongs to the
tracers, not to tracepoints.

[...]
				\
> diff --git a/include/trace/perf.h b/include/trace/perf.h
> index a1754b73a8f5..348ad1d9b556 100644
> --- a/include/trace/perf.h
> +++ b/include/trace/perf.h
> @@ -71,6 +71,7 @@ perf_trace_##call(void *__data, proto)					\
>   	u64 __count __attribute__((unused));				\
>   	struct task_struct *__task __attribute__((unused));		\
>   									\
> +	guard(preempt_notrace)();					\
>   	do_perf_trace_##call(__data, args);				\
>   }
>   
> @@ -85,9 +86,8 @@ perf_trace_##call(void *__data, proto)					\
>   	struct task_struct *__task __attribute__((unused));		\
>   									\
>   	might_fault();							\
> -	preempt_disable_notrace();					\
> +	guard(preempt_notrace)();					\
>   	do_perf_trace_##call(__data, args);				\
> -	preempt_enable_notrace();	

Move this to a perf-specific preparation patch.
>   }
>   
>   /*
> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
> index 4f22136fd465..6fb58387e9f1 100644
> --- a/include/trace/trace_events.h
> +++ b/include/trace/trace_events.h
> @@ -429,6 +429,22 @@ do_trace_event_raw_event_##call(void *__data, proto)			\
>   	trace_event_buffer_commit(&fbuffer);				\
>   }
>   
> +/*
> + * When PREEMPT_RT is enabled, the tracepoint does not disable preemption
> + * but instead disables migration. The callbacks for the trace events
> + * need to have a consistent state so that it can reflect the proper
> + * preempt_disabled counter.

Having those defines within trace_events.h is poking holes within any
hope of abstraction we can have from the tracepoint.h API. This adds
strong coupling between tracepoint and trace_event.h.

Rather than hardcoding preempt counter expectations across tracepoint
and trace-events, we should expose a #define in tracepoint.h which
will make the preempt counter nesting level available to other
parts of the kernel such as trace_events.h. This way we keep everything
in one place and we don't add cross-references about subtle preempt
counter nesting level details.

> + */
> +#ifdef CONFIG_PREEMPT_RT
> +/* disable preemption for RT so that the counters still match */
> +# define trace_event_guard() guard(preempt_notrace)()
> +/* Have syscalls up the migrate disable counter to emulate non-syscalls */
> +# define trace_syscall_event_guard() guard(migrate)()
> +#else
> +# define trace_event_guard()
> +# define trace_syscall_event_guard()
> +#endif
This should be moved to separate tracer-specific prep patches.
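
For illustration, the kind of abstraction being asked for might look
like this (hypothetical names, not code from the patch):

	/* include/linux/tracepoint.h: the one place that states what
	 * protection the tracepoint guard itself provides. */
	#ifdef CONFIG_PREEMPT_RT
	# define TRACEPOINT_GUARD_DISABLES_PREEMPT	0
	#else
	# define TRACEPOINT_GUARD_DISABLES_PREEMPT	1
	#endif

	/* include/trace/trace_events.h keys off that symbol rather than
	 * testing CONFIG_PREEMPT_RT directly: */
	#if TRACEPOINT_GUARD_DISABLES_PREEMPT
	# define trace_event_guard()
	#else
	# define trace_event_guard()	guard(preempt_notrace)()
	#endif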

[...]

> + * The @trace_file is the desrciptor with information about the status

descriptor

[...]

> + *
> + * Returns a pointer to the data on the ring buffer or NULL if the
> + *   event was not reserved (event was filtered, too big, or the buffer
> + *   simply was disabled for write).

odd spaces here.

[...]

>   
> +#ifdef CONFIG_PREEMPT_RT
> +static void srcu_free_old_probes(struct rcu_head *head)
> +{
> +	kfree(container_of(head, struct tp_probes, rcu));
> +}
> +
> +static void rcu_free_old_probes(struct rcu_head *head)
> +{
> +	call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> +}
> +#else
>   static void rcu_free_old_probes(struct rcu_head *head)
>   {
>   	kfree(container_of(head, struct tp_probes, rcu));
>   }
> +#endif
>   
>   static inline void release_probes(struct tracepoint *tp, struct tracepoint_func *old)
>   {
> @@ -112,6 +149,13 @@ static inline void release_probes(struct tracepoint *tp, struct tracepoint_func
>   		struct tp_probes *tp_probes = container_of(old,
>   			struct tp_probes, probes[0]);
>   
> +		/*
> +		 * Tracepoint probes are protected by either RCU or
> +		 * Tasks Trace RCU and also by SRCU.  By calling the SRCU

I'm confused.

My understanding is that in !RT we have:

- RCU (!tracepoint_is_faultable(tp))
- RCU tasks trace (tracepoint_is_faultable(tp))

And for RT:

- SRCU-fast (!tracepoint_is_faultable(tp))
- RCU tasks trace (tracepoint_is_faultable(tp))

So I don't understand this comment, and also I don't understand why we
need to chain the callbacks rather than just call the appropriate
call_rcu based on the tracepoint "is_faultable" state.

What am I missing ?

> +		 * callback in the [Tasks Trace] RCU callback we cover
> +		 * both cases. So let us chain the SRCU and [Tasks Trace]
> +		 * RCU callbacks to wait for both grace periods.
> +		 */
>   		if (tracepoint_is_faultable(tp))
>   			call_rcu_tasks_trace(&tp_probes->rcu, rcu_free_old_probes);
>   		else

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Alexei Starovoitov 4 weeks, 1 day ago
On Fri, Jan 9, 2026 at 6:45 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2026-01-08 22:05, Steven Rostedt wrote:
> > From: "Paul E. McKenney" <paulmck@kernel.org>
> [...]
>
> I disagree with many elements of the proposed approach.
>
> On one end we have BPF wanting to hook on arbitrary tracepoints without
> adding significant latency to PREEMPT RT kernels.
>
> On the other hand, we have high-speed tracers which execute very short
> critical sections to serialize trace data into ring buffers.
>
> All of those users register to the tracepoint API.
>
> We also have to consider that migrate disable is *not* cheap at all
> compared to preempt disable.

Looks like your complaint comes from lack of engagement in kernel
development.
migrate_disable _was_ not cheap.
Try to benchmark it now.
It's inlined. It's a fraction of extra overhead on top of preempt_disable.
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Yonghong Song 4 weeks, 1 day ago

On 1/9/26 11:10 AM, Alexei Starovoitov wrote:
> On Fri, Jan 9, 2026 at 6:45 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>> On 2026-01-08 22:05, Steven Rostedt wrote:
>>> From: "Paul E. McKenney" <paulmck@kernel.org>
>> [...]
>>
>> I disagree with many elements of the proposed approach.
>>
>> On one end we have BPF wanting to hook on arbitrary tracepoints without
>> adding significant latency to PREEMPT RT kernels.
>>
>> On the other hand, we have high-speed tracers which execute very short
>> critical sections to serialize trace data into ring buffers.
>>
>> All of those users register to the tracepoint API.
>>
>> We also have to consider that migrate disable is *not* cheap at all
>> compared to preempt disable.
> Looks like your complaint comes from lack of engagement in kernel
> development.
> migrate_disable _was_ not cheap.
> Try to benchmark it now.
> It's inlined. It's a fraction of extra overhead on top of preempt_disable.
>
The following are related patches to inline migrate_disable():

35561bab768977c9e05f1f1a9bc00134c85f3e28 arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c
88a90315a99a9120cd471bf681515cc77cd7cdb8 rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h
378b7708194fff77c9020392067329931c3fcc04 sched: Make migrate_{en,dis}able() inline

Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 4 weeks, 1 day ago
On Fri, 9 Jan 2026 11:10:16 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> >
> > We also have to consider that migrate disable is *not* cheap at all
> > compared to preempt disable.  
> 
> Looks like your complaint comes from lack of engagement in kernel
> development.

No need to make comments like that. The Linux kernel is an ocean of code.
It's very hard to keep up on everything that is happening. I knew of work
being done on migrate_disable but I didn't know what the impacts of that
work were. Mathieu is still very much involved and engaged in kernel
development.

> migrate_disable _was_ not cheap.
> Try to benchmark it now.
> It's inlined. It's a fraction of extra overhead on top of preempt_disable.

It would be good to have a benchmark of the two. What about fast_srcu? Is
that fast enough to replace the preempt_disable()? If so, then could we
just make this the same for both RT and !RT?

-- Steve
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Mathieu Desnoyers 4 weeks, 1 day ago
On 2026-01-09 14:19, Steven Rostedt wrote:
> On Fri, 9 Jan 2026 11:10:16 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > >
>>> We also have to consider that migrate disable is *not* cheap at all
>>> compared to preempt disable.
>>
>> Looks like your complaint comes from lack of engagement in kernel
>> development.
> 
> No need to make comments like that. The Linux kernel is an ocean of code.
> It's very hard to keep up on everything that is happening. I knew of work
> being done on migrate_disable but I didn't know what the impacts of that
> work were. Mathieu is still very much involved and engaged in kernel
> development.

Thanks Steven. I guess Alexei missed my recent involvement in other
areas of the kernel.

As Steven pointed out, the kernel is vast, so I cannot keep up with
the progress on every single topic. That being said, I very recently
(about 1 month ago) tried using migrate disable for the RSS tracking
improvements (hierarchical percpu counters), and found that the overhead
of migrate disable was large compared to preempt disable. The generated
assembler is also orders of magnitude larger (on x86-64).

Creating small placeholder functions which just call preempt/migrate
disable and enable for a preempt RT build:

0000000000002a20 <test_preempt_disable>:
     2a20:       f3 0f 1e fa             endbr64
     2a24:       65 ff 05 00 00 00 00    incl   %gs:0x0(%rip)        # 2a2b <test_preempt_disable+0xb>
     2a2b:       e9 00 00 00 00          jmp    2a30 <test_preempt_disable+0x10>

0000000000002a40 <test_preempt_enable>:
     2a40:       f3 0f 1e fa             endbr64
     2a44:       65 ff 0d 00 00 00 00    decl   %gs:0x0(%rip)        # 2a4b <test_preempt_enable+0xb>
     2a4b:       74 05                   je     2a52 <test_preempt_enable+0x12>
     2a4d:       e9 00 00 00 00          jmp    2a52 <test_preempt_enable+0x12>
     2a52:       e8 00 00 00 00          call   2a57 <test_preempt_enable+0x17>
     2a57:       e9 00 00 00 00          jmp    2a5c <test_preempt_enable+0x1c>

0000000000002920 <test_migrate_disable>:
     2920:       f3 0f 1e fa             endbr64
     2924:       65 48 8b 15 00 00 00    mov    %gs:0x0(%rip),%rdx        # 292c <test_migrate_disable+0xc>
     292b:       00
     292c:       0f b7 82 38 07 00 00    movzwl 0x738(%rdx),%eax
     2933:       66 85 c0                test   %ax,%ax
     2936:       74 0f                   je     2947 <test_migrate_disable+0x27>
     2938:       83 c0 01                add    $0x1,%eax
     293b:       66 89 82 38 07 00 00    mov    %ax,0x738(%rdx)
     2942:       e9 00 00 00 00          jmp    2947 <test_migrate_disable+0x27>
     2947:       65 ff 05 00 00 00 00    incl   %gs:0x0(%rip)        # 294e <test_migrate_disable+0x2e>
     294e:       65 48 8b 05 00 00 00    mov    %gs:0x0(%rip),%rax        # 2956 <test_migrate_disable+0x36>
     2955:       00
     2956:       83 80 00 00 00 00 01    addl   $0x1,0x0(%rax)
     295d:       b8 01 00 00 00          mov    $0x1,%eax
     2962:       66 89 82 38 07 00 00    mov    %ax,0x738(%rdx)
     2969:       65 ff 0d 00 00 00 00    decl   %gs:0x0(%rip)        # 2970 <test_migrate_disable+0x50>
     2970:       74 05                   je     2977 <test_migrate_disable+0x57>
     2972:       e9 00 00 00 00          jmp    2977 <test_migrate_disable+0x57>
     2977:       e8 00 00 00 00          call   297c <test_migrate_disable+0x5c>
     297c:       e9 00 00 00 00          jmp    2981 <test_migrate_disable+0x61>

00000000000029a0 <test_migrate_enable>:
     29a0:       f3 0f 1e fa             endbr64
     29a4:       65 48 8b 15 00 00 00    mov    %gs:0x0(%rip),%rdx        # 29ac <test_migrate_enable+0xc>
     29ab:       00
     29ac:       0f b7 82 38 07 00 00    movzwl 0x738(%rdx),%eax
     29b3:       66 85 c0                test   %ax,%ax
     29b6:       74 0f                   je     29c7 <test_migrate_enable+0x27>
     29b8:       83 c0 01                add    $0x1,%eax
     29bb:       66 89 82 38 07 00 00    mov    %ax,0x738(%rdx)
     29c2:       e9 00 00 00 00          jmp    29c7 <test_migrate_enable+0x27>
     29c7:       65 ff 05 00 00 00 00    incl   %gs:0x0(%rip)        # 29ce <test_migrate_enable+0x2e>
     29ce:       65 48 8b 05 00 00 00    mov    %gs:0x0(%rip),%rax        # 29d6 <test_migrate_enable+0x36>
     29d5:       00
     29d6:       83 80 00 00 00 00 01    addl   $0x1,0x0(%rax)
     29dd:       b8 01 00 00 00          mov    $0x1,%eax
     29e2:       66 89 82 38 07 00 00    mov    %ax,0x738(%rdx)
     29e9:       65 ff 0d 00 00 00 00    decl   %gs:0x0(%rip)        # 29f0 <test_migrate_enable+0x50>
     29f0:       74 05                   je     29f7 <test_migrate_enable+0x57>
     29f2:       e9 00 00 00 00          jmp    29f7 <test_migrate_enable+0x57>
     29f7:       e8 00 00 00 00          call   29fc <test_migrate_enable+0x5c>
     29fc:       e9 00 00 00 00          jmp    2a01 <test_migrate_enable+0x61>
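
For reference, placeholder functions behind listings like these are
just thin noinline wrappers, roughly as follows (an assumed
reconstruction, not the exact source used for the listings above):

	#include <linux/preempt.h>
	#include <linux/sched.h>

	/* One wrapper per operation so each shows up as its own symbol. */
	noinline void test_preempt_disable(void)
	{
		preempt_disable();
	}

	noinline void test_preempt_enable(void)
	{
		preempt_enable();
	}

	noinline void test_migrate_disable(void)
	{
		migrate_disable();
	}

	noinline void test_migrate_enable(void)
	{
		migrate_enable();
	}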

> 
>> migrate_disable _was_ not cheap.
>> Try to benchmark it now.
>> It's inlined. It's a fraction of extra overhead on top of preempt_disable.
> 
> It would be good to have a benchmark of the two. What about fast_srcu? Is
> that fast enough to replace the preempt_disable()? If so, then could we
> just make this the same for both RT and !RT?

I've modified kernel/rcu/refscale.c to compare those:

AMD EPYC 9654 96-Core Processor, kernel baseline: v6.18.1
CONFIG_PREEMPT=y
# CONFIG_PREEMPT_LAZY is not set
# CONFIG_PREEMPT_RT is not set

* preempt disable/enable pair:                                     1.1 ns
* srcu-fast lock/unlock:                                           1.5 ns

CONFIG_RCU_REF_SCALE_TEST=y
* migrate disable/enable pair:                                     3.0 ns
* calls to migrate disable/enable pair within noinline functions: 17.0 ns

CONFIG_RCU_REF_SCALE_TEST=m
* migrate disable/enable pair:                                    22.0 ns

When I attempted using migrate disable, I configured refscale as
a module, which gave me the appalling 22 ns overhead. It looks like
the implementation of migrate disable/enable now differs depending on
whether it's used from the core kernel or from a module. That's rather
unexpected.

It seems to be done on purpose though (INSTANTIATE_EXPORTED_MIGRATE_DISABLE)
to work around the fact that it is not possible to export the runqueues
variable.

That's the kind of compilation context dependent overhead variability I'd
rather avoid in the implementation of the tracepoint instrumentation API.
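
For reference, the refscale side of such a comparison is tiny; the
read-side op would look roughly like this (an assumed sketch modeled
on the existing ops in kernel/rcu/refscale.c, not the actual
modification used for the numbers above):

	/* One migrate_disable()/migrate_enable() pair per loop
	 * iteration, timed by the refscale framework. */
	static void ref_migrate_section(const int nloops)
	{
		int i;

		for (i = nloops; i >= 0; i--) {
			migrate_disable();
			migrate_enable();
		}
	}
	/* Hooked up through a struct ref_scale_ops entry next to the
	 * existing rcu/srcu/refcnt ops and selected via the scale_type
	 * module parameter. */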

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Alexei Starovoitov 4 weeks, 1 day ago
On Fri, Jan 9, 2026 at 12:21 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
>
> * preempt disable/enable pair:                                     1.1 ns
> * srcu-fast lock/unlock:                                           1.5 ns
>
> CONFIG_RCU_REF_SCALE_TEST=y
> * migrate disable/enable pair:                                     3.0 ns

.. and you're arguing that 3ns vs 1ns difference is so important
for your out-of-tree tracer that in-tree tracers need to do
some workarounds?! wtf
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 4 weeks, 1 day ago
On Fri, 9 Jan 2026 13:54:34 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Jan 9, 2026 at 12:21 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> >
> > * preempt disable/enable pair:                                     1.1 ns
> > * srcu-fast lock/unlock:                                           1.5 ns
> >
> > CONFIG_RCU_REF_SCALE_TEST=y
> > * migrate disable/enable pair:                                     3.0 ns  
> 
> .. and you're arguing that 3ns vs 1ns difference is so important
> for your out-of-tree tracer that in-tree tracers need to do
> some workarounds?! wtf

This has nothing to do with out of tree tracers. The overhead of the
22ns is for any tracepoint in an in-tree module. That's because the
rq->nr_pinned isn't exported for modules to use.

-- Steve
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Alexei Starovoitov 4 weeks, 1 day ago
On Fri, Jan 9, 2026 at 2:00 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 9 Jan 2026 13:54:34 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > On Fri, Jan 9, 2026 at 12:21 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > >
> > >
> > > * preempt disable/enable pair:                                     1.1 ns
> > > * srcu-fast lock/unlock:                                           1.5 ns
> > >
> > > CONFIG_RCU_REF_SCALE_TEST=y
> > > * migrate disable/enable pair:                                     3.0 ns
> >
> > .. and you're arguing that 3ns vs 1ns difference is so important
> > for your out-of-tree tracer that in-tree tracers need to do
> > some workarounds?! wtf
>
> This has nothing to do with out of tree tracers. The overhead of the
> 22ns is for any tracepoint in an in-tree module. That's because the
> rq->nr_pinned isn't exported for modules to use.

None of the driver's tracepoints are in the critical path.
You perfectly know that Mathieu argued about not slowing down lttng.
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 4 weeks, 1 day ago
On Fri, 9 Jan 2026 14:18:36 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Jan 9, 2026 at 2:00 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Fri, 9 Jan 2026 13:54:34 -0800
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >  
> > > On Fri, Jan 9, 2026 at 12:21 PM Mathieu Desnoyers
> > > <mathieu.desnoyers@efficios.com> wrote:  
> > > >
> > > >
> > > > * preempt disable/enable pair:                                     1.1 ns
> > > > * srcu-fast lock/unlock:                                           1.5 ns
> > > >
> > > > CONFIG_RCU_REF_SCALE_TEST=y
> > > > * migrate disable/enable pair:                                     3.0 ns  
> > >
> > > .. and you're arguing that 3ns vs 1ns difference is so important
> > > for your out-of-tree tracer that in-tree tracers need to do
> > > some workarounds?! wtf  
> >
> > This has nothing to do with out of tree tracers. The overhead of the
> > 22ns is for any tracepoint in an in-tree module. That's because the
> > rq->nr_pinned isn't exported for modules to use.  
> 
> None of the driver's tracepoints are in the critical path.
> You perfectly know that Mathieu argued about not slowing down lttng.

How is this about lttng? Sure he cares about that, but even tracepoints
that lttng uses don't get affected any more than ftrace or bpf.
Because lttng is one of the callbacks. The migrate disable happens in
the in-tree portion of the code.

So you are saying that all the tracepoints for xfs are not in a fastpath?

-- Steve
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 4 weeks, 1 day ago
On Fri, 9 Jan 2026 17:33:26 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> How is this about lttng? Sure he cares about that, but even tracepoints
> that lttng uses don't get affected any more than ftrace or bpf.
> Because lttng is one of the callbacks. The migrate disable happens in
> the in-tree portion of the code.
> 
> So you are saying that all the tracepoints for xfs are not in a fastpath?

Regardless of tracing. I now have my RT hat on. The spin_locks that are
converted to mutex use migrate disable. The fact that migrate_disable
in modules is close to 10x slower than the same code in-kernel is
troubling to say the least. It means that modules in RT take a hit
every time they take a spin_lock().

The migrate disable being slow for modules is no longer just a tracing
issue. It's a PREEMPT_RT issue.

-- Steve
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Alexei Starovoitov 4 weeks ago
On Fri, Jan 9, 2026 at 2:39 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 9 Jan 2026 17:33:26 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > How is this about lttng? Sure he cares about that, but even tracepoints
> > that lttng uses don't get affected any more than ftrace or bpf.
> > Because lttng is one of the callbacks. The migrate disable happens in
> > the in-tree portion of the code.
> >
> > So you are saying that all the tracepoints for xfs are not in a fastpath?
>
> Regardless of tracing. I now have my RT hat on. The spin_locks that are
> converted to mutex use migrate disable. The fact that migrate_disable
> in modules is close to 10x slower than the same code in-kernel is
> troubling to say the least. It means that modules in RT take a hit
> every time they take a spin_lock().
>
> The migrate disable being slow for modules is no longer just a tracing
> issue. It's a PREEMPT_RT issue.

migrate_enable/disable() wasn't inlined for a long time.
It bothered us enough, since sleepable bpf is the main user
of it besides RT, so we made an effort to inline it.

RT, at the same time, doesn't inline rt_spin_lock() itself
so inlining migrate_disable() or not is not 10x at all.
Benchmark spin_lock on RT in-tree and in-module and I bet
there won't be a big difference.
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 4 weeks ago
On Fri, 9 Jan 2026 16:35:10 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> migrate_enable/disable() wasn't inlined for a long time.
> It bothered us enough, since sleepable bpf is the main user
> of it besides RT, so we made an effort to inline it.

It did bother us too. It went through lots of iterations to become more
efficient over the years (it was really bad in the beginning while
still in the rt-patch), and hopefully that will continue.

> 
> RT, at the same time, doesn't inline rt_spin_lock() itself
> so inlining migrate_disable() or not is not 10x at all.
> Benchmark spin_lock on RT in-tree and in-module and I bet
> there won't be a big difference.

I'll put that on my todo list. But still, having migrate_disable a
function for modules and 100% inlined for in-kernel code just because
it needs access to a field in the run queue that doesn't need to be in
the run queue seems like it should be fixed.

As for tracepoints, BPF is the only one that needs migrate disable.
It's not needed for ftrace or perf (although perf uses preempt
disable). It should be moved into the BPF callback code as perf has its
preempt disable in its callback code.

If BPF doesn't care about the extra overhead of migrate_disable() for
modules, then why should XFS suffer from that too?

-- Steve
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Alexei Starovoitov 3 weeks, 6 days ago
On Sat, Jan 10, 2026 at 8:14 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 9 Jan 2026 16:35:10 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > migrate_enable/disable() wasn't inlined for a long time.
> > It bothered us enough, since sleepable bpf is the main user
> > of it besides RT, so we made an effort to inline it.
>
> It did bother us too. It went through lots of iterations to become more
> efficient over the years (it was really bad in the beginning while
> still in the rt-patch), and hopefully that will continue.
>
> >
> > RT, at the same time, doesn't inline rt_spin_lock() itself
> > so inlining migrate_disable() or not is not 10x at all.
> > Benchmark spin_lock on RT in-tree and in-module and I bet
> > there won't be a big difference.
>
> I'll put that on my todo list. But still, having migrate_disable a
> function for modules and 100% inlined for in-kernel code just because
> it needs access to a field in the run queue that doesn't need to be in
> the run queue seems like it should be fixed.

There was plenty of discussion with Peter regarding different
ways to inline migrate_disable. What was landed was the best
option at that point, but feel free to restart the discussion.

>
> As for tracepoints, BPF is the only one that needs migrate disable.
> It's not needed for ftrace or perf (although perf uses preempt
> disable). It should be moved into the BPF callback code as perf has its
> preempt disable in its callback code.
>
> If BPF doesn't care about the extra overhead of migrate_disable() for
> modules, then why should XFS suffer from that too?

The diff has nothing to do with bpf needs and/or bpf internals.
It's really about being a good citizen of PREEMPT_RT.
bpf side already does migrate_disable,
rcu_read_lock, srcu_fast/task_trace when necessary.
Most of the time we don't rely on any external preempt state or rcu/srcu.
Removing guard(preempt_notrace)(); from tracepoint invocation
would be just fine for bpf. Simple remove will trigger bug
on cant_sleep(), but that's a trivial fix.
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 3 weeks, 6 days ago
On Sun, 11 Jan 2026 12:04:51 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> The diff has nothing to do with bpf needs and/or bpf internals.
> It's really about being a good citizen of PREEMPT_RT.
> bpf side already does migrate_disable,
> rcu_read_lock, srcu_fast/task_trace when necessary.
> Most of the time we don't rely on any external preempt state or rcu/srcu.
> Removing guard(preempt_notrace)(); from tracepoint invocation
> would be just fine for bpf. Simple remove will trigger bug
> on cant_sleep(), but that's a trivial fix.

Oh, so you are OK replacing the preempt_disable in the tracepoint
callbacks with fast SRCU? 

Then I guess we can simply do that. Would it be fine to do that for
both RT and non-RT? That will simplify the code quite a bit.

-- Steve
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Alexei Starovoitov 3 weeks, 5 days ago
On Sun, Jan 11, 2026 at 2:09 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sun, 11 Jan 2026 12:04:51 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > The diff has nothing to do with bpf needs and/or bpf internals.
> > It's really about being a good citizen of PREEMPT_RT.
> > bpf side already does migrate_disable,
> > rcu_read_lock, srcu_fast/task_trace when necessary.
> > Most of the time we don't rely on any external preempt state or rcu/srcu.
> > Removing guard(preempt_notrace)(); from tracepoint invocation
> > would be just fine for bpf. Simple remove will trigger bug
> > on cant_sleep(), but that's a trivial fix.
>
> Oh, so you are OK replacing the preempt_disable in the tracepoint
> callbacks with fast SRCU?

yes, but..

> Then I guess we can simply do that. Would it be fine to do that for
> both RT and non-RT? That will simplify the code quite a bit.

Agree. perf needs preempt_disable in their callbacks (as this patch does)
and bpf side needs to add migrate_disable in __bpf_trace_run for now.
Though syscall tracepoints are sleepable we don't take advantage of
that on the bpf side. Eventually we will, and then rcu_lock
inside __bpf_trace_run will become srcu_fast_lock.

The way to think about generic infrastructure like tracepoints is
to minimize their overhead no matter what out-of-tree and in-tree
users' assumptions are today, so why do we need preempt_disable
or srcu_fast there?
I think today it's there because all callbacks (perf, ftrace, bpf)
expect preemption to be disabled, but can we just remove it from tp side?
and move preempt_disable to callbacks that actually need it?

I'm looking at release_probes(). It's fine as-is, no?
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 3 weeks, 5 days ago
On Sun, 11 Jan 2026 15:38:38 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > Oh, so you are OK replacing the preempt_disable in the tracepoint
> > callbacks with fast SRCU?  
> 
> yes, but..
> 
> > Then I guess we can simply do that. Would it be fine to do that for
> > both RT and non-RT? That will simplify the code quite a bit.  
> 
> Agree. perf needs preempt_disable in their callbacks (as this patch does)
> and bpf side needs to add migrate_disable in __bpf_trace_run for now.
> Though syscall tracepoints are sleepable we don't take advantage of
> that on the bpf side. Eventually we will, and then rcu_lock
> inside __bpf_trace_run will become srcu_fast_lock.
> 
> The way to think about generic infrastructure like tracepoints is
> to minimize their overhead no matter what out-of-tree and in-tree
> users' assumptions are today, so why do we need preempt_disable
> or srcu_fast there?

Either preempt disable or srcu_fast is still needed.

> I think today it's there because all callbacks (perf, ftrace, bpf)
> expect preemption to be disabled, but can we just remove it from tp side?
> and move preempt_disable to callbacks that actually need it?

Yes if we are talking about switching from preempt_disable to srcu_fast.
No if you mean to remove both as it still needs RCU protection. It's
used for synchronizing changes in the tracepoint infrastructure itself.
That __DO_TRACE_CALL() macro is the guts of the tracepoint callback
code. It needs to handle races with additions and removals of callbacks
there, as the callbacks also get data passed to them. If it gets out of
sync, a callback could be called with another callback's data.

That's why it has:

                it_func_ptr =                                           \
                        rcu_dereference_raw((&__tracepoint_##name)->funcs);

> 
> I'm looking at release_probes(). It's fine as-is, no?

That's just freeing, and as you see, there's RCU synchronization
required.

Updates to tracepoints require RCU protection. It started out with
preempt_disable() for all tracepoints (which was synchronized with
synchronize_sched() which later became just synchronize_rcu()).

The issue that came about was that both ftrace and perf had an
assumption that its tracepoint callbacks always have preempt disabled
when being called. To move to srcu_fast() that is no longer the case.
And we need to do that for PREEMPT_RT if BPF is running very long
callbacks to the tracepoints.

Ftrace has been fixed to not require it, but still needs to take into
account if tracepoints disable preemption or not so that it can display
the preempt_count() of when the tracepoint was called correctly.

Perf is trivial to fix as it can simply add a preempt_disable() in its
handler.

But we were originally told that BPF had an issue because it had the
assumption that tracepoint callbacks wouldn't migrate. That's where
this patch came about.

Now if you are saying that BPF will handle migrate_disable() on its own
and not require the tracepoint infrastructure to do it for it, then
this is perfect. And I can then simplify this code, and just use
srcu_fast for both RT and !RT.

-- Steve
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Alexei Starovoitov 3 weeks, 5 days ago
On Mon, Jan 12, 2026 at 5:53 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sun, 11 Jan 2026 15:38:38 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > > Oh, so you are OK replacing the preempt_disable in the tracepoint
> > > callbacks with fast SRCU?
> >
> > yes, but..
> >
> > > Then I guess we can simply do that. Would it be fine to do that for
> > > both RT and non-RT? That will simplify the code quite a bit.
> >
> > Agree. perf needs preempt_disable in their callbacks (as this patch does)
> > and bpf side needs to add migrate_disable in __bpf_trace_run for now.
> > Though syscall tracepoints are sleepable we don't take advantage of
> > that on the bpf side. Eventually we will, and then rcu_lock
> > inside __bpf_trace_run will become srcu_fast_lock.
> >
> > The way to think about generic infrastructure like tracepoints is
> > to minimize their overhead no matter what out-of-tree and in-tree
> > users' assumptions are today, so why do we need preempt_disable
> > or srcu_fast there?
>
> Either preempt disable or srcu_fast is still needed.
>
> > I think today it's there because all callbacks (perf, ftrace, bpf)
> > expect preemption to be disabled, but can we just remove it from tp side?
> > and move preempt_disable to callbacks that actually need it?
>
> Yes if we are talking about switching from preempt_disable to srcu_fast.
> No if you mean to remove both as it still needs RCU protection. It's
> used for synchronizing changes in the tracepoint infrastructure itself.
> That __DO_TRACE_CALL() macro is the guts of the tracepoint callback
> code. It needs to handle races with additions and removals of callbacks
> there, as the callbacks also get data passed to them. If it gets out of
> sync, a callback could be called with another callback's data.
>
> That's why it has:
>
>                 it_func_ptr =                                           \
>                         rcu_dereference_raw((&__tracepoint_##name)->funcs);
>
> >
> > I'm looking at release_probes(). It's fine as-is, no?
>
> That's just freeing, and as you see, there's RCU synchronization
> required.
>
> Updates to tracepoints require RCU protection. It started out with
> preempt_disable() for all tracepoints (which was synchronized with
> synchronize_sched() which later became just synchronize_rcu()).

I see.

> The issue that came about was that both ftrace and perf had an
> assumption that their tracepoint callbacks always have preempt disabled
> when being called. To move to srcu_fast() that is no longer the case.
> And we need to do that for PREEMPT_RT if BPF is running very long
> callbacks to the tracepoints.
>
> Ftrace has been fixed to not require it, but still needs to take into
> account if tracepoints disable preemption or not so that it can display
> the preempt_count() of when the tracepoint was called correctly.
>
> Perf is trivial to fix as it can simply add a preempt_disable() in its
> handler.
>
> But we were originally told that BPF had an issue because it had the
> assumption that tracepoint callbacks wouldn't migrate. That's where
> this patch came about.
>
> Now if you are saying that BPF will handle migrate_disable() on its own
> and not require the tracepoint infrastructure to do it for it, then
> this is perfect. And I can then simplify this code, and just use
> srcu_fast for both RT and !RT.

Agree. Just add migrate_disable to __bpf_trace_run,
or, better yet, use rcu_read_lock_dont_migrate() in there.

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index fe28d86f7c35..abbf0177ad20 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2062,7 +2062,7 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
        struct bpf_run_ctx *old_run_ctx;
        struct bpf_trace_run_ctx run_ctx;

-       cant_sleep();
+       rcu_read_lock_dont_migrate();
        if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
                bpf_prog_inc_misses_counter(prog);
                goto out;
@@ -2071,13 +2071,12 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
        run_ctx.bpf_cookie = link->cookie;
        old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);

-       rcu_read_lock();
        (void) bpf_prog_run(prog, args);
-       rcu_read_unlock();

        bpf_reset_run_ctx(old_run_ctx);
 out:
        this_cpu_dec(*(prog->active));
+       rcu_read_unlock_migrate();
 }
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Sebastian Andrzej Siewior 3 weeks, 4 days ago
On 2026-01-12 09:19:58 [-0800], Alexei Starovoitov wrote:
> > Now if you are saying that BPF will handle migrate_disable() on its own
> > and not require the tracepoint infrastructure to do it for it, then
> > this is perfect. And I can then simplify this code, and just use
> > srcu_fast for both RT and !RT.
> 
> Agree. Just add migrate_disable to __bpf_trace_run,
> or, better yet, use rcu_read_lock_dont_migrate() in there.

Wonderful, thank you.

Is this "must remain on the same CPU and can be re-entrant" because BPF
core code such as the memory allocator and data structures use per-CPU data
structures and must use the same ones through the whole invocation?

I did audit the network-related BPF code, and its per-CPU usage usually
had a local_bh_disable() in the relevant spots.

Sebastian
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Alexei Starovoitov 3 weeks, 3 days ago
On Tue, Jan 13, 2026 at 6:23 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2026-01-12 09:19:58 [-0800], Alexei Starovoitov wrote:
> > > Now if you are saying that BPF will handle migrate_disable() on its own
> > > and not require the tracepoint infrastructure to do it for it, then
> > > this is perfect. And I can then simplify this code, and just use
> > > srcu_fast for both RT and !RT.
> >
> > Agree. Just add migrate_disable to __bpf_trace_run,
> > or, better yet, use rcu_read_lock_dont_migrate() in there.
>
> Wonderful, thank you.
>
> Is this "must remain on the same CPU and can be re-entrant" because BPF
> core code such memory allocator/ data structures use per-CPU data
> structures and must use the same through the whole invocation?

It's per-cpu maps.
htab_percpu_map_lookup_elem() returns a pointer to a per-CPU value
which needs to be valid for the duration of the program.
Not much else.
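
For illustration only (not code from this thread): a minimal BPF-side
sketch of that pattern, using a per-CPU array for simplicity; the same
lifetime rule applies to htab_percpu_map_lookup_elem(). The map and
program names are made up. The pointer returned by bpf_map_lookup_elem()
refers to the current CPU's slot, so it is only meaningful while the
program stays on that CPU.

#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} counts SEC(".maps");

SEC("raw_tracepoint/sched_switch")
int count_switches(void *ctx)
{
	__u32 key = 0;
	__u64 *val;

	/* Pointer into this CPU's slot of the per-CPU value. */
	val = bpf_map_lookup_elem(&counts, &key);
	if (val)
		(*val)++;	/* only valid while still on the same CPU */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";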
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Mathieu Desnoyers 4 weeks, 1 day ago
On 2026-01-09 17:18, Alexei Starovoitov wrote:
> On Fri, Jan 9, 2026 at 2:00 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> On Fri, 9 Jan 2026 13:54:34 -0800
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>
>>> On Fri, Jan 9, 2026 at 12:21 PM Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>
>>>>
>>>> * preempt disable/enable pair:                                     1.1 ns
>>>> * srcu-fast lock/unlock:                                           1.5 ns
>>>>
>>>> CONFIG_RCU_REF_SCALE_TEST=y
>>>> * migrate disable/enable pair:                                     3.0 ns
>>>
>>> .. and you're arguing that 3ns vs 1ns difference is so important
>>> for your out-of-tree tracer that in-tree tracers need to do
>>> some workarounds?! wtf
>>
>> This has nothing to do with out of tree tracers. The overhead of the
>> 22ns is for any tracepoint in an in-tree module. That's because the
>> rq->nr_pinned isn't exported for modules to use.
> 
> None of the drivers' tracepoints are in the critical path.
> You know perfectly well that Mathieu argued about not slowing down lttng.

My argument is about not slowing down high-throughput tracers
on preempt-rt kernels. This affects ftrace as well as lttng,
so it hits in-tree and OOT tracers alike.

Are you so sure that no tracepoint instrumentation whatsoever can
be compiled into a module which is on a critical path for some workload?
That's a bold statement.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 4 weeks, 1 day ago
On Fri, 9 Jan 2026 15:21:19 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> * preempt disable/enable pair:                                     1.1 ns
> * srcu-fast lock/unlock:                                           1.5 ns
> 
> CONFIG_RCU_REF_SCALE_TEST=y
> * migrate disable/enable pair:                                     3.0 ns
> * calls to migrate disable/enable pair within noinline functions: 17.0 ns
> 
> CONFIG_RCU_REF_SCALE_TEST=m
> * migrate disable/enable pair:                                    22.0 ns

OUCH! So migrate disable/enable has a much larger overhead when executed in
a module than in the kernel? This means all spin_locks() in modules
converted to mutexes in PREEMPT_RT are taking this hit!

It looks like it has to allow access to rq->nr_pinned. There's a hack to
expose this part of the rq struct to in-kernel code via the following:

kernel/sched/rq-offsets.c:      DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));

Then for the in-kernel code we have:

#define this_rq_raw() arch_raw_cpu_ptr(&runqueues)
#else
#define this_rq_raw() PERCPU_PTR(&runqueues)
#endif
#define this_rq_pinned() (*(unsigned int *)((void *)this_rq_raw() + RQ_nr_pinned))

Looking at the scheduler code, rq->nr_pinned is referenced by a static
function:

static inline bool rq_has_pinned_tasks(struct rq *rq)
{
	return rq->nr_pinned;
}

That function is only referenced in hotplug code and in a balance_push()
path in load balancing. Does this variable really need to be in the
runqueue struct?

Why not just make it a per-cpu variable? Maybe call it cpu_nr_pinned_tasks
and export that for all to use.

That would not only fix the discrepancy between the overhead of
migrate_disable/enable in modules vs. in the kernel, but also remove the
hack of exposing a portion of the runqueue struct.
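
For illustration, a rough sketch of that proposal (not an actual patch;
the variable name follows the suggestion above):

/* kernel/sched/core.c, sketch only */
DEFINE_PER_CPU(unsigned int, cpu_nr_pinned_tasks);
EXPORT_PER_CPU_SYMBOL_GPL(cpu_nr_pinned_tasks);

The migrate_disable()/migrate_enable() fast paths could then use
this_cpu_inc()/this_cpu_dec() on it from modules and the core kernel
alike, without the rq-offsets trick.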

-- Steve
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Peter Zijlstra 3 weeks, 5 days ago
On Fri, Jan 09, 2026 at 04:02:02PM -0500, Steven Rostedt wrote:
> On Fri, 9 Jan 2026 15:21:19 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > * preempt disable/enable pair:                                     1.1 ns
> > * srcu-fast lock/unlock:                                           1.5 ns
> > 
> > CONFIG_RCU_REF_SCALE_TEST=y
> > * migrate disable/enable pair:                                     3.0 ns
> > * calls to migrate disable/enable pair within noinline functions: 17.0 ns
> > 
> > CONFIG_RCU_REF_SCALE_TEST=m
> > * migrate disable/enable pair:                                    22.0 ns
> 
> OUCH! So migrate disable/enable has a much larger overhead when executed in
> a module than in the kernel? This means all spin_locks() in modules
> converted to mutexes in PREEMPT_RT are taking this hit!

Not so, the migrate_disable() for PREEMPT_RT is still in core code --
kernel/locking/spinlock_rt.c is very much not built as a module.
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 3 weeks, 5 days ago
On Mon, 12 Jan 2026 16:31:28 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > OUCH! So migrate disable/enable has a much larger overhead when executed in
> > a module than in the kernel? This means all spin_locks() in modules
> > converted to mutexes in PREEMPT_RT are taking this hit!  
> 
> Not so, the migrate_disable() for PREEMPT_RT is still in core code --
> kernel/locking/spinlock_rt.c is very much not build as a module.

True. But still, wouldn't it be cleaner to have that variable separate from
the run queue and make the code a bit simpler?

And now it doesn't look like tracing will even need it, as it appears that
only BPF would need it. So this would just be a cleanup.

-- Steve
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Peter Zijlstra 3 weeks, 5 days ago
On Mon, Jan 12, 2026 at 10:36:12AM -0500, Steven Rostedt wrote:
> On Mon, 12 Jan 2026 16:31:28 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > OUCH! So migrate disable/enable has a much larger overhead when executed in
> > > a module than in the kernel? This means all spin_locks() in modules
> > > converted to mutexes in PREEMPT_RT are taking this hit!  
> > 
> > Not so, the migrate_disable() for PREEMPT_RT is still in core code --
> > kernel/locking/spinlock_rt.c is very much not build as a module.
> 
> True. But still, wouldn't it be cleaner to have that variable separate from
> the run queue and make the code a bit simpler?

I still don't like exporting variables either way around. Exporting
functions is much saner.
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Menglong Dong 3 weeks, 5 days ago
On 2026/1/10 05:02 Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 9 Jan 2026 15:21:19 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > * preempt disable/enable pair:                                     1.1 ns
> > * srcu-fast lock/unlock:                                           1.5 ns
> > 
> > CONFIG_RCU_REF_SCALE_TEST=y
> > * migrate disable/enable pair:                                     3.0 ns
> > * calls to migrate disable/enable pair within noinline functions: 17.0 ns
> > 
> > CONFIG_RCU_REF_SCALE_TEST=m
> > * migrate disable/enable pair:                                    22.0 ns
> 
> OUCH! So migrate disable/enable has a much larger overhead when executed in
> a module than in the kernel? This means all spin_locks() in modules
> converted to mutexes in PREEMPT_RT are taking this hit!
> 
> It looks like it has to allow access to the rq->nr_pinned. There's a hack to
> expose this part of the rq struct for in-kernel by the following:
> 
> kernel/sched/rq-offsets.c:      DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));
> 
> Then for the in-kernel code we have:
> 
> #define this_rq_raw() arch_raw_cpu_ptr(&runqueues)
> #else
> #define this_rq_raw() PERCPU_PTR(&runqueues)
> #endif
> #define this_rq_pinned() (*(unsigned int *)((void *)this_rq_raw() + RQ_nr_pinned))
> 
> Looking at the scheduler code, the rq->nr_pinned is referenced by a static
> function with:
> 
> static inline bool rq_has_pinned_tasks(struct rq *rq)
> {
> 	return rq->nr_pinned;
> }
> 
> Which is only referenced in hotplug code and a balance_push() path in load
> balancing. Does this variable really need to be in the runqueue struct?
> 
> Why not just make it a per-cpu variable. Maybe call it cpu_nr_pinned_tasks,
> and export that for all to use?
> 
> It will not only fix the discrepancy between the overhead of
> migrate_disable/enable in modules vs in-kernel. But it also removes the
> hack to expose a portion of the runqueue.

I think it's a good idea to factor out the "nr_pinned" from struct rq.
The current approach of inlining migrate_disable() is a little
obscure. The initial purpose of inlining migrate_disable() was to optimize
the performance of the bpf trampoline, so modules were not considered.

As you said, rq_has_pinned_tasks() is the only place that uses
nr_pinned, apart from migrate_disable()/migrate_enable(). After more
analysis, I think maybe we can do it this way:

DEFINE_PER_CPU_SHARED_ALIGNED(int, cpu_nr_pinned_tasks);

And change rq_has_pinned_tasks() to:
static inline bool rq_has_pinned_tasks(struct rq *rq)
{
	return *per_cpu_ptr(&cpu_nr_pinned_tasks, rq->cpu);
}

The "rq" in rq_has_pinned_tasks() may come from other CPU, so we
can't use "return this_cpu_read(cpu_nr_pinned_tasks)" directly.
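
For illustration, a simplified sketch of how migrate_disable() might look
with that per-CPU counter (the real function handles more than this; it
is not a patch):

void migrate_disable(void)
{
	struct task_struct *p = current;

	if (p->migration_disabled) {
		p->migration_disabled++;
		return;
	}

	preempt_disable();
	this_cpu_inc(cpu_nr_pinned_tasks);	/* was: this_rq()->nr_pinned++ */
	p->migration_disabled = 1;
	preempt_enable();
}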

Thanks!
Menglong Dong

> 
> -- Steve
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Steven Rostedt 4 weeks, 1 day ago
On Fri, 9 Jan 2026 09:40:17 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2026-01-08 22:05, Steven Rostedt wrote:
> > From: "Paul E. McKenney" <paulmck@kernel.org>  
> [...]
> 
> I disagree with many elements of the proposed approach.
> 
> On one end we have BPF wanting to hook on arbitrary tracepoints without
> adding significant latency to PREEMPT RT kernels.
> 
> One the other hand, we have high-speed tracers which execute very short
> critical sections to serialize trace data into ring buffers.
> 
> All of those users register to the tracepoint API.
> 
> We also have to consider that migrate disable is *not* cheap at all
> compared to preempt disable.

To be fair, every spin_lock() converted into a mutex in PREEMPT_RT now
calls migrate_disable() instead of preempt_disable(). I'm just saying the
overhead of migrate_disable() in PREEMPT_RT is not limited to tracepoints.

> 
> So I'm wondering why all tracepoint users need to pay the migrate
> disable runtime overhead on preempt RT kernels for the sake of BPF ?

We could argue that it is to keep the same paradigm as non-RT, where
the code expects to stay on the same CPU. This is why we needed to add it
to the spin_lock() code. Only a few places in the kernel expect spin_lock()
to pin the current task to its CPU, but because of those few cases, we
needed to make all callers of spin_lock() call migrate_disable() :-/

> 
> Using SRCU-fast to protect tracepoint callback iteration makes sense
> for preempt-rt, but I'd recommend moving the migrate disable guard
> within the bpf callback code rather than slowing down other tracers
> which execute within a short amount of time. Other tracers can then
> choose to disable preemption rather than migration if that's a better
> fit for their needs.

This is a discussion with the BPF folks.

> 
> > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> > index 3690221ba3d8..a2704c35eda8 100644
> > --- a/include/linux/trace_events.h
> > +++ b/include/linux/trace_events.h
> > @@ -222,6 +222,26 @@ static inline unsigned int tracing_gen_ctx_dec(void)
> >   	return trace_ctx;
> >   }
> >   
> > +/*
> > + * When PREEMPT_RT is enabled, trace events are called with disabled
> > + * migration. The trace events need to know if the tracepoint disabled
> > + * migration or not so that what is recorded to the ring buffer shows
> > + * the state of when the trace event triggered, and not the state caused
> > + * by the trace event.
> > + */
> > +#ifdef CONFIG_PREEMPT_RT
> > +static inline unsigned int tracing_gen_ctx_dec_cond(void)
> > +{
> > +	unsigned int trace_ctx;
> > +
> > +	trace_ctx = tracing_gen_ctx_dec();
> > +	/* The migration counter starts at bit 4 */
> > +	return trace_ctx - (1 << 4);  
> 
> We should turn this hardcoded "4" value into an enum label or a
> define. That define should be exposed by tracepoint.h. We should
> not hardcode expectations about the implementation of distinct APIs
> across the tracing subsystem.

This is exposed to user space already, so that 4 will never change. And
this is specifically for "trace events", which are what get attached to
tracepoints. No other tracepoint caller needs to know about this "4". This
value goes into the common_preempt_count of the event. libtraceevent
already parses this.

I have no problem making this an enum (or define) and using it here and
where it is set in trace.c:tracing_gen_ctx_irq_test(). But it belongs in
trace_event.h not tracepoint.h.

I can call it TRACE_MIGRATION_SHIFT

#define TRACE_MIGRATION_SHIFT	4
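
For illustration, the helper quoted above would then presumably read:

static inline unsigned int tracing_gen_ctx_dec_cond(void)
{
	unsigned int trace_ctx;

	trace_ctx = tracing_gen_ctx_dec();
	/* The migration counter starts at bit TRACE_MIGRATION_SHIFT */
	return trace_ctx - (1 << TRACE_MIGRATION_SHIFT);
}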

> 
> [...]
> 
> > --- a/include/linux/tracepoint.h
> > +++ b/include/linux/tracepoint.h
> > @@ -100,6 +100,25 @@ void for_each_tracepoint_in_module(struct module *mod,
> >   }
> >   #endif /* CONFIG_MODULES */
> >   
> > +/*
> > + * BPF programs can attach to the tracepoint callbacks. But if the
> > + * callbacks are called with preemption disabled, the BPF programs
> > + * can cause quite a bit of latency. When PREEMPT_RT is enabled,
> > + * instead of disabling preemption, use srcu_fast_notrace() for
> > + * synchronization. As BPF programs that are attached to tracepoints
> > + * expect to stay on the same CPU, also disable migration.
> > + */
> > +#ifdef CONFIG_PREEMPT_RT
> > +extern struct srcu_struct tracepoint_srcu;
> > +# define tracepoint_sync() synchronize_srcu(&tracepoint_srcu);
> > +# define tracepoint_guard()				\
> > +	guard(srcu_fast_notrace)(&tracepoint_srcu);	\
> > +	guard(migrate)()
> > +#else
> > +# define tracepoint_sync() synchronize_rcu();
> > +# define tracepoint_guard() guard(preempt_notrace)()
> > +#endif  
> 
> Doing migrate disable on PREEMPT RT for BPF vs preempt disable in other
> tracers should come in a separate preparation patch. It belongs to the
> tracers, not to tracepoints.

That's fair.

> 
> [...]
> > diff --git a/include/trace/perf.h b/include/trace/perf.h
> > index a1754b73a8f5..348ad1d9b556 100644
> > --- a/include/trace/perf.h
> > +++ b/include/trace/perf.h
> > @@ -71,6 +71,7 @@ perf_trace_##call(void *__data, proto)					\
> >   	u64 __count __attribute__((unused));				\
> >   	struct task_struct *__task __attribute__((unused));		\
> >   									\
> > +	guard(preempt_notrace)();					\
> >   	do_perf_trace_##call(__data, args);				\
> >   }
> >   
> > @@ -85,9 +86,8 @@ perf_trace_##call(void *__data, proto)					\
> >   	struct task_struct *__task __attribute__((unused));		\
> >   									\
> >   	might_fault();							\
> > -	preempt_disable_notrace();					\
> > +	guard(preempt_notrace)();					\
> >   	do_perf_trace_##call(__data, args);				\
> > -	preempt_enable_notrace();	  
> 
> Move this to a perf-specific preparation patch.

Sure.

> >   }
> >   
> >   /*
> > diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
> > index 4f22136fd465..6fb58387e9f1 100644
> > --- a/include/trace/trace_events.h
> > +++ b/include/trace/trace_events.h
> > @@ -429,6 +429,22 @@ do_trace_event_raw_event_##call(void *__data, proto)			\
> >   	trace_event_buffer_commit(&fbuffer);				\
> >   }
> >   
> > +/*
> > + * When PREEMPT_RT is enabled, the tracepoint does not disable preemption
> > + * but instead disables migration. The callbacks for the trace events
> > + * need to have a consistent state so that it can reflect the proper
> > + * preempt_disabled counter.  
> 
> Having those defines within trace_events.h is poking holes within any
> hope of abstraction we can have from the tracepoint.h API. This adds
> strong coupling between tracepoint and trace_event.h.

OK, I see you are worried about the coupling between the behavior of
tracepoint.h and trace_event.h.

> 
> Rather than hardcoding preempt counter expectations across tracepoint
> and trace-events, we should expose a #define in tracepoint.h which
> will make the preempt counter nesting level available to other
> parts of the kernel such as trace_events.h. This way we keep everything
> in one place and we don't add cross-references about subtle preempt
> counter nesting level details.

OK, so how do we pass this information from tracepoint.h to the users? I
hate to add another field to task_struct for this.

> 
> > + */
> > +#ifdef CONFIG_PREEMPT_RT
> > +/* disable preemption for RT so that the counters still match */
> > +# define trace_event_guard() guard(preempt_notrace)()
> > +/* Have syscalls up the migrate disable counter to emulate non-syscalls */
> > +# define trace_syscall_event_guard() guard(migrate)()
> > +#else
> > +# define trace_event_guard()
> > +# define trace_syscall_event_guard()
> > +#endif  
> This should be moved to separate tracer-specific prep patches.
> 
> [...]
> 
> > + * The @trace_file is the desrciptor with information about the status  
> 
> descriptor

Oops.

> 
> [...]
> 
> > + *
> > + * Returns a pointer to the data on the ring buffer or NULL if the
> > + *   event was not reserved (event was filtered, too big, or the buffer
> > + *   simply was disabled for write).  
> 
> odd spaces here.

You mean the indentation?  I could indent it more and also add a colon:

 * Returns: A pointer to the data on the ring buffer or NULL if the
 *          event was not reserved (event was filtered, too big, or the
 *          buffer simply was disabled for write).  

Would that work better?

> 
> [...]
> 
> >   
> > +#ifdef CONFIG_PREEMPT_RT
> > +static void srcu_free_old_probes(struct rcu_head *head)
> > +{
> > +	kfree(container_of(head, struct tp_probes, rcu));
> > +}
> > +
> > +static void rcu_free_old_probes(struct rcu_head *head)
> > +{
> > +	call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> > +}
> > +#else
> >   static void rcu_free_old_probes(struct rcu_head *head)
> >   {
> >   	kfree(container_of(head, struct tp_probes, rcu));
> >   }
> > +#endif
> >   
> >   static inline void release_probes(struct tracepoint *tp, struct tracepoint_func *old)
> >   {
> > @@ -112,6 +149,13 @@ static inline void release_probes(struct tracepoint *tp, struct tracepoint_func
> >   		struct tp_probes *tp_probes = container_of(old,
> >   			struct tp_probes, probes[0]);
> >   
> > +		/*
> > +		 * Tracepoint probes are protected by either RCU or
> > +		 * Tasks Trace RCU and also by SRCU.  By calling the SRCU  
> 
> I'm confused.
> 
> My understanding is that in !RT we have:
> 
> - RCU (!tracepoint_is_faultable(tp))
> - RCU tasks trace (tracepoint_is_faultable(tp))
> 
> And for RT:
> 
> - SRCU-fast (!tracepoint_is_faultable(tp))
> - RCU tasks trace (tracepoint_is_faultable(tp))
> 
> So I don't understand this comment, and also I don't understand why we
> need to chain the callbacks rather than just call the appropriate
> call_rcu based on the tracepoint "is_faultable" state.
> 
> What am I missing ?

Ah, you are right. I think this was the result of trying different ways of
synchronization. The non-faultable version should be wrapped so it uses
either normal RCU or SRCU.
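
For illustration, that wrapping could mirror the tracepoint_sync() define
from the patch (the tracepoint_call_rcu() name here is made up):

#ifdef CONFIG_PREEMPT_RT
# define tracepoint_call_rcu(head, func) call_srcu(&tracepoint_srcu, head, func)
#else
# define tracepoint_call_rcu(head, func) call_rcu(head, func)
#endif

so that release_probes() can simply do:

		if (tracepoint_is_faultable(tp))
			call_rcu_tasks_trace(&tp_probes->rcu, rcu_free_old_probes);
		else
			tracepoint_call_rcu(&tp_probes->rcu, rcu_free_old_probes);

with rcu_free_old_probes() going back to a plain kfree() in both cases.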

I can send a new version, or do we want to wait and see if the BPF folks
have a better idea about the "migrate disable" issue?

Thanks for the review.

-- Steve


> 
> > +		 * callback in the [Tasks Trace] RCU callback we cover
> > +		 * both cases. So let us chain the SRCU and [Tasks Trace]
> > +		 * RCU callbacks to wait for both grace periods.
> > +		 */
> >   		if (tracepoint_is_faultable(tp))
> >   			call_rcu_tasks_trace(&tp_probes->rcu, rcu_free_old_probes);
> >   		else  
> 
> Thanks,
> 
> Mathieu
>
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Mathieu Desnoyers 4 weeks, 1 day ago
On 2026-01-09 12:21, Steven Rostedt wrote:
> On Fri, 9 Jan 2026 09:40:17 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> On 2026-01-08 22:05, Steven Rostedt wrote:
>>> From: "Paul E. McKenney" <paulmck@kernel.org>
>> [...]
>>
>> I disagree with many elements of the proposed approach.
>>
>> On one end we have BPF wanting to hook on arbitrary tracepoints without
>> adding significant latency to PREEMPT RT kernels.
>>
>> One the other hand, we have high-speed tracers which execute very short
>> critical sections to serialize trace data into ring buffers.
>>
>> All of those users register to the tracepoint API.
>>
>> We also have to consider that migrate disable is *not* cheap at all
>> compared to preempt disable.
> 
> To be fair, every spin_lock() converted into a mutex in PREEMPT_RT now
> calls migrate_disable() instead of preempt_disable(). I'm just saying the
> overhead of migrate_disable() in PREEMPT_RT is not limited to tracepoints.

That's a fair point. That being said, the kernel code which relies on
spin locks for fast-paths tends to use raw spinlocks (e.g. scheduler),
which AFAIU does not need migrate disable because those still disable
preemption even on preempt RT.

> 
>>
>> So I'm wondering why all tracepoint users need to pay the migrate
>> disable runtime overhead on preempt RT kernels for the sake of BPF ?
> 
> We could argue that it is to keep the same paradigm as non RT. Where
> the code expects to stay on the same CPU. This is why we needed to add it
> to spin_lock() code. Only a few places in the kernel expect spin_lock() to
> pin the current task on the CPU, but because of those few cases, we needed
> to make all callers of spin_lock() call migrate disable :-/

For tracepoints we have a handful of tracer users, so I don't consider
the scope of the audit to be at the same scale as spinlocks. Changing
each tracer user to either disable preemption or migration if they need
it on preempt-RT is fairly straightforward. And if tracers don't need
to stay on a single CPU, that's even better.

I'm also inclined to consider moving !RT kernels to srcu-fast rather
than RCU to make RT and !RT more alike. The performance of
srcu-fast read-side is excellent, at least on my EPYC test machine.

This would open the door to relaxing the tracer callback requirements on
preempt-off/migrate-off if their callback implementations can deal with
it.

> 
>>
>> Using SRCU-fast to protect tracepoint callback iteration makes sense
>> for preempt-rt, but I'd recommend moving the migrate disable guard
>> within the bpf callback code rather than slowing down other tracers
>> which execute within a short amount of time. Other tracers can then
>> choose to disable preemption rather than migration if that's a better
>> fit for their needs.
> 
> This is a discussion with the BPF folks.

FWIW, the approach I'm proposing would be similar to what I've done for
faultable syscall tracepoints.

[...]
>>> +
>>> +	trace_ctx = tracing_gen_ctx_dec();
>>> +	/* The migration counter starts at bit 4 */
>>> +	return trace_ctx - (1 << 4);
>>
>> We should turn this hardcoded "4" value into an enum label or a
>> define. That define should be exposed by tracepoint.h. We should
>> not hardcode expectations about the implementation of distinct APIs
>> across the tracing subsystem.
> 
> This is exposed to user space already, so that 4 will never change. And
> this is specifically for "trace events" which are what are attached to
> tracepoints. No other tracepoint caller needs to know about this "4". This
> value goes into the common_preempt_count of the event. libtraceevent
> already parses this.
> 
> I have no problem making this an enum (or define) and using it here and
> where it is set in trace.c:tracing_gen_ctx_irq_test(). But it belongs in
> trace_event.h not tracepoint.h.
> 
> I can call it TRACE_MIGRATION_SHIFT
> 
> #define TRACE_MIGRATION_SHIFT	4

I'm OK with that.

[...]

>>>    }
>>>    
>>>    /*
>>> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
>>> index 4f22136fd465..6fb58387e9f1 100644
>>> --- a/include/trace/trace_events.h
>>> +++ b/include/trace/trace_events.h
>>> @@ -429,6 +429,22 @@ do_trace_event_raw_event_##call(void *__data, proto)			\
>>>    	trace_event_buffer_commit(&fbuffer);				\
>>>    }
>>>    
>>> +/*
>>> + * When PREEMPT_RT is enabled, the tracepoint does not disable preemption
>>> + * but instead disables migration. The callbacks for the trace events
>>> + * need to have a consistent state so that it can reflect the proper
>>> + * preempt_disabled counter.
>>
>> Having those defines within trace_events.h is poking holes within any
>> hope of abstraction we can have from the tracepoint.h API. This adds
>> strong coupling between tracepoint and trace_event.h.
> 
> OK, I see you are worried about the coupling between the behavior of
> tracepoint.h and trace_event.h.
> 
>>
>> Rather than hardcoding preempt counter expectations across tracepoint
>> and trace-events, we should expose a #define in tracepoint.h which
>> will make the preempt counter nesting level available to other
>> parts of the kernel such as trace_events.h. This way we keep everything
>> in one place and we don't add cross-references about subtle preempt
>> counter nesting level details.
> 
> OK, so how do we pass this information from tracepoint.h to the users? I
> hate to add another field to task_struct for this.

This could be as simple as adding something akin to

/*
  * Level of preempt disable nesting between tracepoint caller code and
  * tracer callback.
  */
#ifdef CONFIG_PREEMPT_RT
# define TRACEPOINT_PREEMPT_DISABLE_NESTING	0
#else
# define TRACEPOINT_PREEMPT_DISABLE_NESTING	1
#endif

to tracepoint.h and then using it from trace_event.h.
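
For illustration, trace_events.h could then key off that define instead
of checking CONFIG_PREEMPT_RT directly; a sketch reusing the
trace_event_guard() from the patch:

#if TRACEPOINT_PREEMPT_DISABLE_NESTING == 0
/* The tracepoint left preemption enabled; take it here for the event. */
# define trace_event_guard() guard(preempt_notrace)()
#else
# define trace_event_guard()
#endif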

> 
>>
>> [...]
>>
>>> + *
>>> + * Returns a pointer to the data on the ring buffer or NULL if the
>>> + *   event was not reserved (event was filtered, too big, or the buffer
>>> + *   simply was disabled for write).
>>
>> odd spaces here.
> 
> You mean the indentation?  I could add it more and also a colon:
> 
>   * Returns: A pointer to the data on the ring buffer or NULL if the
>   *          event was not reserved (event was filtered, too big, or the
>   *          buffer simply was disabled for write).
> 
> Would that work better?

Yes.

[...]

>>
>> So I don't understand this comment, and also I don't understand why we
>> need to chain the callbacks rather than just call the appropriate
>> call_rcu based on the tracepoint "is_faultable" state.
>>
>> What am I missing ?
> 
> Ah, you are right. I think this was the result of trying different ways of
> synchronization. The non-faultable version should be wrapped to either call
> normal RCU synchronization or SRCU synchronization.

Yes. Or we switch even !RT kernels' non-faultable tracepoints to
srcu-fast, considering its low overhead. This could come as a future
step though.

> I can send a new version, or do we want to wait if the BPF folks have a
> better idea about the "migrate disable" issue?

I'd recommend going ahead and moving the migrate disable into the BPF
callback, similarly to what I did for the faultable syscall tracepoints,
and letting the bpf folks comment on the resulting patch. It's not that
much code, and showing the change will ground the discussion in something
more tangible.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v5] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
Posted by Sebastian Andrzej Siewior 3 weeks, 4 days ago
On 2026-01-09 13:58:21 [-0500], Mathieu Desnoyers wrote:
> On 2026-01-09 12:21, Steven Rostedt wrote:
> > > Using SRCU-fast to protect tracepoint callback iteration makes sense
> > > for preempt-rt, but I'd recommend moving the migrate disable guard
> > > within the bpf callback code rather than slowing down other tracers
> > > which execute within a short amount of time. Other tracers can then
> > > choose to disable preemption rather than migration if that's a better
> > > fit for their needs.
> > 
> > This is a discussion with the BPF folks.
> 
> FWIW, the approach I'm proposing would be similar to what I've done for
> faultable syscall tracepoints.

I just started reading this thread, but we could limit the
migrate_disable() to only the BPF callback part so it does not affect
the whole tracepoint if there is no BPF program attached to it.

Sebastian