When expressing RCU Tasks Trace in terms of SRCU-fast, it was
necessary to keep a nesting count and per-CPU srcu_ctr structure
pointer in the task_struct structure, which is slow to access.
An alternative is to instead provide rcu_read_lock_tasks_trace() and
rcu_read_unlock_tasks_trace() APIs that match the underlying SRCU-fast
semantics, thus avoiding the task_struct accesses.
When all callers have switched to the new API, the previous
rcu_read_lock_trace() and rcu_read_unlock_trace() APIs will be removed.
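For example, converting a caller is expected to look roughly like the
following sketch, in which do_something_protected() is a stand-in for
the caller's existing read-side code:

	struct srcu_ctr __percpu *scp;

	/* Old API: nesting count and pointer stored in task_struct. */
	rcu_read_lock_trace();
	do_something_protected();
	rcu_read_unlock_trace();

	/* New API: the caller carries the per-CPU srcu_ctr pointer. */
	scp = rcu_read_lock_tasks_trace();
	do_something_protected();
	rcu_read_unlock_tasks_trace(scp);
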
The rcu_read_{,un}lock_{,tasks_}trace() functions need to use smp_mb()
only if invoked where RCU is not watching, that is, from locations where
a call to rcu_is_watching() would return false. In architectures that
define the ARCH_WANTS_NO_INSTR Kconfig option, use of noinstr and friends
ensures that tracing happens only where RCU is watching, so those
architectures can dispense entirely with the read-side calls to smp_mb().
Other architectures include these read-side calls by default. However,
many installations have a larger-than-average tolerance for risk,
prohibit removal of tracing from a running system, or require careful
review and approval before tracing may be removed. Such installations can
build their kernels with CONFIG_TASKS_TRACE_RCU_NO_MB=y to avoid those
read-side calls to smp_mb(), thus accepting responsibility for run-time
removal of tracing from code regions that RCU is not watching.
Those wishing to disable read-side memory barriers for an entire
architecture can select this TASKS_TRACE_RCU_NO_MB Kconfig option,
hence the polarity.
[ paulmck: Apply Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <bpf@vger.kernel.org>
---
include/linux/rcupdate_trace.h | 65 +++++++++++++++++++++++++++++-----
kernel/rcu/Kconfig | 23 ++++++++++++
kernel/rcu/tasks.h | 7 +++-
3 files changed, 86 insertions(+), 9 deletions(-)
diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h
index 0bd47f12ecd17b..f47ba9c074601c 100644
--- a/include/linux/rcupdate_trace.h
+++ b/include/linux/rcupdate_trace.h
@@ -34,6 +34,53 @@ static inline int rcu_read_lock_trace_held(void)
#ifdef CONFIG_TASKS_TRACE_RCU
+/**
+ * rcu_read_lock_tasks_trace - mark beginning of RCU-trace read-side critical section
+ *
+ * When synchronize_rcu_tasks_trace() is invoked by one task, then that
+ * task is guaranteed to block until all other tasks exit their read-side
+ * critical sections. Similarly, if call_rcu_tasks_trace() is invoked by
+ * one task while other tasks are within RCU read-side critical sections,
+ * invocation of the corresponding RCU callback is deferred until after
+ * all the other tasks have exited their critical sections.
+ *
+ * For more details, please see the documentation for
+ * srcu_read_lock_fast(). For a description of how implicit RCU
+ * readers provide the needed ordering for architectures defining the
+ * ARCH_WANTS_NO_INSTR Kconfig option (and thus promising never to trace
+ * code where RCU is not watching), please see the __srcu_read_lock_fast()
+ * (non-kerneldoc) header comment. Otherwise, the smp_mb() below provides
+ * the needed ordering.
+ */
+static inline struct srcu_ctr __percpu *rcu_read_lock_tasks_trace(void)
+{
+ struct srcu_ctr __percpu *ret = __srcu_read_lock_fast(&rcu_tasks_trace_srcu_struct);
+
+ rcu_try_lock_acquire(&rcu_tasks_trace_srcu_struct.dep_map);
+ if (!IS_ENABLED(CONFIG_TASKS_TRACE_RCU_NO_MB))
+ smp_mb(); // Provide ordering on noinstr-incomplete architectures.
+ return ret;
+}
+
+/**
+ * rcu_read_unlock_tasks_trace - mark end of RCU-trace read-side critical section
+ * @scp: return value from corresponding rcu_read_lock_tasks_trace().
+ *
+ * Pairs with the preceding call to rcu_read_lock_tasks_trace() that
+ * returned the value passed in via scp.
+ *
+ * For more details, please see the documentation for rcu_read_unlock().
+ * For memory-ordering information, please see the header comment for the
+ * rcu_read_lock_tasks_trace() function.
+ */
+static inline void rcu_read_unlock_tasks_trace(struct srcu_ctr __percpu *scp)
+{
+ if (!IS_ENABLED(CONFIG_TASKS_TRACE_RCU_NO_MB))
+ smp_mb(); // Provide ordering on noinstr-incomplete architectures.
+ __srcu_read_unlock_fast(&rcu_tasks_trace_srcu_struct, scp);
+ srcu_lock_release(&rcu_tasks_trace_srcu_struct.dep_map);
+}
+
/**
* rcu_read_lock_trace - mark beginning of RCU-trace read-side critical section
*
@@ -50,14 +97,15 @@ static inline void rcu_read_lock_trace(void)
{
struct task_struct *t = current;
+ rcu_try_lock_acquire(&rcu_tasks_trace_srcu_struct.dep_map);
if (t->trc_reader_nesting++) {
// In case we interrupted a Tasks Trace RCU reader.
- rcu_try_lock_acquire(&rcu_tasks_trace_srcu_struct.dep_map);
return;
}
barrier(); // nesting before scp to protect against interrupt handler.
- t->trc_reader_scp = srcu_read_lock_fast(&rcu_tasks_trace_srcu_struct);
- smp_mb(); // Placeholder for more selective ordering
+ t->trc_reader_scp = __srcu_read_lock_fast(&rcu_tasks_trace_srcu_struct);
+ if (!IS_ENABLED(CONFIG_TASKS_TRACE_RCU_NO_MB))
+ smp_mb(); // Placeholder for more selective ordering
}
/**
@@ -74,13 +122,14 @@ static inline void rcu_read_unlock_trace(void)
struct srcu_ctr __percpu *scp;
struct task_struct *t = current;
- smp_mb(); // Placeholder for more selective ordering
scp = t->trc_reader_scp;
barrier(); // scp before nesting to protect against interrupt handler.
- if (!--t->trc_reader_nesting)
- srcu_read_unlock_fast(&rcu_tasks_trace_srcu_struct, scp);
- else
- srcu_lock_release(&rcu_tasks_trace_srcu_struct.dep_map);
+ if (!--t->trc_reader_nesting) {
+ if (!IS_ENABLED(CONFIG_TASKS_TRACE_RCU_NO_MB))
+ smp_mb(); // Placeholder for more selective ordering
+ __srcu_read_unlock_fast(&rcu_tasks_trace_srcu_struct, scp);
+ }
+ srcu_lock_release(&rcu_tasks_trace_srcu_struct.dep_map);
}
/**
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 73a6cc364628b5..6a319e2926589f 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -142,6 +142,29 @@ config TASKS_TRACE_RCU
default n
select IRQ_WORK
+config TASKS_TRACE_RCU_NO_MB
+ bool "Override RCU Tasks Trace inclusion of read-side memory barriers"
+ depends on RCU_EXPERT && TASKS_TRACE_RCU
+ default ARCH_WANTS_NO_INSTR
+ help
+ This option prevents the use of read-side memory barriers in
+ rcu_read_lock_tasks_trace() and rcu_read_unlock_tasks_trace()
+ even in kernels built with CONFIG_ARCH_WANTS_NO_INSTR=n, that is,
+ in kernels that do not have noinstr set up in entry/exit code.
+ By setting this option, you are promising to carefully review
+ use of ftrace, BPF, and friends to ensure that no tracing
+ operation is attached to a function that runs in that portion
+ of the entry/exit code that RCU does not watch, that is,
+ where rcu_is_watching() returns false. Alternatively, you
+ might choose to never remove traces except by rebooting.
+
+ Those wishing to disable read-side memory barriers for an entire
+ architecture can select this Kconfig option, hence the polarity.
+
+ Say Y here if you need speed and will review use of tracing.
+ Say N here for certain esoteric testing of RCU itself.
+ Take the default if you are unsure.
+
config RCU_STALL_COMMON
def_bool TREE_RCU
help
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 833e180db744f2..bf1226834c9423 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1600,8 +1600,13 @@ static inline void rcu_tasks_bootup_oddness(void) {}
// Tracing variant of Tasks RCU. This variant is designed to be used
// to protect tracing hooks, including those of BPF. This variant
// is implemented via a straightforward mapping onto SRCU-fast.
+// DEFINE_SRCU_FAST() is required because rcu_read_lock_trace() must
+// use __srcu_read_lock_fast() in order to bypass the rcu_is_watching()
+// checks in kernels built with CONFIG_TASKS_TRACE_RCU_NO_MB=n.  This also
+// bypasses the srcu_check_read_flavor_force() call that would otherwise
+// mark rcu_tasks_trace_srcu_struct as needing SRCU-fast readers.
-DEFINE_SRCU(rcu_tasks_trace_srcu_struct);
+DEFINE_SRCU_FAST(rcu_tasks_trace_srcu_struct);
EXPORT_SYMBOL_GPL(rcu_tasks_trace_srcu_struct);
#endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
--
2.40.1
On Wed, Oct 1, 2025 at 7:48 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> +static inline struct srcu_ctr __percpu *rcu_read_lock_tasks_trace(void)
> +{
> + struct srcu_ctr __percpu *ret = __srcu_read_lock_fast(&rcu_tasks_trace_srcu_struct);
> +
> + rcu_try_lock_acquire(&rcu_tasks_trace_srcu_struct.dep_map);
> + if (!IS_ENABLED(CONFIG_TASKS_TRACE_RCU_NO_MB))
> + smp_mb(); // Provide ordering on noinstr-incomplete architectures.
> + return ret;
> +}
...
> @@ -50,14 +97,15 @@ static inline void rcu_read_lock_trace(void)
> {
> struct task_struct *t = current;
>
> + rcu_try_lock_acquire(&rcu_tasks_trace_srcu_struct.dep_map);
> if (t->trc_reader_nesting++) {
> // In case we interrupted a Tasks Trace RCU reader.
> - rcu_try_lock_acquire(&rcu_tasks_trace_srcu_struct.dep_map);
> return;
> }
> barrier(); // nesting before scp to protect against interrupt handler.
> - t->trc_reader_scp = srcu_read_lock_fast(&rcu_tasks_trace_srcu_struct);
> - smp_mb(); // Placeholder for more selective ordering
> + t->trc_reader_scp = __srcu_read_lock_fast(&rcu_tasks_trace_srcu_struct);
> + if (!IS_ENABLED(CONFIG_TASKS_TRACE_RCU_NO_MB))
> + smp_mb(); // Placeholder for more selective ordering
> }
Since srcu_fast() __percpu pointers must be incremented/decremented
within the same task, should we expose "raw" rcu_read_lock_tasks_trace()
at all?
rcu_read_lock_trace() stashes that pointer within a task,
so implementation guarantees that unlock will happen within the same task,
while _tasks_trace() requires the user not to do stupid things.
I guess it's fine to have both versions and the amount of copy paste
seems justified, but I keep wondering.
Especially since _tasks_trace() needs more work on bpf trampoline
side to pass this pointer around from lock to unlock.
We can add extra 8 bytes to struct bpf_tramp_run_ctx and save it there,
but set/reset run_ctx operates on current anyway, so it's not clear
which version will be faster. I suspect _trace() will be good enough.
Especially since trc_reader_nesting is kinda an optimization.
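Concretely, the extra-8-bytes option would look something like this
(illustrative only; the field name is made up, not actual trampoline code):

	struct bpf_tramp_run_ctx {
		/* ... existing fields ... */
		struct srcu_ctr __percpu *tasks_trace_scp;	/* the extra 8 bytes */
	};

	/* trampoline enter path */
	run_ctx.tasks_trace_scp = rcu_read_lock_tasks_trace();

	/* trampoline exit path */
	rcu_read_unlock_tasks_trace(run_ctx.tasks_trace_scp);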
On Wed, Oct 01, 2025 at 06:37:33PM -0700, Alexei Starovoitov wrote:
> Since srcu_fast() __percpu pointers must be incremented/decremented
> within the same task, should we expose "raw" rcu_read_lock_tasks_trace()
> at all?
> rcu_read_lock_trace() stashes that pointer within a task,
> so implementation guarantees that unlock will happen within the same task,
> while _tasks_trace() requires the user not to do stupid things.
>
> I guess it's fine to have both versions and the amount of copy paste
> seems justified, but I keep wondering.
> Especially since _tasks_trace() needs more work on bpf trampoline
> side to pass this pointer around from lock to unlock.
> We can add extra 8 bytes to struct bpf_tramp_run_ctx and save it there,
> but set/reset run_ctx operates on current anyway, so it's not clear
> which version will be faster. I suspect _trace() will be good enough.
> Especially since trc_reader_nesting is kinda an optimization.
The idea is to convert callers and get rid of rcu_read_lock_trace()
in favor of rcu_read_lock_tasks_trace(), the reason being the slow
task_struct access on x86. But if the extra storage is an issue for
some use cases, we can keep both. In that case, I would of course reduce
the copy-pasta in a future patch.
Thanx, Paul
On Thu, Oct 2, 2025 at 6:38 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> The idea is to convert callers and get rid of rcu_read_lock_trace()
> in favor of rcu_read_lock_tasks_trace(), the reason being the slow
> task_struct access on x86. But if the extra storage is an issue for
> some use cases, we can keep both. In that case, I would of course reduce
> the copy-pasta in a future patch.
slow task_struct access on x86? That's news to me.
Why is it slow?
static __always_inline struct task_struct *get_current(void)
{
if (IS_ENABLED(CONFIG_USE_X86_SEG_SUPPORT))
return this_cpu_read_const(const_current_task);
return this_cpu_read_stable(current_task);
}
The former is used with gcc 14+ while the latter is with clang.
I don't understand the difference between the two.
I'm guessing gcc14+ can be optimized better within the function,
but both look plenty fast.
We need current access anyway for run_ctx.
On Thu, Oct 02, 2025 at 08:56:01AM -0700, Alexei Starovoitov wrote:
> slow task_struct access on x86? That's news to me.
> Why is it slow?
> static __always_inline struct task_struct *get_current(void)
> {
> if (IS_ENABLED(CONFIG_USE_X86_SEG_SUPPORT))
> return this_cpu_read_const(const_current_task);
>
> return this_cpu_read_stable(current_task);
> }
>
>
> The former is used with gcc 14+ while the latter is with clang.
> I don't understand the difference between the two.
> I'm guessing gcc14+ can be optimized better within the function,
> but both look plenty fast.
>
> We need current access anyway for run_ctx.
Last I measured it, task_struct access was quite a bit slower than was
access to per-CPU variables. The assembly language was such that this
was unsurprising.
But maybe things have changed, and it certainly would be a good thing
if task_struct access had improved. Once I get done hammering it with
functional tests, I will of course do benchmarking and adjust as needed.
Thanx, Paul