[RFC] in-kernel rseq

Peter Zijlstra posted 1 patch 1 month, 1 week ago
[RFC] in-kernel rseq
Posted by Peter Zijlstra 1 month, 1 week ago
Hi,

It has come to my attention that various people are struggling with
preempt_disable()+preempt_enable() costs for various architectures.

Mostly in relation to things like this_cpu_ and/or local_.

The below is a very crude (and broken, more on that below) POC.

So the 'main' advantage of this over preempt_disable()/preempt_enable()
is on the preempt_enable() side: this elides the whole conditional and
call schedule() nonsense.

Now, on to the broken part: the below 'commit' address should be the
address of the 'STORE' instruction. In the case of LL/SC, it should be
the SC; in the case of LSE, the LSE instruction.

This means, it needs to be woven into the asm... and I'm not that handy
with arm64 asm.

The pseudo code would be something like:

	current->sched_seq = &_R;
	...

_start:  compute per cpu-addr
	 load addr
	 $OP
_commit: store addr

	...
	current->sched_rseq = NULL;


Then when preemption happens (from interrupt), the instruction pointer
is 'simply' reset to _start and it tries again.

Anyway, this was aimed at arm64, which chose to use atomics for
this_cpu. But if we move sched_rseq() from schedule-tail into interrupt
entry, then this would also work for things like Power.

Anyway, just throwing ideas out there.


diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index b57b2bb00967..080a868391b7 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -11,6 +11,7 @@
 #include <asm/cmpxchg.h>
 #include <asm/stack_pointer.h>
 #include <asm/sysreg.h>
+#include <generated/rq-offsets.h>
 
 static inline void set_my_cpu_offset(unsigned long off)
 {
@@ -155,9 +156,23 @@ PERCPU_RET_OP(add, add, ldadd)
 
 #define _pcp_protect(op, pcp, ...)					\
 ({									\
-	preempt_disable_notrace();					\
+	__label__ __rseq_begin;						\
+	__label__ __rseq_end;						\
+	static struct sched_rseq _R = {					\
+		.begin = (unsigned long)&&__rseq_begin,			\
+		.commit = (unsigned long)&&__rseq_end,			\
+		.restart = (unsigned long)&&__rseq_begin,		\
+	};								\
+	struct sched_rseq **this_rseq;					\
+	asm ("mrs %0, sp_el0; add %0, %0, %1;" : "=r" (this_rseq) : "i" (TSK_rseq));\
+	*this_rseq = &_R;						\
+__rseq_begin:								\
+	barrier();							\
 	op(raw_cpu_ptr(&(pcp)), __VA_ARGS__);				\
-	preempt_enable_notrace();					\
+	/* XXX broken */						\
+	barrier();							\
+__rseq_end:								\
+	*this_rseq = NULL;						\
 })
 
 #define _pcp_protect_return(op, pcp, args...)				\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a7b4a980eb2f..7960f3e21104 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -817,6 +817,12 @@ struct kmap_ctrl {
 #endif
 };
 
+struct sched_rseq {
+	unsigned long begin;
+	unsigned long commit;
+	unsigned long restart;
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -827,6 +833,8 @@ struct task_struct {
 #endif
 	unsigned int			__state;
 
+	struct sched_rseq		*sched_rseq;
+
 	/* saved state for "spinlock sleepers" */
 	unsigned int			saved_state;
 
diff --git a/include/linux/sched/rseq.h b/include/linux/sched/rseq.h
deleted file mode 100644
index e69de29bb2d1..000000000000
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfd280ec0f97..d4702f8590f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5087,6 +5087,23 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	prepare_arch_switch(next);
 }
 
+static inline void sched_rseq(struct task_struct *prev)
+{
+	struct sched_rseq *rseq = prev->sched_rseq;
+	struct pt_regs *regs;
+	unsigned long ip;
+
+	if (likely(!rseq))
+		return;
+
+	regs = task_pt_regs(prev);
+	ip = instruction_pointer(regs);
+	if ((ip - rseq->begin) >= (rseq->commit - rseq->begin))
+		return;
+
+	instruction_pointer_set(regs, rseq->restart);
+}
+
 /**
  * finish_task_switch - clean up after a task-switch
  * @prev: the thread we just switched away from.
@@ -5145,6 +5162,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = READ_ONCE(prev->__state);
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
+	sched_rseq(prev);
 	finish_task(prev);
 	tick_nohz_task_switch();
 	finish_lock_switch(rq);
diff --git a/kernel/sched/rq-offsets.c b/kernel/sched/rq-offsets.c
index a23747bbe25b..629989a89395 100644
--- a/kernel/sched/rq-offsets.c
+++ b/kernel/sched/rq-offsets.c
@@ -6,7 +6,8 @@
 
 int main(void)
 {
-	DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));
+	DEFINE(RQ_nr_pinned,	offsetof(struct rq, nr_pinned));
+	DEFINE(TSK_rseq,	offsetof(struct task_struct, sched_rseq));
 
 	return 0;
 }
Re: [RFC] in-kernel rseq
Posted by Heiko Carstens 1 month, 1 week ago
On Mon, Feb 23, 2026 at 05:38:43PM +0100, Peter Zijlstra wrote:
> This means, it needs to be woven into the asm... and I'm not that handy
> with arm64 asm.
> 
> The pseudo code would be something like:
> 
> 	current->sched_seq = &_R;
> 	...
> 
> _start:  compute per cpu-addr
> 	 load addr
> 	 $OP
> _commit: store addr
> 
> 	...
> 	current->sched_rseq = NULL;
> 
> 
> Then when preemption happens (from interrupt), the instruction pointer
> is 'simply' reset to _start and it tries again.

I guess also on every interrupt, exception, and nmi current->sched_rseq needs
to be saved on entry, and restored on exit, since other contexts can make use
of this_cpu ops as well.

> Anyway, this was aimed at arm64, which chose to use atomics for
> this_cpu. But if we move sched_rseq() from schedule-tail into interrupt
> entry, then this would also work for things like Power.

Let's assume s390 would be the target, which also uses atomics for
this_cpu ops. A very simple function like:

static DEFINE_PER_CPU(long, bar);

long foo(long val)
{
	return this_cpu_add_return(bar, val); 
}

would turn into the below with PREEMPT_NONE:

0000000000000000 <foo>:
   0:   c0 04 00 00 00 00       jgnop   0 <foo>
   6:   c0 10 00 00 00 00       larl    %r1,6 <foo+0x6> <- r1 contains address of "bar"
                        8: R_390_PC32DBL        .data..percpu+0x2
   c:   a7 39 00 00             lghi    %r3,0
  10:   e3 10 33 b8 00 08       ag      %r1,952(%r3)    <- add per-cpu offset
  16:   eb 02 10 00 00 e8       laag    %r0,%r2,0(%r1)  <- atomic op
  1c:   b9 08 00 20             agr     %r2,%r0
  20:   07 fe                   br      %r14

With PREEMPT_LAZY this turns into:

0000000000000000 <foo>:
   0:   c0 04 00 00 00 00       jgnop   0 <foo>
   6:   eb af f0 68 00 24       stmg    %r10,%r15,104(%r15)
   c:   b9 04 00 ef             lgr     %r14,%r15
  10:   b9 04 00 b2             lgr     %r11,%r2
  14:   e3 f0 ff c8 ff 71       lay     %r15,-56(%r15)
  1a:   e3 e0 f0 98 00 24       stg     %r14,152(%r15) <- up to here: create stack frame
  20:   eb 01 03 a8 00 6a       asi     936,1          <- preempt_inc()
  26:   c0 10 00 00 00 00       larl    %r1,26 <foo+0x26>
                        28: R_390_PC32DBL       .data..percpu+0x2
  2c:   a7 29 00 00             lghi    %r2,0
  30:   e3 10 23 b8 00 08       ag      %r1,952(%r2)
  36:   eb ab 10 00 00 e8       laag    %r10,%r11,0(%r1)
  3c:   eb ff 03 a8 00 6e       alsi    936,-1         <- preempt_dec_and_test()
  42:   a7 54 00 05             jnhe    4c <foo+0x4c>
  46:   c0 e5 00 00 00 00       brasl   %r14,46 <foo+0x46>
                        48: R_390_PLT32DBL      preempt_schedule_notrace+0x2
  4c:   b9 e8 b0 2a             agrk    %r2,%r10,%r11
  50:   eb af f0 a0 00 04       lmg     %r10,%r15,160(%r15)
  56:   07 fe                   br      %r14

With your proposal I guess this would turn into something like below.  Note,
the below is hand-edited, therefore offsets etc. do not make any sense; it is
just the instruction sequence I guess we _could_ end up with:

0000000000000000 <foo>:
   0:   c0 04 00 00 00 00       jgnop   0 <foo>
                                larl    %r1,#this_seq <- &_RR 
                                stg     %r1,944       <- lowcore->sched_seq = &_R;
   c:   c0 10 00 00 00 00       larl    %r1,c <foo+0xc>
                        e: R_390_PC32DBL        .data..percpu+0x2
  16:   e3 10 33 b8 00 08       ag      %r1,952
  1c:   eb 02 10 00 00 e8       laag    %r0,%r2,0(%r1)
                                mvghi   944,0         <- lowcore->sched_seq = NULL;
  2c:   b9 08 00 20             agr     %r2,%r0
  30:   07 fe                   br      %r14

This uses the s390 specific "lowcore" instead of current for sched_seq, since
it is an architecture per-cpu area mapped at address zero.

Let me give it a try to verify whether the generated code would really look
like the above, but it might take a few days.
Re: [RFC] in-kernel rseq
Posted by Peter Zijlstra 1 month, 1 week ago
On Tue, Feb 24, 2026 at 12:16:46PM +0100, Heiko Carstens wrote:

> Let's assume s390 would be target, which also uses atomics for
> this_cpu ops. A very simple function like:
> 
> static DEFINE_PER_CPU(long, bar);
> 
> long foo(long val)
> {
> 	return this_cpu_add_return(bar, val); 
> }
> 
> would turn into the below with PREEMPT_NONE:
> 
> 0000000000000000 <foo>:
>    0:   c0 04 00 00 00 00       jgnop   0 <foo>
>    6:   c0 10 00 00 00 00       larl    %r1,6 <foo+0x6> <- r1 contains address of "bar"
>                         8: R_390_PC32DBL        .data..percpu+0x2
>    c:   a7 39 00 00             lghi    %r3,0
>   10:   e3 10 33 b8 00 08       ag      %r1,952(%r3)    <- add per-cpu offset
>   16:   eb 02 10 00 00 e8       laag    %r0,%r2,0(%r1)  <- atomic op
>   1c:   b9 08 00 20             agr     %r2,%r0
>   20:   07 fe                   br      %r14
> 
> With PREEMPT_LAZY this turns into:
> 
> 0000000000000000 <foo>:
>    0:   c0 04 00 00 00 00       jgnop   0 <foo>
>    6:   eb af f0 68 00 24       stmg    %r10,%r15,104(%r15)
>    c:   b9 04 00 ef             lgr     %r14,%r15
>   10:   b9 04 00 b2             lgr     %r11,%r2
>   14:   e3 f0 ff c8 ff 71       lay     %r15,-56(%r15)
>   1a:   e3 e0 f0 98 00 24       stg     %r14,152(%r15) <- up to here: create stack frame

So some of that could be elided with that asm call thunk thing we talked
about yesterday, right?

>   20:   eb 01 03 a8 00 6a       asi     936,1          <- preempt_inc()
>   26:   c0 10 00 00 00 00       larl    %r1,26 <foo+0x26>
>                         28: R_390_PC32DBL       .data..percpu+0x2
>   2c:   a7 29 00 00             lghi    %r2,0
>   30:   e3 10 23 b8 00 08       ag      %r1,952(%r2)
>   36:   eb ab 10 00 00 e8       laag    %r10,%r11,0(%r1)
>   3c:   eb ff 03 a8 00 6e       alsi    936,-1         <- preempt_dec_and_test()
>   42:   a7 54 00 05             jnhe    4c <foo+0x4c>
>   46:   c0 e5 00 00 00 00       brasl   %r14,46 <foo+0x46>
>                         48: R_390_PLT32DBL      preempt_schedule_notrace+0x2
>   4c:   b9 e8 b0 2a             agrk    %r2,%r10,%r11
>   50:   eb af f0 a0 00 04       lmg     %r10,%r15,160(%r15)
>   56:   07 fe                   br      %r14
> 
> With your proposal I guess this would turn into something like below.  Note,
> the below is hand-edited, therefore offsets etc, do not make any sense, it is
> just the instruction sequence I guess we _could_ end up with:
> 
> 0000000000000000 <foo>:
>    0:   c0 04 00 00 00 00       jgnop   0 <foo>
>                                 larl    %r1,#this_seq <- &_RR 
>                                 stg     %r1,944       <- lowcore->sched_seq = &_R;
>    c:   c0 10 00 00 00 00       larl    %r1,c <foo+0xc>
>                         e: R_390_PC32DBL        .data..percpu+0x2
>   16:   e3 10 33 b8 00 08       ag      %r1,952
>   1c:   eb 02 10 00 00 e8       laag    %r0,%r2,0(%r1)
>                                 mvghi   944,0         <- lowcore->sched_seq = NULL;
>   2c:   b9 08 00 20             agr     %r2,%r0
>   30:   07 fe                   br      %r14
> 
> This uses the s390 specific "lowcore" instead of current for sched_seq, since
> it is an architecture per-cpu area mapped at address zero.

Right, something like that. This is hopefully 'better' :-)
Re: [RFC] in-kernel rseq
Posted by Heiko Carstens 1 month, 1 week ago
On Tue, Feb 24, 2026 at 04:20:32PM +0100, Peter Zijlstra wrote:
> > With PREEMPT_LAZY this turns into:
> > 
> > 0000000000000000 <foo>:
> >    0:   c0 04 00 00 00 00       jgnop   0 <foo>
> >    6:   eb af f0 68 00 24       stmg    %r10,%r15,104(%r15)
> >    c:   b9 04 00 ef             lgr     %r14,%r15
> >   10:   b9 04 00 b2             lgr     %r11,%r2
> >   14:   e3 f0 ff c8 ff 71       lay     %r15,-56(%r15)
> >   1a:   e3 e0 f0 98 00 24       stg     %r14,152(%r15) <- up to here: create stack frame
> 
> So some of that could be elided with that asm call thunk thing we talked
> about yesterday, right?

Yes, with
#define __preempt_schedule_notrace() \
 asm volatile("brasl %%r14,preempt_schedule_notrace_thunk" : : : "cc", "memory", "r14")

we would end up with:

0000000000000000 <foo>:
   0:   c0 04 00 00 00 00       jgnop   0 <foo>
   6:   eb 01 03 a8 00 6a       asi     936,1
   c:   c0 10 00 00 00 00       larl    %r1,c <foo+0xc>
                        e: R_390_PC32DBL        .data..percpu+0x2
  12:   a7 39 00 00             lghi    %r3,0
  16:   e3 10 33 b8 00 08       ag      %r1,952(%r3)
  1c:   eb 22 10 00 00 f8       laa     %r2,%r2,0(%r1)
  22:   eb ff 03 a8 00 6e       alsi    936,-1
  28:   a7 a4 00 03             jhe     2e <foo+0x2e>
  2c:   07 fe                   br      %r14
  2e:   e3 e0 f0 88 00 24       stg     %r14,136(%r15)
  34:   c0 e5 00 00 00 00       brasl   %r14,34 <foo+0x34>
                        36: R_390_PC32DBL       preempt_schedule_notrace_thunk+0x2
  3a:   e3 e0 f0 88 00 04       lg      %r14,136(%r15)
  40:   07 fe                   br      %r14

The stack setup is gone, as wanted :)
Re: [RFC] in-kernel rseq
Posted by Heiko Carstens 1 month, 1 week ago
On Tue, Feb 24, 2026 at 05:02:10PM +0100, Heiko Carstens wrote:
> On Tue, Feb 24, 2026 at 04:20:32PM +0100, Peter Zijlstra wrote:
> > > With PREEMPT_LAZY this turns into:
> > > 
> > > 0000000000000000 <foo>:
> > >    0:   c0 04 00 00 00 00       jgnop   0 <foo>
> > >    6:   eb af f0 68 00 24       stmg    %r10,%r15,104(%r15)
> > >    c:   b9 04 00 ef             lgr     %r14,%r15
> > >   10:   b9 04 00 b2             lgr     %r11,%r2
> > >   14:   e3 f0 ff c8 ff 71       lay     %r15,-56(%r15)
> > >   1a:   e3 e0 f0 98 00 24       stg     %r14,152(%r15) <- up to here: create stack frame
> > 
> > So some of that could be elided with that asm call thunk thing we talked
> > about yesterday, right?
> 
> Yes, with
> #define __preempt_schedule_notrace() \
>  asm volatile("brasl %%r14,preempt_schedule_notrace_thunk" : : : "cc", "memory", "r14")
> 
> we would end up with:

...[random junk]...

Sorry, that was an incorrect version, only handling this_cpu_add().

So with

static DEFINE_PER_CPU(long, bar);

long foo(long val)
{
	return this_cpu_add_return(bar, val);
}

and the above define, the result would be the below (no stack frame -
it's up to the thunk to handle that, including register save/restore).

0000000000000000 <foo>:
   0:   c0 04 00 00 00 00       jgnop   0 <foo>
   6:   eb 01 03 a8 00 6a       asi     936,1
   c:   c0 10 00 00 00 00       larl    %r1,c <foo+0xc>
                        e: R_390_PC32DBL        .data..percpu+0x2
  12:   a7 39 00 00             lghi    %r3,0
  16:   e3 10 33 b8 00 08       ag      %r1,952(%r3)
  1c:   eb 02 10 00 00 e8       laag    %r0,%r2,0(%r1)
  22:   eb ff 03 a8 00 6e       alsi    936,-1
  28:   a7 a4 00 05             jhe     32 <foo+0x32>
  2c:   b9 08 00 20             agr     %r2,%r0
  30:   07 fe                   br      %r14
  32:   e3 e0 f0 88 00 24       stg     %r14,136(%r15)
  38:   c0 e5 00 00 00 00       brasl   %r14,38 <foo+0x38>
                        3a: R_390_PC32DBL       preempt_schedule_notrace_thunk+0x2
  3e:   e3 e0 f0 88 00 04       lg      %r14,136(%r15)
  44:   b9 08 00 20             agr     %r2,%r0
  48:   07 fe                   br      %r14
Re: [RFC] in-kernel rseq
Posted by Peter Zijlstra 1 month, 1 week ago
On Tue, Feb 24, 2026 at 12:16:46PM +0100, Heiko Carstens wrote:
> On Mon, Feb 23, 2026 at 05:38:43PM +0100, Peter Zijlstra wrote:
> > This means, it needs to be woven into the asm... and I'm not that handy
> > with arm64 asm.
> > 
> > The pseudo code would be something like:
> > 
> > 	current->sched_seq = &_R;
> > 	...
> > 
> > _start:  compute per cpu-addr
> > 	 load addr
> > 	 $OP
> > _commit: store addr
> > 
> > 	...
> > 	current->sched_rseq = NULL;
> > 
> > 
> > Then when preemption happens (from interrupt), the instruction pointer
> > is 'simply' reset to _start and it tries again.
> 
> I guess also on every interrupt, exception, and nmi current->sched_rseq needs
> to be saved on entry, and restored on exit, since other contexts can make use
> of this_cpu ops as well.

Right -- I can't seem to make my mind up on this. I *think* I like
the save/restore variant of the sched version better.

Having it restart on every interrupt, even though it's guaranteed not to
change the process, seems unfortunate. Interrupts can come at a fairly
high rate without the task changing.

Anyway, I've cobbled together something a little more elaborate, but
equally untested.

I've renamed it kseq, to be distinct from the existing rseq, and there
are two versions, one sched and one irq based. The sched one is
saved/restored, while the irq one is not.

For both, the architecture is 'required' to provide a function/macro
that gives the address of the pointer; the sched one takes a task as
argument, but that can be completely ignored.

This allows you to use whatever storage you think best, lowcore on s390,
paca on Power, whatever.

Anyway, tglx will probably hate on all this for adding more crap :-)

---
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index d26d1b1bcbfb..3f6d4ceaf3a1 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -362,9 +362,26 @@ typedef struct irqentry_state {
 		bool	exit_rcu;
 		bool	lockdep;
 	};
+#ifdef CONFIG_KSEQ_SCHED
+	void *kseq_sched;
+#endif
 } irqentry_state_t;
 #endif
 
+static __always_inline void irqentry_kseq_push(struct irqentry_state *state)
+{
+#ifdef CONFIG_KSEQ_SCHED
+	state->kseq_sched = *kseq_sched_ptr(current);
+#endif
+}
+
+static __always_inline void irqentry_kseq_pop(struct irqentry_state *state)
+{
+#ifdef CONFIG_KSEQ_SCHED
+	*kseq_sched_ptr(current) = state->kseq_sched;
+#endif
+}
+
 /**
  * irqentry_enter - Handle state tracking on ordinary interrupt entries
  * @regs:	Pointer to pt_regs of interrupted context
diff --git a/include/linux/kseq.h b/include/linux/kseq.h
new file mode 100644
index 000000000000..a8bfdbdedb6f
--- /dev/null
+++ b/include/linux/kseq.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KSEQ_H
+#define _LINUX_KSEQ_H
+
+#include <linux/ptrace.h>
+
+/*
+ * Kernel restartable SEQuence.
+ *
+ * Pseudo code; this is expected to be used in assembler:
+ *
+ * static const struct kseq _R = {
+ *	.begin   = &&__kseq_begin,
+ *	.commit  = &&__kseq_commit,
+ *	.restart = &&__kseq_begin,	// simply retry
+ * };
+ *
+ * __kseq_begin:
+ *	WRITE_ONCE(->kseq, &_R);	// section active
+ *	addr = raw_cpu_ptr(pcp);
+ *	v = READ_ONCE(*addr);
+ *	v $OP i;
+ * __kseq_commit:
+ *	WRITE_ONCE(*addr, v);
+ *	WRITE_ONCE(->kseq, NULL);	// section inactive
+ *
+ * NOTE: when .restart == begin, it must be before writing the relevant kseq
+ *       pointer, since hitting the restart will clear the pointer.
+ *
+ * NOTE: commit must be the STORE that closes the sequence; being restarted
+ *       after this could result in the operation being performed twice, which
+ *       is of course totally BAD(tm).
+ */
+struct kseq {
+	unsigned long begin;
+	unsigned long commit;
+	unsigned long restart;
+};
+
+static __always_inline void __restart_kernel_seq(struct kseq **kseq_ptr, struct pt_regs *regs)
+{
+	struct kseq *kseq = *kseq_ptr;
+	unsigned long ip;
+
+	if (!kseq)
+		return;
+
+	*kseq_ptr = NULL;
+
+	ip = instruction_pointer(regs);
+	if ((ip - kseq->begin) > (kseq->commit - kseq->begin))
+		return;
+
+	/*
+	 * begin <= ip <= commit
+	 */
+	instruction_pointer_set(regs, kseq->restart);
+}
+
+/*
+ * CONFIG_KSEQ_SCHED when set, shall provide:
+ *   struct kseq **kseq_sched_ptr(struct task_struct *);
+ *
+ * CONFIG_KSEQ_IRQ when set, shall provide:
+ *   struct kseq **kseq_irq_ptr(void);
+ *
+ * Both these functions shall provide an arch specific address for the
+ * respective kseq pointer.
+ */
+#if defined(CONFIG_KSEQ_SCHED) || defined(CONFIG_KSEQ_IRQ)
+#include <asm/kseq.h>
+#endif
+
+#endif /* _LINUX_KSEQ_H */
+
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 9ef63e414791..376a7039152e 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -7,6 +7,7 @@
 #include <linux/kmsan.h>
 #include <linux/livepatch.h>
 #include <linux/tick.h>
+#include <linux/kseq.h>
 
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
@@ -103,6 +104,13 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 	}
 }
 
+static __always_inline void irqentry_kseq(struct pt_regs *regs)
+{
+#ifdef CONFIG_KSEQ_IRQ
+	__restart_kernel_seq(kseq_irq_ptr(), regs);
+#endif
+}
+
 noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 {
 	irqentry_state_t ret = {
@@ -149,6 +157,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 		instrumentation_begin();
 		kmsan_unpoison_entry_regs(regs);
 		trace_hardirqs_off_finish();
+		irqentry_kseq_push(&ret);
+		irqentry_kseq(regs);
 		instrumentation_end();
 
 		ret.exit_rcu = true;
@@ -166,6 +176,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	kmsan_unpoison_entry_regs(regs);
 	rcu_irq_enter_check_tick();
 	trace_hardirqs_off_finish();
+	irqentry_kseq_push(&ret);
+	irqentry_kseq(regs);
 	instrumentation_end();
 
 	return ret;
@@ -218,6 +230,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 	if (user_mode(regs)) {
 		irqentry_exit_to_user_mode(regs);
 	} else if (!regs_irqs_disabled(regs)) {
+		irqentry_kseq_pop(&state);
 		/*
 		 * If RCU was not watching on entry this needs to be done
 		 * carefully and needs the same ordering of lockdep/tracing
@@ -242,6 +255,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		trace_hardirqs_on();
 		instrumentation_end();
 	} else {
+		irqentry_kseq_pop(&state);
 		/*
 		 * IRQ flags state is correct already. Just tell RCU if it
 		 * was not watching on entry.
@@ -266,6 +280,10 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 	kmsan_unpoison_entry_regs(regs);
 	trace_hardirqs_off_finish();
 	ftrace_nmi_enter();
+	if (!user_mode(regs)) {
+		irqentry_kseq_push(&irq_state);
+		irqentry_kseq(regs);
+	}
 	instrumentation_end();
 
 	return irq_state;
@@ -274,6 +292,8 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
 {
 	instrumentation_begin();
+	if (!user_mode(regs))
+		irqentry_kseq_pop(&irq_state);
 	ftrace_nmi_exit();
 	if (irq_state.lockdep) {
 		trace_hardirqs_on_prepare();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 759777694c78..b51f41797fe0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -69,6 +69,7 @@
 #include <linux/wait_api.h>
 #include <linux/workqueue_api.h>
 #include <linux/livepatch_sched.h>
+#include <linux/kseq.h>
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
 # ifdef CONFIG_GENERIC_IRQ_ENTRY
@@ -5087,6 +5088,19 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	prepare_arch_switch(next);
 }
 
+/*
+ * Must be called after context switch, but before finish_task(), which will
+ * allow wakeup and scheduling on another CPU.
+ *
+ * This ensures task_pt_regs() is filled out and stable.
+ */
+static inline void kseq_sched(struct task_struct *prev)
+{
+#ifdef CONFIG_KSEQ_SCHED
+	__restart_kernel_seq(kseq_sched_ptr(prev), task_pt_regs(prev));
+#endif
+}
+
 /**
  * finish_task_switch - clean up after a task-switch
  * @prev: the thread we just switched away from.
@@ -5145,6 +5159,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = READ_ONCE(prev->__state);
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
+	kseq_sched(prev);
 	finish_task(prev);
 	tick_nohz_task_switch();
 	finish_lock_switch(rq);
Re: [RFC] in-kernel rseq
Posted by Mathieu Desnoyers 1 month, 1 week ago
On 2026-02-24 06:16, Heiko Carstens wrote:
> On Mon, Feb 23, 2026 at 05:38:43PM +0100, Peter Zijlstra wrote:
>> This means, it needs to be woven into the asm... and I'm not that handy
>> with arm64 asm.
>>
>> The pseudo code would be something like:
>>
>> 	current->sched_seq = &_R;
>> 	...
>>
>> _start:  compute per cpu-addr
>> 	 load addr
>> 	 $OP
>> _commit: store addr
>>
>> 	...
>> 	current->sched_rseq = NULL;
>>
>>
>> Then when preemption happens (from interrupt), the instruction pointer
>> is 'simply' reset to _start and it tries again.
> 
> I guess also on every interrupt, exception, and nmi current->sched_rseq needs
> to be saved on entry, and restored on exit, since other contexts can make use
> of this_cpu ops as well.

If we do a design similar to userspace rseq, we'd abort the rseq
critical section on interrupt, exception, nmi (by changing the pt_regs
instruction pointer) rather than save/restore it. This is what
userspace rseq does for signal handlers nesting on top of rseq critical
sections.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC] in-kernel rseq
Posted by David Laight 1 month, 1 week ago
On Tue, 24 Feb 2026 08:48:03 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2026-02-24 06:16, Heiko Carstens wrote:
> > On Mon, Feb 23, 2026 at 05:38:43PM +0100, Peter Zijlstra wrote:  
> >> This means, it needs to be woven into the asm... and I'm not that handy
> >> with arm64 asm.
> >>
> >> The pseudo code would be something like:
> >>
> >> 	current->sched_seq = &_R;
> >> 	...
> >>
> >> _start:  compute per cpu-addr
> >> 	 load addr
> >> 	 $OP
> >> _commit: store addr
> >>
> >> 	...
> >> 	current->sched_rseq = NULL;
> >>
> >>
> >> Then when preemption happens (from interrupt), the instruction pointer
> >> is 'simply' reset to _start and it tries again.  
> > 
> > I guess also on every interrupt, exception, and nmi current->sched_rseq needs
> > to be saved on entry, and restored on exit, since other contexts can make use
> > of this_cpu ops as well.  
> 
> If we do a design similar to userspace rseq, we'd abort the rseq
> critical section on interrupt, exception, nmi (by changing the pt_regs
> instruction pointer) rather than save/restore it. This is what
> userspace rseq does for signal handlers nesting on top of rseq critical
> sections.

Does that mean that 'start' would have to include the code to setup
the rseq? (rather than being after it as above).

	David

> 
> Thanks,
> 
> Mathieu
>
Re: [RFC] in-kernel rseq
Posted by Mathieu Desnoyers 1 month, 1 week ago
On 2026-02-24 09:59, David Laight wrote:
> On Tue, 24 Feb 2026 08:48:03 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> On 2026-02-24 06:16, Heiko Carstens wrote:
>>> On Mon, Feb 23, 2026 at 05:38:43PM +0100, Peter Zijlstra wrote:
>>>> This means, it needs to be woven into the asm... and I'm not that handy
>>>> with arm64 asm.
>>>>
>>>> The pseudo code would be something like:
>>>>
>>>> 	current->sched_seq = &_R;
>>>> 	...
>>>>
>>>> _start:  compute per cpu-addr
>>>> 	 load addr
>>>> 	 $OP
>>>> _commit: store addr
>>>>
>>>> 	...
>>>> 	current->sched_rseq = NULL;
>>>>
>>>>
>>>> Then when preemption happens (from interrupt), the instruction pointer
>>>> is 'simply' reset to _start and it tries again.
>>>
>>> I guess also on every interrupt, exception, and nmi current->sched_rseq needs
>>> to be saved on entry, and restored on exit, since other contexts can make use
>>> of this_cpu ops as well.
>>
>> If we do a design similar to userspace rseq, we'd abort the rseq
>> critical section on interrupt, exception, nmi (by changing the pt_regs
>> instruction pointer) rather than save/restore it. This is what
>> userspace rseq does for signal handlers nesting on top of rseq critical
>> sections.
> 
> Does that mean that 'start' would have to include the code to setup
> the rseq? (rather than being after it as above).

No. The "code to setup rseq" is possibly a sequence of instructions
(or just one instruction) which ends with just a single store which
populates the rseq_cs pointer. It's OK to have the start address
immediately after that store.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC] in-kernel rseq
Posted by David Laight 1 month, 1 week ago
On Mon, 23 Feb 2026 17:38:43 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Hi,
> 
> It has come to my attention that various people are struggling with
> preempt_disable()+preempt_enable() costs for various architectures.
> 
> Mostly in relation to things like this_cpu_ and or local_. 
> 
> The below is a very crude (and broken, more on that below) POC.
> 
> So the 'main' advantage of this over preempt_disable()/preempt_enable(),
> it on the preempt_enable() side, this elides the whole conditional and
> call schedule() nonsense.
> 
> Now, on to the broken part, the below 'commit' address should be the
> address of the 'STORE' instruction. In case of LL/SC, it should be the
> SC, in case of LSE, it should be the LSE instruction.

I think it would be better as the address of the instruction after
the 'store'.
You probably don't need separate 'begin' and 'restart' addresses.
It might be enough to save the 'restart' address and a byte length
directly in 'current', much simpler code.

How much it helps is another matter.
I'm sure I remember something about per-cpu data being used for something
because it was faster than using 'current' - not sure of the context.

The real problem with rseq is that they don't scale.
At least this one is against the context switch code - which is a slow path.

I wonder if anyone (not reading the code) would notice if a 'short
term preempt-disable' that missed out the reschedule test were
implemented?
I think that is just unlocked RMW of a per-cpu/thread variable.

	David

> 
> This means, it needs to be woven into the asm... and I'm not that handy
> with arm64 asm.
> 
> The pseudo code would be something like:
> 
> 	current->sched_seq = &_R;
> 	...
> 
> _start:  compute per cpu-addr
> 	 load addr
> 	 $OP
> _commit: store addr
> 
> 	...
> 	current->sched_rseq = NULL;
> 
> 
> Then when preemption happens (from interrupt), the instruction pointer
> is 'simply' reset to _start and it tries again.
> 
> Anyway, this was aimed at arm64, which chose to use atomics for
> this_cpu. But if we move sched_rseq() from schedule-tail into interrupt
> entry, then this would also work for things like Power.
> 
> Anyway, just throwing ideas out there.
> 
> 
> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> index b57b2bb00967..080a868391b7 100644
> --- a/arch/arm64/include/asm/percpu.h
> +++ b/arch/arm64/include/asm/percpu.h
> @@ -11,6 +11,7 @@
>  #include <asm/cmpxchg.h>
>  #include <asm/stack_pointer.h>
>  #include <asm/sysreg.h>
> +#include <generated/rq-offsets.h>
>  
>  static inline void set_my_cpu_offset(unsigned long off)
>  {
> @@ -155,9 +156,23 @@ PERCPU_RET_OP(add, add, ldadd)
>  
>  #define _pcp_protect(op, pcp, ...)					\
>  ({									\
> -	preempt_disable_notrace();					\
> +	__label__ __rseq_begin;						\
> +	__label__ __rseq_end;						\
> +	static struct sched_rseq _R = {					\
> +		.begin = (unsigned long)&&__rseq_begin,			\
> +		.commit = (unsigned long)&&__rseq_end,			\
> +		.restart = (unsigned long)&&__rseq_begin,		\
> +	};								\
> +	struct sched_rseq **this_rseq;					\
> +	asm ("mrs %0, sp_el0; add %0, %0, %1;" : "=r" (this_rseq) : "i" (TSK_rseq));\
> +	*this_rseq = &_R;						\
> +__rseq_begin:								\
> +	barrier();							\
>  	op(raw_cpu_ptr(&(pcp)), __VA_ARGS__);				\
> -	preempt_enable_notrace();					\
> +	/* XXX broken */						\
> +	barrier();							\
> +__rseq_end:								\
> +	*this_rseq = NULL;						\
>  })
>  
>  #define _pcp_protect_return(op, pcp, args...)				\
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a7b4a980eb2f..7960f3e21104 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -817,6 +817,12 @@ struct kmap_ctrl {
>  #endif
>  };
>  
> +struct sched_rseq {
> +	unsigned long begin;
> +	unsigned long commit;
> +	unsigned long restart;
> +};
> +
>  struct task_struct {
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/*
> @@ -827,6 +833,8 @@ struct task_struct {
>  #endif
>  	unsigned int			__state;
>  
> +	struct sched_rseq		*sched_rseq;
> +
>  	/* saved state for "spinlock sleepers" */
>  	unsigned int			saved_state;
>  
> diff --git a/include/linux/sched/rseq.h b/include/linux/sched/rseq.h
> deleted file mode 100644
> index e69de29bb2d1..000000000000
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bfd280ec0f97..d4702f8590f2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5087,6 +5087,23 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  	prepare_arch_switch(next);
>  }
>  
> +static inline void sched_rseq(struct task_struct *prev)
> +{
> +	struct sched_rseq *rseq = prev->sched_rseq;
> +	struct pt_regs *regs;
> +	unsigned long ip;
> +
> +	if (likely(!rseq))
> +		return;
> +
> +	regs = task_pt_regs(prev);
> +	ip = instruction_pointer(regs);
> +	if ((ip - rseq->begin) >= (rseq->commit - rseq->begin))
> +		return;
> +
> +	instruction_pointer_set(regs, rseq->restart);
> +}
> +
>  /**
>   * finish_task_switch - clean up after a task-switch
>   * @prev: the thread we just switched away from.
> @@ -5145,6 +5162,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	prev_state = READ_ONCE(prev->__state);
>  	vtime_task_switch(prev);
>  	perf_event_task_sched_in(prev, current);
> +	sched_rseq(prev);
>  	finish_task(prev);
>  	tick_nohz_task_switch();
>  	finish_lock_switch(rq);
> diff --git a/kernel/sched/rq-offsets.c b/kernel/sched/rq-offsets.c
> index a23747bbe25b..629989a89395 100644
> --- a/kernel/sched/rq-offsets.c
> +++ b/kernel/sched/rq-offsets.c
> @@ -6,7 +6,8 @@
>  
>  int main(void)
>  {
> -	DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));
> +	DEFINE(RQ_nr_pinned,	offsetof(struct rq, nr_pinned));
> +	DEFINE(TSK_rseq,	offsetof(struct task_struct, sched_rseq));
>  
>  	return 0;
>  }
>
Re: [RFC] in-kernel rseq
Posted by Mathieu Desnoyers 1 month, 1 week ago
On 2026-02-23 12:53, David Laight wrote:
> On Mon, 23 Feb 2026 17:38:43 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> Hi,
>>
>> It has come to my attention that various people are struggling with
>> preempt_disable()+preempt_enable() costs for various architectures.
>>
>> Mostly in relation to things like this_cpu_ and or local_.
>>
>> The below is a very crude (and broken, more on that below) POC.
>>
>> So the 'main' advantage of this over preempt_disable()/preempt_enable(),
>> it on the preempt_enable() side, this elides the whole conditional and
>> call schedule() nonsense.
>>
>> Now, on to the broken part, the below 'commit' address should be the
>> address of the 'STORE' instruction. In case of LL/SC, it should be the
>> SC, in case of LSE, it should be the LSE instruction.
> 
> I think it would be better as the address of the instruction after
> the 'store'.

That's indeed what we do for userspace rseq.

> You probably don't need separate 'begin' and 'restart' addresses.

It's not needed as long as the abort behavior is only restart. It
becomes useful if another behavior is wanted on abort. But since
this is kernel code and not ABI, it can change if the need arises.

> It might be enough to save the 'restart' address and a byte length
> directly in 'current', much simpler code.

That would make it two stores to the task struct. Those would not be
single-instruction, so we'd have to deal with preemption coming between
those two stores. Also this would be more code: two stores compared
to a single pointer store to the task struct to begin the critical
section. AFAIU Peter's proposed approach is more efficient.

We could turn the end address into a length if we want, this would
make it more alike the userspace rseq ABI counterpart.

> 
> How much it helps is another matter.
> I'm sure I remember something about per-cpu data being used for something
> because it was faster than using 'current' - not sure of the context.

The problem with per-cpu data for this is how to handle migration?
The whole point of this is to replace preempt disable.

> 
> The real problem with rseq is they don't scale.

Not sure what you mean. They don't scale with respect to what?

> At least this one is against the context switch code - which is a slow path.

This adds a task struct field load + NULL check on the scheduler
fast path. Is that what you are concerned about?

[...]
> I think that is just unlocked RMW of a per-cpu/thread variable.

It is quite similar to LL/SC, but mitigated by the scheduler rather than
hardware, so it can use a sequence of cheaper load/store instructions on
the fast path. Also, based on prior benchmarks, a short sequence of
loads/stores was faster than an unlocked RMW instruction (at least on
x86-64).

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC] in-kernel rseq
Posted by Peter Zijlstra 1 month, 1 week ago
On Mon, Feb 23, 2026 at 01:22:18PM -0500, Mathieu Desnoyers wrote:

> > I think it would be better as the address of the instruction after
> > the 'store'.
> 
> That's indeed what we do for userspace rseq.

Either works I suppose. The only thing to be careful about is that you
must not restart once the store has happened.

> > You probably don't need separate 'begin' and 'restart' addresses.
> 
> It's not needed as long as the abort behavior is only restart. It
> becomes useful if another behavior is wanted on abort. But since
> this is kernel code and not ABI, it can change if the need arises.

Right, didn't want to limit to restart, although that is what is used
here.

> > It might be enough to save the 'restart' address and a byte length
> > directly in 'current', much simpler code.
> 
> That would make it two stores to the task struct. Those would not be
> single-instruction, so we'd have to deal with preemption coming between
> those two stores. Also this would be more code: two stores compared
> to a single pointer store to the task struct to begin the critical
> section. AFAIU Peter's proposed approach is more efficient.

Must indeed be a single store. Either we have it set in full, or we
don't.

> We could turn the end address into a length if we want, this would
> make it more alike the userspace rseq ABI counterpart.

I find 3 distinct addresses easier to fill out, but again it doesn't
matter.

> > How much it helps is another matter.
> > I'm sure I remember something about per-cpu data being used for something
> > because it was faster then using 'current' - not sure of the context.
> 
> The problem with per-cpu data for this is how to handle migration ?
> The whole point of this is to replace preempt disable.

This; it cannot be a per-cpu address, if you need it to implement
per-cpu ops.

> > The real problem with rseq is they don't scale.
> 
> Not sure what you mean. They don't scale with respect to what ?

He might be talking about forward progress instead of scaling. There are
indeed forward progress concerns with rseq -- as there are with trivial
LL/SC. But given the length of a slice vs the length of a rseq section,
this should be a non-issue.

Doing the restart on interrupt would be a bigger issue. Although even
there I think that since the operations we're talking about are but a
few instructions, it should all just work well enough.

And if not, you can always craft a restart path that does the actual
local_irq_disable().

Eg.

this_cpu_add(pcp, i)
{
	static const struct sched_rseq _R = {
		.begin = &&__rseq_begin,
		.commit = &&__rseq_commit,
		.restart = &&__rseq_restart,
	};

	WRITE_ONCE(current->sched_rseq, &_R);
__rseq_begin:
	barrier();
	addr = raw_cpu_ptr(pcp);
	v = READ_ONCE(*addr);
	v += i;
	WRITE_ONCE(*addr, v);
	barrier();
__rseq_commit:
	WRITE_ONCE(current->sched_rseq, NULL);
	return;

__rseq_restart:
	guard(irqsave)();
	addr = raw_cpu_ptr(pcp);
	v = READ_ONCE(*addr);
	v += i;
	WRITE_ONCE(*addr, v);
	return;
}

That way you get the fast path most of the time, except when you did get
an interrupt in between.

> > I think that is just unlocked RMW of a per-cpu/thread variable.

That's missing the point entirely. He might be stuck in x86_64 or
something.
Re: [RFC] in-kernel rseq
Posted by David Laight 1 month, 1 week ago
On Mon, 23 Feb 2026 22:54:36 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Feb 23, 2026 at 01:22:18PM -0500, Mathieu Desnoyers wrote:
> 
> > > I think it would be better as the address of the instruction after
> > > the 'store'.  
> > 
> > That's indeed what we do for userspace rseq.  
> 
> Either works I suppose. The only thing to be careful about is that you
> must not restart once the store has happened.
> 
> > > You probably don't need separate 'begin' and 'restart' addresses.  
> > 
> > It's not needed as long as the abort behavior is only restart. It
> > becomes useful if another behavior is wanted on abort. But since
> > this is kernel code and not ABI, it can change if the need arise.  
> 
> Right, didn't want to limit to restart, although that is what is used
> here.
> 
> > > It might be enough to save the 'restart' address and a byte length
> > > directly in 'current', much simpler code.  
> > 
> > That would make it two stores to the task struct. Those would not be
> > single-instruction, so we'd have to deal with preemption coming between
> > those two stores. Also this would be more code: two stores compared
> > to a single pointer store to the task struct to begin the critical
> > section. AFAIU Peter's proposed approach is more efficient.  
> 
> Must indeed be a single store. Either we have it set in full, or we
> don't.

Not really, you can do two stores (to the task struct) provided you
check the second one - remember the data is being looked at by the
cpu that did the writes.

> > We could turn the end address into a length if we want, this would
> > make it more alike the userspace rseq ABI counterpart.  
> 
> I find 3 distinct addresses easier to fill out, but again it doesn't
> matter.

Actually if you save the end address you only need to check if the
current %pc is less than that address, if it is you back it up to
the start of the sequence.

> 
> > > How much it helps is another matter.
> > > I'm sure I remember something about per-cpu data being used for something
> > > because it was faster then using 'current' - not sure of the context.  
> > 
> > The problem with per-cpu data for this is how to handle migration ?
> > The whole point of this is to replace preempt disable.  
> 
> This; it cannot be a per-cpu address, if you need it to implement
> per-cpu ops.

Sorry yes, you are replacing a per-cpu data operation with a per-task one.
But I'm sure I remember something where the opposite was done because it
was unexpectedly faster to use per-cpu data.
I'm not sure where arm gets 'current' from, x86 'has it easy' because
of %fs and %gs.
(If current is loaded from per-cpu data that might explain why using
per-cpu data is faster.)

That makes me think (a bad sign)...
Are the compilers 'clever' enough to use %fs for current->member while
current()->member uses a #define to get the actual address?

preempt_disable() itself can be implemented using per-cpu or per-task
data. I think it varies between architectures; not sure which the asm uses.

> > > The real problem with rseq is they don't scale.  
> > 
> > Not sure what you mean. They don't scale with respect to what ?  
> 
> He might be talking about forward progress instead of scaling. There are
> indeed forward progress concerns with rseq -- as there are with trivial
> LL/SC. But given the length of a slice vs the length of a rseq section,
> this should be a non-issue.

No, scaling: in this case it is fine to add the rseq just before needing it.
But if they have to be set in advance then you start getting a long list
to check - I'm sure that must happen with userspace rseq?

> 
> Doing the restart on interrupt would be a bigger issue. Although even
> there I think that since the operations we're talking about are but a
> few instructions, it should all just work well enough.
> 
...

> > > I think that is just unlocked RMW of a per-cpu/thread variable.  
> 
> That's missing the point entirely. He might be stuck in x86_64 or
> something.

Not entirely, it doesn't matter if code is preempted between the read
and write in preempt_disable() because that can only happen when the
count is changing from 0 to 1.
What does matter is that the 1 is written to the correct place.

	David
Re: [RFC] in-kernel rseq
Posted by Mathieu Desnoyers 1 month, 1 week ago
On 2026-02-24 05:27, David Laight wrote:
[...]
> 
> No scaling, in this case it is fine to add the rseq just before needing it.

In all cases it is fine to set the per-task rseq pointer just before
needing it. That's how the userspace rseq was implemented.

> But if they have to be set in advance then you start getting a long list
> to check - I'm sure that must happen with userspace rseq?

No, userspace declares rseq_cs descriptors in its data, and populates
the rseq_abi->rseq_cs field (thread-local) with a pointer to that
descriptor at the very beginning of the critical section.

So return to userspace after context switch either finds a NULL pointer
or only needs to load from a single rseq_cs descriptor from userspace.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC] in-kernel rseq
Posted by David Laight 1 month, 1 week ago
On Tue, 24 Feb 2026 08:33:23 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2026-02-24 05:27, David Laight wrote:
> [...]
> > 
> > No scaling, in this case it is fine to add the rseq just before needing it.  
> 
> In all cases it is fine to set the per-task rseq pointer just before
> needing it. That's how the userspace rseq was implemented.
> 
> > But if they have to be set in advance then you start getting a long list
> > to check - I'm sure that must happen with userspace rseq?  
> 
> No, userspace declares rseq_cs descriptors in its data, and populates
> the rseq_abi->rseq_cs field (thread-local) with a pointer to that
> descriptor at the very beginning of the critical section.
> 
> So return to userspace after context switch either finds a NULL pointer
> or only needs to load from a single rseq_cs descriptor from userspace.

So all of the program has to use a single (per thread) 'rseq' structure?
And you better not try to use it in a signal handler.

I'm sure I should know the main use-case for user-space rseq.
I do remember a big problem with short-duration 'hot mutex' used for
(eg) removing work items from a linked list.
Although the 'hold time' is usually only a few instructions (so contention
is unlikely), if you get hit by an ethernet interrupt while the mutex is
held it can take milliseconds for all the softint code to complete before
the mutex is released - not good for system throughput.

	David

> 
> Thanks,
> 
> Mathieu
>
Re: [RFC] in-kernel rseq
Posted by Mathieu Desnoyers 1 month, 1 week ago
On 2026-02-24 09:49, David Laight wrote:
> So all of the program has to use a single (per thread) 'rseq' structure?

Correct. One rseq TLS which contains a pointer to the current rseq_cs
descriptor (or NULL), and possibly many rseq_cs critical section
descriptors.

> And you better not try to use it in a signal handler.

The signal handler delivery handles the rseq abort on return to
userspace. The rseq critical section can be used within a signal
handler.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com