Aside from a Kconfig knob, add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this functionality
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
V6: Fix typos - Bigeasy
V5: Document behaviour of arbitrary syscalls
V4: Make the example correct - Prakash
V3: Fix more typos and expressions - Randy
V2: Fix Kconfig indentation, fix typos and expressions - Randy
Make the control fields a struct and remove the atomicity requirement - Mathieu
---
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 135 ++++++++++++++++++++++++++++++++++
include/linux/rseq_types.h | 28 ++++++-
include/uapi/linux/rseq.h | 38 +++++++++
init/Kconfig | 12 +++
kernel/rseq.c | 7 +
6 files changed, 220 insertions(+), 1 deletion(-)
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
+ rseq
Security-related interfaces
===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,135 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to the current CPU number and node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows implementing per-CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success, or one of the following error codes otherwise:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can also be queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns one of the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and serve informational purposes only.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out. The length of the extension is
+determined by the ``rseq_slice_extension_nsec`` sysctl.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving userspace from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl.request = 1;
+ barrier(); // Prevent compiler reordering
+ critical_section();
+ barrier(); // Prevent compiler reordering
+ rseq->slice_ctrl.request = 0;
+ if (rseq->slice_ctrl.granted)
+ rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided::
+
+ if (rseq->slice_ctrl.granted)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+If the thread issues a syscall other than rseq_slice_yield(2) within the
+granted timeslice extension, the grant is also revoked and the CPU is
+relinquished immediately when entering the kernel. This is required as
+syscalls might consume arbitrary CPU time until they reach a scheduling
+point when the preemption model is either NONE or VOLUNTARY and therefore
+might exceed the grant by far.
+
+The preferred solution for user space is to use rseq_slice_yield(2) which
+is side effect free. The support for arbitrary syscalls is required to
+support onion-layered application architectures, where the code handling the
+critical section and requesting the time slice extension has no control
+over the code within the critical section.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -73,12 +73,35 @@ struct rseq_ids {
};
/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+ u16 state;
+ struct {
+ u8 enabled;
+ u8 granted;
+ };
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+ union rseq_slice_state state;
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
- * @sig: Signature of critial section abort IPs
+ * @sig: Signature of critical section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@@ -86,6 +109,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ struct rseq_slice slice;
+#endif
};
#else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@@ -35,6 +41,11 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
};
/*
@@ -53,6 +64,27 @@ struct rseq_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * struct rseq_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -142,6 +174,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1938,6 +1938,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_SLICE_EXTENSION
+ bool "Enable rseq-based time slice extension mechanism"
+ depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ help
+ Allows userspace to request a limited time slice extension when
+ returning from an interrupt to user space via the RSEQ shared
+ data ABI. If granted, that allows the task to complete a critical
+ section, so that other threads are not stuck on a contended
+ resource while the task is scheduled out.
+
+ If unsure, say N.
+
config RSEQ_STATS
default n
bool "Enable lightweight statistics of restartable sequences" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (!access_ok(rseq, rseq_len))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
scoped_user_write_access(rseq, efault) {
/*
* If the rseq_cs pointer is non-NULL on registration, clear it to
@@ -449,11 +454,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
* clearing the fields. Don't bother reading it, just reset it.
*/
unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ unsafe_put_user(rseqfl, &rseq->flags, efault);
/* Initialize IDs in user space */
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
unsafe_put_user(0U, &rseq->node_id, efault);
unsafe_put_user(0U, &rseq->mm_cid, efault);
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
}
/*
On Mon, Dec 15, 2025 at 05:52:04PM +0100, Thomas Gleixner wrote:
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
> };
>
> enum rseq_cs_flags_bit {
> + /* Historical and unsupported bits */
> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
> + /* (3) Intentional gap to put new bits into a separate byte */
> +
> + /* User read only feature flags */
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
> };
Either 's/byte/nibble/' or 's/= [45]/= [78]/' I suppose.
On 2025-12-15 13:24, Thomas Gleixner wrote:
[...]
> +The thread has to enable the functionality via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
Although it is not documented, it appears that a thread can
also use this prctl to disable slice extension.
How is it meant to compose once we have libc trying to use slice
extension internally and the application also using it or wishing to
disable it, unaware that libc is also trying to use it ?
Applications are composed of various libraries, each of which may want
to use the feature. It's unclear to me how the per-thread slice
extension enable/disable state fits in this context. Unless we address
this, it will become either:
- Owned and used by a single library, or
- Owned and used by the application, unavailable to libraries.
This goes against the design goals of RSEQ features.
[...]
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
> };
>
> enum rseq_cs_flags_bit {
> + /* Historical and unsupported bits */
> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
> + /* (3) Intentional gap to put new bits into a separate byte */
Aren't there 8 bits in a byte ? What am I missing ?
> +
> + /* User read only feature flags */
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
> };
>
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Tue, Dec 16 2025 at 09:36, Mathieu Desnoyers wrote:
> On 2025-12-15 13:24, Thomas Gleixner wrote:
> [...]
>> +The thread has to enable the functionality via prctl(2)::
>> +
>> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>
> Although it is not documented, it appears that a thread can
> also use this prctl to disable slice extension.
Obviously. Controls are supposed to be symmetrical.
> How is it meant to compose once we have libc trying to use slice
> extension internally and the application also using it or wishing to
> disable it, unaware that libc is also trying to use it ?
Tons of prctls have the same "issue". What's so special about this?
> Applications are composed of various libraries, each of which may want
I'm well aware of that fact.
> to use the feature. It's unclear to me how the per-thread slice
> extension enable/disable state fits in this context. Unless we address
> this, it will become either:
>
> - Owned and used by a single library, or
>
> - Owned and used by the application, unavailable to libraries.
The prctl allows you to query the state, so all parties can make
informed decisions. It's not any different from other mechanisms, which
require coordination between different parts.
> This goes against the design goals of RSEQ features.
These goals are documented where?
What I've seen so far at least from the implementation is that it aims
to enable the maximum amount of features, aka. overhead, unconditionally
even if nothing uses them, e.g. CID.
Your vision/goal of RSEQ being useful everywhere simply does not match
the reality.
As I pointed out in the previous submission, the benefits of time slice
extensions are limited. In low contention scenarios they result in
measurable regressions, so it's not the magic panacea which solves all
locking/critical section problems at once.
The idea that cobbling random libraries together in the hope that
everything goes well has never worked. That's simply a wet dream and
Java has proven that to the maximum extent decades ago. Nevertheless all
other programming models went down the same yawning abyss and everyone
expects that the kernel is magically solving their problems by adding
more abusable [mis]features.
Systems have to be designed carefully as a whole if you want to achieve
the maximum performance. That's not any different from other targets
like real-time. A real-time enabled kernel does not magically create a
real-time system.
TBH, the prctl should be the least of your worries. There are worse
problems with uncoordinated usage:
set(REQUEST)
....
-> Interrupt
clr(REQUEST)
set(GRANTED)
lib1fn()
set(REQUEST) <- Inconsistent state
if (a) {
lib2fn()
syscall() <- RSEQ debug will kill the task....
} else {
...
-> Interrupt
<- RSEQ debug will kill the task....
And no, we are not going to lift this restriction because it allows
abuse of the mechanism unless we track more state and inflict more
overhead on the kernel for no good reason.
Thanks,
tglx
On Fri, Dec 19, 2025 at 12:21:30AM +0100, Thomas Gleixner wrote:
> On Tue, Dec 16 2025 at 09:36, Mathieu Desnoyers wrote:
> > On 2025-12-15 13:24, Thomas Gleixner wrote:
> > [...]
> >> +The thread has to enable the functionality via prctl(2)::
> >> +
> >> +   prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> >> +         PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> >
> > Although it is not documented, it appears that a thread can
> > also use this prctl to disable slice extension.
>
> Obviously. Controls are supposed to be symmetrical.
>
> > How is it meant to compose once we have libc trying to use slice
> > extension internally and the application also using it or wishing to
> > disable it, unaware that libc is also trying to use it ?
>
> Tons of prctls have the same "issue". What's so special about this?

So I've read this whole thread, and I'm with Thomas on this. Yes this
interface has sharp edges, but I don't think anything here makes a case
for adding more complexity.

As Thomas already stated; the very worst possible outcome is that slice
extensions are always denied -- this is a performance issue, not a
correctness issue.

To me it really reads like: Doctor, it hurts when I hit my hand with a
hammer.
On 2025-12-18 18:21, Thomas Gleixner wrote:
> On Tue, Dec 16 2025 at 09:36, Mathieu Desnoyers wrote:
>> On 2025-12-15 13:24, Thomas Gleixner wrote:
>> [...]
>>> +The thread has to enable the functionality via prctl(2)::
>>> +
>>> +   prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>>> +         PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>>
>> Although it is not documented, it appears that a thread can
>> also use this prctl to disable slice extension.
>
> Obviously. Controls are supposed to be symmetrical.

I agree that the vast majority of prctl are symmetrical, but there are
exceptions, e.g. PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP.

>> How is it meant to compose once we have libc trying to use slice
>> extension internally and the application also using it or wishing to
>> disable it, unaware that libc is also trying to use it ?
>
> Tons of prctls have the same "issue". What's so special about this?

What is special about this is the fact that we want to allow userspace
to specialize its fast-path code at runtime based on availability of an
rseq feature.

If we allow slice extension to be disabled by the program or any library
within the process, this means that either the program or any other
library cannot assume slice extension availability to stay invariant
after it has been setup. This therefore requires adding additional
feature availability tests on the fast-path. And if this state is
per-thread, this means testing flags within the rseq area on every use.
Even if this is a simple load from an address at thread pointer +
offset, test, and branch, the overhead adds up quickly for fast-paths.

Moreover, if the prctl enables the feature independently for each thread
(rather than for the whole process), this requires a conditional state
check on every use because it can be enabled or disabled depending on
the thread. This prevents code specialization that would select the
appropriate code at process startup through either ifunc resolver, code
patching or other means.

[...]

> The prctl allows you to query the state, so all parties can make
> informed decisions. It's not any different from other mechanisms, which
> require coordination between different parts.

I'm fine with having prctl enable the feature (for the whole process)
and query its state.

The part I'm concerned with is the prctl disabling the feature, as
we're losing the availability invariant after setup.

>> This goes against the design goals of RSEQ features.
>
> These goals are documented where?

We should clarify those design goals somewhere. So far those have been
enforced by me when vetting new features, but that approach is not good
in the long term. Is Documentation/userspace-api/rseq.rst a good
location for this ?

> What I've seen so far at least from the implementation is that it aims
> to enable the maximum amount of features, aka. overhead, unconditionally
> even if nothing uses them, e.g. CID.

I don't mind having things disabled on process startup and then opt-in.
What I care about though is that the enabled state stays invariant
across the entire process after setting this up at program startup.

I agree with you in retrospect that this opt-in approach should have
been taken for CID.

> Your vision/goal of RSEQ being useful everywhere simply does not match
> the reality.

Again, I don't mind the opt-in approach, only that the state stays
invariant after program startup.

> As I pointed out in the previous submission, the benefits of time slice
> extensions are limited. In low contention scenarios they result in
> measurable regressions, so it's not the magic panacea which solves all
> locking/critical section problems at once.

I agree that whatever code we add to an uncontended spinlock fast path
will show up in microbenchmark measurements.

> The idea that cobbling random libraries together in the hope that
> everything goes well has never worked. That's simply a wet dream and
> Java has proven that to the maximum extent decades ago. Nevertheless all
> other programming models went down the same yawning abyss and everyone
> expects that the kernel is magically solving their problems by adding
> more abusable [mis]features.
>
> Systems have to be designed carefully as a whole if you want to achieve
> the maximum performance. That's not any different from other targets
> like real-time. A real-time enabled kernel does not magically create a
> real-time system.
[...]

I think we are talking about two different program/libraries composition
use-cases here.

AFAIU, the aspect you are focused on is whether we should allow users of
slice extension to nest. I agree with you that we should document this
as unsupported, since the goal of slice extension is really for short
spinlock critical sections, and nesting of those goes against that basic
definition.

The concern I am raising here is different. It's about just _using_
slice extension from various entities (program, libraries) within a
process, without any nesting of slice extension requests.

If libc successfully enables slice extension in its startup, the kernel
should guarantee that it stays invariant for the lifetime of the program
so libc can optimize its code accordingly, or use a fallback, without
requiring additional per-thread variable checks in its fast paths.

Thanks,
Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Wed, Jan 07 2026 at 16:11, Mathieu Desnoyers wrote:
> On 2025-12-18 18:21, Thomas Gleixner wrote:
>> On Tue, Dec 16 2025 at 09:36, Mathieu Desnoyers wrote:
>>> On 2025-12-15 13:24, Thomas Gleixner wrote:
>>> [...]
>>>> +The thread has to enable the functionality via prctl(2)::
>>>> +
>>>> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>>>> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>>>
>>> Although it is not documented, it appears that a thread can
>>> also use this prctl to disable slice extension.
>>
>> Obviously. Controls are supposed to be symmetrical.
>
> I agree that the vast majority of prctl are symmetrical, but
> there are exceptions, e.g. PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP.
Which have security requirements and are therefore different.
>>> How is it meant to compose once we have libc trying to use slice
>>> extension internally and the application also using it or wishing to
>>> disable it, unaware that libc is also trying to use it ?
>>
>> Tons of prctls have the same "issue". What's so special about this?
>
> What is special about this is the fact that we want to allow userspace
> to specialize its fast-path code at runtime based on availability of an
> rseq feature.
>
> If we allow slice extension to be disabled by the program or any
> library within the process, this means that either the program or any
> other library cannot assume slice extension availability to stay
> invariant after it has been setup.
That's really a non-problem. This is not any different from other
tunables and there is really no reason to make up theoretical cases
where a library enables and another one disables. If user space can't
get it's act together then so be it. It's not the kernels problem and as
this is not a security feature with strict semantics, there is no reason
to let the kernel implement policy.
> Moreover, if the prctl enables the feature independently for each
> thread (rather than for the whole process), this requires a conditional
> state check on every use because it can be enabled or disabled
> depending on the thread. This prevents code specialization that would
> select the appropriate code at process startup through either ifunc
> resolver, code patching or other mean.
I'm not completely opposed to make it process wide. For threads created
after enablement, that's trivial because that can be done when the per
thread RSEQ is registered. But when it gets enabled _after_ threads have
been created already then we need code to chase the threads and enable
it after the fact because we are not going to query the enablement in
curr->mm::whatever just to have another conditional and another
cacheline to access.
The only option is to reject enablement when there is already more than
one thread in the process, but there is a reasonable argument that a
process might only enable it for a subset of threads, which have actual
lock interaction and not bother with it for other things. I'm not seeing
a reason to restrict the flexibility of configuration just because you
envision magic use cases all over the place.
On the other hand there is no guarantee that libc registers RSEQ when a
thread is started as it can be disabled or not supported, so you have
exactly the same problem there that the code which wants to use it needs
to ensure that a RSEQ area is registered, no?
> [...]
>
>> The prctl allows you to query the state, so all parties can make
>> informed decisions. It's not any different from other mechanisms, which
>> require coordination between different parts.
>
> I'm fine with having prctl enable the feature (for the whole process)
> and query its state.
>
> The part I'm concerned with is the prctl disabling the feature, as
> we're losing the availability invariant after setup.
close(0);
has the same problem. How many instances of bugs in that area have you
seen so far?
>> What I've seen so far at least from the implementation is that it aims
>> to enable the maximum amount of features, aka. overhead, unconditionally
>> even if nothing uses them, e.g. CID.
>
> I don't mind having things disabled on process startup and then opt-in.
> What I care about though is that the enabled state stays invariant across
> the entire process after setting this up at program startup.
Userspace is perfectly equipped to do so and the kernel is not there to
prevent user space from shooting itself into the foot.
>> As I pointed out in the previous submission, the benefits of time slice
>> extensions are limited. In low contention scenarios they result in
>> measurable regressions, so it's not the magic panacea which solves all
>> locking/critical section problems at once.
>
> I agree that whatever code we add to an uncontended spinlock fast path
> will show up in microbenchmark measurements.
It not only shows up in microbenchmarks. It shows up in real world
scenarios too. So enabling and using it in random places just because
you can will not necessarily result in any performance gain, it might
actually get worse.
>> The idea that cobbling random libraries together in the hope that
>> everything goes well has never worked. That's simply a wet dream and
>> Java has proven that to the maximum extent decades ago. Nevertheless all
>> other programming models went down the same yawning abyss and everyone
>> expects that the kernel is magically solving their problems by adding
>> more abusable [mis]features.
>>
>> Systems have to be designed carefully as a whole if you want to achieve
>> the maximum performance. That's not any different from other targets
>> like real-time. A real-time enabled kernel does not magically create a
>> real-time system.
> [...]
>
> I think we are talking about two different program/libraries composition
> use-cases here.
>
> AFAIU, the aspect you are focused on is whether we should allow users of
> slice extension to nest. I agree with you that we should document this
> as unsupported, since the goal of slice extension is really for short
> spinlock critical sections, and nesting of those goes against that
> basic definition.
This is not about nesting. This is about the completely unrealistic idea
that combining random libraries will result in a functional optimized
system. If you want to ensure that nothing can disable it then implement
a syscall filter which rejects the disable command. That's user space
policy, not kernel side hardcoded policy.
> The concern I am raising here is different. It's about just _using_
> slice extension from various entities (program, libraries) within a
> process, without any nesting of slice extension requests.
>
> If libc successfully enables slice extension in its startup, the
> kernel should guarantee that it stays invariant for the lifetime
> of the program so libc can optimize its code accordingly, or use
> a fallback, without requiring additional per-thread variable checks
> in its fast paths.
Even if libc enables it and something else disables it, then the only
downside is that user space pointlessly does the request dance:
set_request()
critical_section()
clear_request()
if (granted()) // Guaranteed to be false
sys_rseq_slice_yield()
The resulting harm is that requests are ignored by the kernel, so the
"optimized" code is not getting what it expects and executes 3
instructions for nothing. That's all. So where is your problem?
Thanks,
tglx
* Thomas Gleixner: > I'm not completely opposed to make it process wide. For threads created > after enablement, that's trivial because that can be done when the per > thread RSEQ is registered. But when it gets enabled _after_ threads have > been created already then we need code to chase the threads and enable > it after the fact because we are not going to query the enablement in > curr->mm::whatever just to have another conditional and another > cacheline to access. In glibc, we make sure that the registration for restartable sequences happens before any user code (with the exception of IFUNC resolvers) can run. This includes code from signal handlers. We started masking signals on newly created threads for this reason, to make these partially initialized states unobservable. It's not clear to me what the expected outcome is. If we ever want to offer deadline extension as a mutex attribute (for example), then we have to switch this on at process start unconditionally because we don't know if this new API will be used by the new process (potentially after dlopen, so we can't even use things likely analyzing the symbol footprint ahead of time). > The only option is to reject enablement when there is already more than > one thread in the process, but there is a reasonable argument that a > process might only enable it for a subset of threads, which have actual > lock interaction and not bother with it for other things. I'm not seeing > a reason to restrict the flexibility of configuration just because you > envision magic use cases all over the place. Sure, but it looks like this needs a custom/minimal libc. It's like repurposing set_robust_list for something else. It can be done, but it has a significant cost in terms of compatibility because some functionality (that other libraries in the process depend on) will stop working. 
> On the other hand there is no guarantee that libc registers RSEQ when a
> thread is started as it can be disabled or not supported, so you have
> exactly the same problem there that the code which wants to use it needs
> to ensure that a RSEQ area is registered, no?

With glibc, if RSEQ is registered on the main thread, it will be
registered on all other threads, too. Technically, it's possible to
unregister RSEQ with the kernel, of course, but that's totally
undefined, like unmapping memory originally returned from malloc.

>>> The prctl allows you to query the state, so all parties can make
>>> informed decisions. It's not any different from other mechanisms, which
>>> require coordination between different parts.
>>
>> I'm fine with having prctl enable the feature (for the whole process)
>> and query its state.
>>
>> The part I'm concerned with is the prctl disabling the feature, as
>> we're losing the availability invariant after setup.
>
> close(0);
>
> has the same problem. How many instances of bugs in that area have you
> seen so far?

We've had significant issues due to incorrect close calls (maybe not
close(0) in particular, but definitely with double-closes removing
descriptors created by other threads).

We need the prctl to unregister for CRIU, though, otherwise CRIU won't
be able to use glibc directly (or would have to re-exec itself in a new
configuration).

Thanks,
Florian
On 2026-01-13 18:45, Florian Weimer wrote:
> * Thomas Gleixner:
>
>> I'm not completely opposed to make it process wide. For threads created
>> after enablement, that's trivial because that can be done when the per
>> thread RSEQ is registered. But when it gets enabled _after_ threads have
>> been created already then we need code to chase the threads and enable
>> it after the fact because we are not going to query the enablement in
>> curr->mm::whatever just to have another conditional and another
>> cacheline to access.
>
> In glibc, we make sure that the registration for restartable sequences
> happens before any user code (with the exception of IFUNC resolvers) can
> run. This includes code from signal handlers. We started masking
> signals on newly created threads for this reason, to make these
> partially initialized states unobservable.
>
> It's not clear to me what the expected outcome is. If we ever want to
> offer deadline extension as a mutex attribute (for example), then we
> have to switch this on at process start unconditionally because we don't
> know if this new API will be used by the new process (potentially after
> dlopen, so we can't even use things like analyzing the symbol
> footprint ahead of time).
>
>> The only option is to reject enablement when there is already more than
>> one thread in the process, but there is a reasonable argument that a
>> process might only enable it for a subset of threads, which have actual
>> lock interaction and not bother with it for other things. I'm not seeing
>> a reason to restrict the flexibility of configuration just because you
>> envision magic use cases all over the place.
>
> Sure, but it looks like this needs a custom/minimal libc. It's like
> repurposing set_robust_list for something else. It can be done, but it
> has a significant cost in terms of compatibility because some
> functionality (that other libraries in the process depend on) will stop
> working.
My main concern is about the overhead of added system calls at thread
creation. I recall that doing an additional rseq system call at thread
creation was analyzed thoroughly for performance regressions at the
libc level. I would not want to start requiring libc to issue a
handful of additional prctl system calls per thread creation for no good
reason.

I don't mind that much whether we enable slice extension per process or
per thread, but what I do mind in the case of per-thread enabling is
whether the enabling scheme can be batched, so a user enables a set of
rseq features in one go, ideally at rseq registration. This is missing
with the prctl approach proposed by Thomas.

If the enabling is per-process, it's not so bad because there is already
a lot happening on exec, so I would not mind a prctl that much, but for
per-thread enabling I see the many individual system calls as an issue
we need to address.

> We need the prctl to unregister for CRIU, though, otherwise CRIU won't
> be able to use glibc directly (or would have to re-exec itself in a new
> configuration).

Good point that the unregister is needed. Which means the prctl is
probably needed then. But it does not solve the "handful of prctl per
thread creation" issue, which probably calls for something more at the
rseq system call level.

So I wonder if we could extend the rseq thread registration to also
specify a set of "features to enable" somehow ? This would still be
per-thread, but would not require additional prctl on thread creation.

Thanks Florian and Thomas for your input, this helps me corner the issue
that's nagging at me.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Sat, Jan 17, 2026 at 05:16:16PM +0100, Mathieu Desnoyers wrote:
> My main concern is about the overhead of added system calls at thread
> creation. I recall that doing an additional rseq system call at thread
> creation was analyzed thoroughly for performance regressions at the
> libc level. I would not want to start requiring libc to issue a
> handful of additional prctl system calls per thread creation for no good
> reason.
A wee something like so?
That would allow registering rseq with RSEQ_FLAG_SLICE_EXT_DEFAULT_ON
set and if all the stars align, it will then have it on at the end.
---
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -424,7 +424,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
return 0;
}
- if (unlikely(flags))
+ if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)))
return -EINVAL;
if (current->rseq.usrptr) {
@@ -459,8 +459,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (!access_ok(rseq, rseq_len))
return -EFAULT;
- if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ if (rseq_slice_extension_enabled() &&
+ flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+ }
scoped_user_write_access(rseq, efault) {
/*
@@ -488,6 +492,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
current->rseq.len = rseq_len;
current->rseq.sig = sig;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ current->rseq.slice.state.enabled = !!(rseqfl & RSEQ_CS_FLAG_SLICE_EXT_ENABLED);
+#endif
+
/*
* If rseq was previously inactive, and has just been
* registered, ensure the cpu_id_start and cpu_id fields
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -19,7 +19,8 @@ enum rseq_cpu_id_state {
};
enum rseq_flags {
- RSEQ_FLAG_UNREGISTER = (1 << 0),
+ RSEQ_FLAG_UNREGISTER = (1 << 0),
+ RSEQ_FLAG_SLICE_EXT_DEFAULT_ON = (1 << 1),
};
enum rseq_cs_flags_bit {
* Peter Zijlstra:

> On Sat, Jan 17, 2026 at 05:16:16PM +0100, Mathieu Desnoyers wrote:
>
>> My main concern is about the overhead of added system calls at thread
>> creation. I recall that doing an additional rseq system call at thread
>> creation was analyzed thoroughly for performance regressions at the
>> libc level. I would not want to start requiring libc to issue a
>> handful of additional prctl system calls per thread creation for no good
>> reason.
>
> A wee something like so?
>
> That would allow registering rseq with RSEQ_FLAG_SLICE_EXT_DEFAULT_ON
> set and if all the stars align, it will then have it on at the end.

I think this would work for glibc because it will only show up in
__rseq_flags if we set the flag on process startup, and then all threads
would get it. It doesn't matter that __rseq_flags is not per-thread.

Thanks,
Florian
On 2026-01-19 05:21, Peter Zijlstra wrote:
> On Sat, Jan 17, 2026 at 05:16:16PM +0100, Mathieu Desnoyers wrote:
>
>> My main concern is about the overhead of added system calls at thread
>> creation. I recall that doing an additional rseq system call at thread
>> creation was analyzed thoroughly for performance regressions at the
>> libc level. I would not want to start requiring libc to issue a
>> handful of additional prctl system calls per thread creation for no good
>> reason.
>
> A wee something like so?
>
> That would allow registering rseq with RSEQ_FLAG_SLICE_EXT_DEFAULT_ON
> set and if all the stars align, it will then have it on at the end.
That's a very good step in the right direction. I just wonder how
userspace is expected to learn that it runs on a kernel which
accepts the RSEQ_FLAG_SLICE_EXT_DEFAULT_ON flag ?
I think it could expect it when getauxval for AT_RSEQ_FEATURE_SIZE
includes the slice ext field. This gives us a cheap way to know
from userspace whether this new flag is supported or not.
One nit below:
[...]
> - if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
> + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
> rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> + if (rseq_slice_extension_enabled() &&
> + flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
I think you want to surround flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON with
parentheses () to have the expected operator priority.
Thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Mon, Jan 19, 2026 at 11:30:53AM +0100, Mathieu Desnoyers wrote:
> On 2026-01-19 05:21, Peter Zijlstra wrote:
> > On Sat, Jan 17, 2026 at 05:16:16PM +0100, Mathieu Desnoyers wrote:
> >
> > > My main concern is about the overhead of added system calls at thread
> > > creation. I recall that doing an additional rseq system call at thread
> > > creation was analyzed thoroughly for performance regressions at the
> > > libc level. I would not want to start requiring libc to issue a
> > > handful of additional prctl system calls per thread creation for no good
> > > reason.
> >
> > A wee something like so?
> >
> > That would allow registering rseq with RSEQ_FLAG_SLICE_EXT_DEFAULT_ON
> > set and if all the stars align, it will then have it on at the end.
>
> That's a very good step in the right direction. I just wonder how
> userspace is expected to learn that it runs on a kernel which
> accepts the RSEQ_FLAG_SLICE_EXT_DEFAULT_ON flag ?
>
> I think it could expect it when getauxval for AT_RSEQ_FEATURE_SIZE
> includes the slice ext field. This gives us a cheap way to know
> from userspace whether this new flag is supported or not.
struct rseq vs struct rseq_data. I don't think that slice field is
exposed on the user side of things.
I was thinking it could just try with the flag the first time, and then
record if that worked or not and use the 'correct' value for all future
rseq calls.
> One nit below:
>
> [...]
> > - if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
> > + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
> > rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> > + if (rseq_slice_extension_enabled() &&
> > + flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
>
> I think you want to surround flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON with
> parentheses () to have the expected operator priority.
Moo, done (I added that rseq_slice_extension_enabled() test later).
On 2026-01-19 06:03, Peter Zijlstra wrote:
> On Mon, Jan 19, 2026 at 11:30:53AM +0100, Mathieu Desnoyers wrote:
>> On 2026-01-19 05:21, Peter Zijlstra wrote:
>>> On Sat, Jan 17, 2026 at 05:16:16PM +0100, Mathieu Desnoyers wrote:
>>>
>>>> My main concern is about the overhead of added system calls at thread
>>>> creation. I recall that doing an additional rseq system call at thread
>>>> creation was analyzed thoroughly for performance regressions at the
>>>> libc level. I would not want to start requiring libc to issue a
>>>> handful of additional prctl system calls per thread creation for no good
>>>> reason.
>>>
>>> A wee something like so?
>>>
>>> That would allow registering rseq with RSEQ_FLAG_SLICE_EXT_DEFAULT_ON
>>> set and if all the stars align, it will then have it on at the end.
>>
>> That's a very good step in the right direction. I just wonder how
>> userspace is expected to learn that it runs on a kernel which
>> accepts the RSEQ_FLAG_SLICE_EXT_DEFAULT_ON flag ?
>>
>> I think it could expect it when getauxval for AT_RSEQ_FEATURE_SIZE
>> includes the slice ext field. This gives us a cheap way to know
>> from userspace whether this new flag is supported or not.
>
> struct rseq vs struct rseq_data. I don't think that slice field is
> exposed on the user side of things.
Yes it is. (unless I'm missing something ?)
See the original patch of this thread at https://lore.kernel.org/lkml/20251215155708.669472597@linutronix.de/
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
[...]
@@ -142,6 +174,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
>
> I was thinking it could just try with the flag the first time, and then
> record if that worked or not and use the 'correct' value for all future
> rseq calls.
That would work too, but would waste a system call on process startup in case
of failure. Not a big deal, but checking with getauxval would be better because
this information is already exported to userspace at program execution and
available without doing any system call.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Mon, Jan 19, 2026 at 12:10:27PM +0100, Mathieu Desnoyers wrote:

> > struct rseq vs struct rseq_data. I don't think that slice field is
> > exposed on the user side of things.
>
> Yes it is. (unless I'm missing something ?)

*sigh*, I was looking on the wrong machine, that didn't have the patches
applied :-(

I need to go wake up or something...
On Wed, Jan 14 2026 at 00:45, Florian Weimer wrote:
> * Thomas Gleixner:
>> I'm not completely opposed to make it process wide. For threads created
>> after enablement, that's trivial because that can be done when the per
>> thread RSEQ is registered. But when it gets enabled _after_ threads have
>> been created already then we need code to chase the threads and enable
>> it after the fact because we are not going to query the enablement in
>> curr->mm::whatever just to have another conditional and another
>> cacheline to access.
>
> In glibc, we make sure that the registration for restartable sequences
> happens before any user code (with the exception of IFUNC resolvers) can
> run. This includes code from signal handlers. We started masking
> signals on newly created threads for this reason, to make these
> partially initialized states unobservable.
>
> It's not clear to me what the expected outcome is. If we ever want to
> offer deadline extension as a mutex attribute (for example), then we
> have to switch this on at process start unconditionally because we don't
> know if this new API will be used by the new process (potentially after
> dlopen, so we can't even use things likely analyzing the symbol
> footprint ahead of time).
Sure, but then you can enable it at each thread start, no?
>> The only option is to reject enablement when there is already more than
>> one thread in the process, but there is a reasonable argument that a
>> process might only enable it for a subset of threads, which have actual
>> lock interaction and not bother with it for other things. I'm not seeing
>> a reason to restrict the flexibility of configuration just because you
>> envision magic use cases all over the place.
>
> Sure, but it looks like this needs a custom/minimal libc. It's like
> repurposing set_robust_list for something else. It can be done, but it
> has a significant cost in terms of compatibility because some
> functionality (that other libraries in the process depend on) will stop
> working.
The kernel is not there to cater magic user space expectations. It
provides interfaces and the minimal amount of policy.
If glibc wants to use it for mutexes (for all the wrong reasons) then
glibc needs to take care of enabling it like it does for registering
RSEQ for each newly created thread.
If glibc does not and the application does care for their particular
concurrency control, then it is the application's problem to ensure that
it is enabled for the threads it cares about, right?
>> On the other hand there is no guarantee that libc registers RSEQ when a
>> thread is started as it can be disabled or not supported, so you have
>> exactly the same problem there that the code which wants to use it needs
>> to ensure that a RSEQ area is registered, no?
>
> With glibc, if RSEQ is registered on the main thread, it will be
> registered on all other threads, too. Technically, it's possible to
> unregister RSEQ with the kernel, of course, but that's totally
> undefined, like unmapping memory originally returned from malloc.
This is again user land policy. glibc decides to register RSEQ for each
new thread, but the kernel does not care whether it does or not.
>>>> The prctl allows you to query the state, so all parties can make
>>>> informed decisions. It's not any different from other mechanisms, which
>>>> require coordination between different parts.
>>>
>>> I'm fine with having prctl enable the feature (for the whole process)
>>> and query its state.
>>>
>>> The part I'm concerned with is the prctl disabling the feature, as
>>> we're losing the availability invariant after setup.
>>
>> close(0);
>>
>> has the same problem. How many instances of bugs in that area have you
>> seen so far?
>
> We've had significant issues due to incorrect close calls (maybe not
> close(0) in particular, but definitely with double-closes removing
> descriptors created by other threads.
That's again not a kernel problem. The primary UNIX design principle is
to allow user space to shoot itself into the foot. There is zero reason
to change that unless it's a justified security issue.
Time slice extension best effort magic does definitely qualify for
that. It's harmless as the only side effect is that user space wastes
cycles...
Thanks,
tglx
The following commit has been merged into the sched/core branch of tip:
Commit-ID: d7a5da7a0f7fa7ff081140c4f6f971db98882703
Gitweb: https://git.kernel.org/tip/d7a5da7a0f7fa7ff081140c4f6f971db98882703
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:04 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:16 +01:00
rseq: Add fields and constants for time slice extension
Aside of a Kconfig knob add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de
---
Documentation/userspace-api/index.rst | 1 +-
Documentation/userspace-api/rseq.rst | 135 +++++++++++++++++++++++++-
include/linux/rseq_types.h | 28 ++++-
include/uapi/linux/rseq.h | 38 +++++++-
init/Kconfig | 12 ++-
kernel/rseq.c | 7 +-
6 files changed, 220 insertions(+), 1 deletion(-)
create mode 100644 Documentation/userspace-api/rseq.rst
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 8a61ac4..fa0fe8a 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
+ rseq
Security-related interfaces
===========================
diff --git a/Documentation/userspace-api/rseq.rst b/Documentation/userspace-api/rseq.rst
new file mode 100644
index 0000000..e1fdb0d
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,135 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to read the current CPU number, node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows to implement per CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success or otherwise with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can be also queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status is also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out. The length of the extension is
+determined by the ``rseq_slice_extension_nsec`` sysctl.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving userspace from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl.request = 1;
+ barrier(); // Prevent compiler reordering
+ critical_section();
+ barrier(); // Prevent compiler reordering
+ rseq->slice_ctrl.request = 0;
+ if (rseq->slice_ctrl.granted)
+ rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided at all::
+
+ if (rseq->slice_ctrl.granted)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+If the thread issues a syscall other than rseq_slice_yield(2) within the
+granted timeslice extension, the grant is also revoked and the CPU is
+relinquished immediately when entering the kernel. This is required as
+syscalls might consume arbitrary CPU time until they reach a scheduling
+point when the preemption model is either NONE or VOLUNTARY and therefore
+might exceed the grant by far.
+
+The preferred solution for user space is to use rseq_slice_yield(2) which
+is side effect free. The support for arbitrary syscalls is required to
+support onion layer architectured applications, where the code handling the
+critical section and requesting the time slice extension has no control
+over the code within the critical section.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 332dc14..67e40c0 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -73,12 +73,35 @@ struct rseq_ids {
};
/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+ u16 state;
+ struct {
+ u8 enabled;
+ u8 granted;
+ };
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+ union rseq_slice_state state;
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
- * @sig: Signature of critial section abort IPs
+ * @sig: Signature of critical section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@@ -86,6 +109,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ struct rseq_slice slice;
+#endif
};
#else /* CONFIG_RSEQ */
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 1b76d50..6afc219 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@@ -35,6 +41,11 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
};
/*
@@ -53,6 +64,27 @@ struct rseq_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * rseq_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -142,6 +174,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
diff --git a/init/Kconfig b/init/Kconfig
index fa79feb..00c6fbb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1938,6 +1938,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_SLICE_EXTENSION
+ bool "Enable rseq-based time slice extension mechanism"
+ depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ help
+ Allows userspace to request a limited time slice extension when
+ returning from an interrupt to user space via the RSEQ shared
+ data ABI. If granted, that allows to complete a critical section,
+ so that other threads are not stuck on a conflicted resource,
+ while the task is scheduled out.
+
+ If unsure, say N.
+
config RSEQ_STATS
default n
bool "Enable lightweight statistics of restartable sequences" if EXPERT
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 395d8b0..07c324d 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
if (!access_ok(rseq, rseq_len))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
scoped_user_write_access(rseq, efault) {
/*
* If the rseq_cs pointer is non-NULL on registration, clear it to
@@ -449,11 +454,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* clearing the fields. Don't bother reading it, just reset it.
*/
unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ unsafe_put_user(rseqfl, &rseq->flags, efault);
/* Initialize IDs in user space */
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
unsafe_put_user(0U, &rseq->node_id, efault);
unsafe_put_user(0U, &rseq->mm_cid, efault);
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
}
/*