Aside from a Kconfig knob, add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this mechanism
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 129 ++++++++++++++++++++++++++++++++++
include/linux/rseq_types.h | 26 ++++++
include/uapi/linux/rseq.h | 28 +++++++
init/Kconfig | 12 +++
kernel/rseq.c | 8 ++
6 files changed, 204 insertions(+)
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
+ rseq
Security-related interfaces
===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,129 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow a thread to register a per-thread user-space memory area
+to be used as an ABI between kernel and user-space for three purposes:
+
+ * user-space restartable sequences
+
+ * quick access to read the current CPU number, node ID from user-space
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow user-space to perform update operations on
+per-cpu data without requiring heavy-weight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows per-CPU data to be implemented efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq user space pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success; otherwise it fails with one of the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can also be queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed in the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
+rseq slice_ctrl field. If the thread is interrupted and the interrupt
+results in a reschedule request in the kernel, then the kernel can grant a
+time slice extension and return to user space instead of scheduling
+out.
+
+The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
+and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
+field. If there is a reschedule of the thread after granting the extension,
+the kernel clears the granted bit to indicate that to user space.
+
+If the request bit is still set when leaving the critical section, user
+space can clear it and continue.
+
+If the granted bit is set, then user space has to invoke rseq_slice_yield()
+when leaving the critical section to relinquish the CPU. The kernel
+enforces this by arming a timer to prevent misbehaving user space from
+abusing this mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by user space.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl = REQUEST;
+ critical_section();
+ if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
+ if (rseq->slice_ctrl & GRANTED)
+ rseq_slice_yield();
+ }
+
+local_test_and_clear_bit() has to be local CPU atomic to prevent the
+obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
+without LOCK prefix. On architectures which do not provide lightweight CPU
+local atomics, this needs to be implemented with regular atomic operations.
+
+Setting REQUEST has no atomicity requirements as there is no concurrency
+vs. the GRANTED bit.
+
+Checking the GRANTED has no atomicity requirements as there is obviously a
+race which cannot be avoided at all::
+
+ if (rseq->slice_ctrl & GRANTED)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -71,12 +71,35 @@ struct rseq_ids {
};
/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+ u16 state;
+ struct {
+ u8 enabled;
+ u8 granted;
+ };
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+ union rseq_slice_state state;
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
* @sig: Signature of critial section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@@ -84,6 +107,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ struct rseq_slice slice;
+#endif
};
#else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@@ -35,6 +41,22 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
+};
+
+enum rseq_slice_bits {
+ /* Time slice extension ABI bits */
+ RSEQ_SLICE_EXT_REQUEST_BIT = 0,
+ RSEQ_SLICE_EXT_GRANTED_BIT = 1,
+};
+
+enum rseq_slice_masks {
+ RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
+ RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
};
/*
@@ -142,6 +164,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control word. CPU local atomic updates from
+ * kernel and user space.
+ */
+ __u32 slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
If unsure, say N.
+config RSEQ_SLICE_EXTENSION
+ bool "Enable rseq based time slice extension mechanism"
+ depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ help
+ Allows userspace to request a limited time slice extension when
+ returning from an interrupt to user space via the RSEQ shared
+ data ABI. If granted, that allows to complete a critical section,
+ so that other threads are not stuck on a conflicted resource,
+ while the task is scheduled out.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (put_user_masked_u64(0UL, &rseq->rseq_cs))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
+ if (put_user_masked_u32(rseqfl, &rseq->flags))
+ return -EFAULT;
+
/*
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
> On Sep 8, 2025, at 3:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
..
> +enum rseq_slice_masks {
> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
> };
>
> /*
> @@ -142,6 +164,12 @@ struct rseq {
> __u32 mm_cid;
>
> /*
> + * Time slice extension control word. CPU local atomic updates from
> + * kernel and user space.
> + */
> + __u32 slice_ctrl;
We intend to backport the slice extension feature to older kernel versions.
With the use of a new structure member for slice control, could there be a discrepancy
with the rseq structure size (older version) registered by libc? In that case the application
may not be able to use the slice extension feature unless libc’s use of rseq is disabled.
The application would have to verify the structure size, so should this be mentioned in the
documentation? Also, perhaps make the prctl() enable call return an error if the structure size
does not match?
With regard to the application determining the address and size of the rseq structure
registered by libc, what are your thoughts on getting that through the rseq(2)
system call or a prctl() call instead of dealing with the __weak symbols, as was discussed here:
https://lore.kernel.org/all/F9DBABAD-ABF0-49AA-9A38-BD4D2BE78B94@oracle.com/
Thanks,
-Prakash
> +
> + /*
> * Flexible array member at end of structure, after last feature field.
> */
> char end[];
On 2025-09-22 01:28, Prakash Sangappa wrote:
>
>
>> On Sep 8, 2025, at 3:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
> ..
>> +enum rseq_slice_masks {
>> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
>> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>> };
>>
>> /*
>> @@ -142,6 +164,12 @@ struct rseq {
>> __u32 mm_cid;
>>
>> /*
>> + * Time slice extension control word. CPU local atomic updates from
>> + * kernel and user space.
>> + */
>> + __u32 slice_ctrl;
>
> We intend to backport the slice extension feature to older kernel versions.
>
> With use of a new structure member for slice control, could there be discrepancy
> with rseq structure size(older version) registered by libc? In that case the application
> may not be able to use slice extension feature unless Libc’s use of rseq is disabled.
The rseq extension scheme allows this to seamlessly work.
You will need a glibc 2.41+, which uses the getauxval(3)
AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN to query the feature size
supported by the Linux kernel. It allocates a per-thread memory
area which is large enough to support that feature set, and
registers it to the kernel through rseq(2) on thread creation.
Note that before we had the extensible rseq scheme, glibc registered
a 32-byte structure (including padding at the end), which is considered
as the rseq "original" registration size.
The "mm_cid" field ends at 28 bytes, which leaves 4 bytes of padding at
the end of the original rseq structure. Considering that the time slice
extension fields will likely fit within those 4 bytes, I expect that
applications linked against glibc [2.35, 2.40] will also be able to use
those fields. Those applications should use getauxval(3)
AT_RSEQ_FEATURE_SIZE to validate whether the kernel populates this field
or if it's just padding.
Note that this all works even if you backport the feature to an older kernel:
the rseq extension scheme does not depend on querying the kernel version at
all. You will however be required to backport the support for additional
rseq fields that come before the time slice, such as node_id and mm_cid,
if they are not implemented in your older kernel.
>
> Application would have to verify structure size, so should it be mentioned in the
> documentation.
Yes, applications should check that the glibc's __rseq_size is large enough to fit
the new slice field(s), *and* for the original rseq size special case
(32 bytes including padding), those would need to query getauxval(3)
AT_RSEQ_FEATURE_SIZE to make sure the field is indeed supported.
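The two checks described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the 28-byte offset is taken from the "mm_cid ends at 28 bytes" remark in this thread, the helper names are made up for the example, and the registered size is passed in by the caller (for instance from glibc's __rseq_size) rather than looked up here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/auxv.h>

/* Fallback for older libc headers; 27 is the value in linux/auxvec.h. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27
#endif

/*
 * Assumption: slice_ctrl is a __u32 placed directly after mm_cid,
 * so it occupies bytes 28..31 of struct rseq.
 */
#define SLICE_CTRL_OFFSET 28u
#define SLICE_CTRL_END    (SLICE_CTRL_OFFSET + 4u)

/* Hypothetical helper: both sizes must cover the slice_ctrl field. */
static inline bool slice_ctrl_covered(size_t registered_size,
				      size_t kernel_feature_size)
{
	return registered_size >= SLICE_CTRL_END &&
	       kernel_feature_size >= SLICE_CTRL_END;
}

/* Kernel feature size from the auxiliary vector; 0 if the entry is absent. */
static inline size_t kernel_rseq_feature_size(void)
{
	return (size_t)getauxval(AT_RSEQ_FEATURE_SIZE);
}

/* Illustrative cases only, not output from a real system. */
static inline void slice_ctrl_layout_examples(void)
{
	/* Original 32-byte registration, kernel populates up to byte 32. */
	assert(slice_ctrl_covered(32, 32));
	/* Original 32-byte registration, but bytes 28..31 are only padding. */
	assert(!slice_ctrl_covered(32, 28));
}
```

An application would call slice_ctrl_covered(registered_size, kernel_rseq_feature_size()) once at startup and skip the slice extension path if it returns false.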
> Also, perhaps make the prctl() enable call return error, if structure size
> does not match?
That's not how the extensible scheme works.
Either glibc registers a 32-byte area (in which the time slice feature would
fit), or it registers an area large enough to fit all kernel supported features,
or it fails registration. And prctl() is per-process, whereas the rseq registration
is per-thread, so it's kind of weird to make prctl() fail if the current
thread's rseq is not registered.
>
> With regards to application determining the address and size of rseq structure
> registered by libc, what are you thoughts on getting that thru the rseq(2)
> system call or a prctl() call instead of dealing with the __week symbols as was discussed here.
>
> https://lore.kernel.org/all/F9DBABAD-ABF0-49AA-9A38-BD4D2BE78B94@oracle.com/
I think that the other leg of that email thread got to a resolution of both static and
dynamic use-cases through use of an extern __weak symbol, no [1] ? Not that I am against
adding a rseq(2) query for rseq address, size, and signature, but I just want to double
check that it would be there for convenience and is not actually needed in the typical
use-cases.
Thanks,
Mathieu
[1] https://lore.kernel.org/all/aKPFIQwg5zxSS5oS@google.com/
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
> On Sep 22, 2025, at 6:55 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2025-09-22 01:28, Prakash Sangappa wrote:
>>> On Sep 8, 2025, at 3:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>
>> ..
>>> +enum rseq_slice_masks {
>>> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
>>> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>>> };
>>>
>>> /*
>>> @@ -142,6 +164,12 @@ struct rseq {
>>> __u32 mm_cid;
>>>
>>> /*
>>> + * Time slice extension control word. CPU local atomic updates from
>>> + * kernel and user space.
>>> + */
>>> + __u32 slice_ctrl;
>> We intend to backport the slice extension feature to older kernel versions.
>> With use of a new structure member for slice control, could there be discrepancy
>> with rseq structure size(older version) registered by libc? In that case the application
>> may not be able to use slice extension feature unless Libc’s use of rseq is disabled.
>
> The rseq extension scheme allows this to seamlessly work.
>
> You will need a glibc 2.41+, which uses the getauxval(3)
> AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN to query the feature size
> supported by the Linux kernel. It allocates a per-thread memory
> area which is large enough to support that feature set, and
> registers it to the kernel through rseq(2) on thread creation.
Ok,
>
> Note that before we had the extensible rseq scheme, glibc registered
> a 32-byte structure (including padding at the end), which is considered
> as the rseq "original" registration size.
>
> The "mm_cid" field ends at 28 bytes, which leaves 4 bytes of padding at
> the end of the original rseq structure. Considering that the time slice
> extension fields will likely fit within those 4 bytes, I expect that
> applications linked against glibc [2.35, 2.40] will also be able to use
> those fields. Those applications should use getauxval(3)
> AT_RSEQ_FEATURE_SIZE to validate whether the kernel populates this field
> or if it's just padding.
The question was about the size of the rseq structure registered by glibc. If it is using
AT_RSEQ_FEATURE_SIZE to allocate the per-thread area for rseq, I suppose that
should be fine. However, the application would have to verify that __rseq_size is large
enough.
As for the kernel supporting slice extension, I expect that prctl(.., PR_RSEQ_SLICE_EXT_ENABLE)
would return an error if it is not supported. Won’t that be sufficient, or should the application
also check AT_RSEQ_FEATURE_SIZE?
>
> Note that this all works even if you backport the feature to an older kernel:
> the rseq extension scheme does not depend on querying the kernel version at
> all. You will however be required to backport the support for additional
> rseq fields that come before the time slice, such as node_id and mm_cid,
> if they are not implemented in your older kernel.
Yes, I need to look at the changes that need to be backported, including the dependent
'rseq: Optimize exit to user space' changes from the other patch series.
>
>> Application would have to verify structure size, so should it be mentioned in the
>> documentation.
>
> Yes, applications should check that the glibc's __rseq_size is large enough to fit
> the new slice field(s), *and* for the original rseq size special case
> (32 bytes including padding), those would need to query getauxval(3)
> AT_RSEQ_FEATURE_SIZE to make sure the field is indeed supported.
>
>> Also, perhaps make the prctl() enable call return error, if structure size
>> does not match?
>
> That's not how the extensible scheme works.
>
> Either glibc registers a 32-byte area (in which the time slice feature would
> fit), or it registers an area large enough to fit all kernel supported features,
> or it fails registration. And prctl() is per-process, whereas the rseq registration
> is per-thread, so it's kind of weird to make prctl() fail if the current
> thread's rseq is not registered.
I meant that the prctl(.., PR_RSEQ_SLICE_EXT_ENABLE) call is per thread and
sets the enabled bit in the per-thread rseq. Could this fail if the rseq struct size is not large enough?
>
>> With regards to application determining the address and size of rseq structure
>> registered by libc, what are you thoughts on getting that thru the rseq(2)
>> system call or a prctl() call instead of dealing with the __week symbols as was discussed here.
>> https://lore.kernel.org/all/F9DBABAD-ABF0-49AA-9A38-BD4D2BE78B94@oracle.com/
>
> I think that the other leg of that email thread got to a resolution of both static and
> dynamic use-cases through use of an extern __weak symbol, no [1] ? Not that I am against
> adding a rseq(2) query for rseq address, size, and signature, but I just want to double
> check that it would be there for convenience and is not actually needed in the typical
> use-cases.
Yes, mainly for convenience.
Thanks,
-Prakash
>
> Thanks,
>
> Mathieu
>
> [1] https://lore.kernel.org/all/aKPFIQwg5zxSS5oS@google.com/
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
Hello Prakash,
On 9/22/2025 10:58 AM, Prakash Sangappa wrote:
> With use of a new structure member for slice control, could there be discrepancy
> with rseq structure size(older version) registered by libc? In that case the application
> may not be able to use slice extension feature unless Libc’s use of rseq is disabled.
In this case, wouldn't GLIBC's rseq registration fail if presumed
__rseq_size is smaller than the "struct rseq" size? And if it has
allocated a large enough area, then the prctl() should help to query
the slice extension feature's availability.
--
Thanks and Regards,
Prateek
On 2025-09-22 01:57, K Prateek Nayak wrote:
> Hello Prakash,
>
> On 9/22/2025 10:58 AM, Prakash Sangappa wrote:
>> With use of a new structure member for slice control, could there be discrepancy
>> with rseq structure size(older version) registered by libc? In that case the application
>> may not be able to use slice extension feature unless Libc’s use of rseq is disabled.
>
> In this case, wouldn't GLIBC's rseq registration fail if presumed
> __rseq_size is smaller than the "struct rseq" size?
The registered rseq size cannot be smaller than 32 bytes, else
registration is refused by the system call (-EINVAL).
The new slice extension fields would fit within those 32 bytes, so it
should always work.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 2025-09-08 18:59, Thomas Gleixner wrote:
> Aside of a Kconfig knob add the following items:
>
> - Two flag bits for the rseq user space ABI, which allow user space to
> query the availability and enablement without a syscall.
>
> - A new member to the user space ABI struct rseq, which is going to be
> used to communicate request and grant between kernel and user space.
>
> - A rseq state struct to hold the kernel state of this
>
> - Documentation of the new mechanism
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> Documentation/userspace-api/index.rst | 1
> Documentation/userspace-api/rseq.rst | 129 ++++++++++++++++++++++++++++++++++
> include/linux/rseq_types.h | 26 ++++++
> include/uapi/linux/rseq.h | 28 +++++++
> init/Kconfig | 12 +++
> kernel/rseq.c | 8 ++
> 6 files changed, 204 insertions(+)
>
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -21,6 +21,7 @@ System calls
> ebpf/index
> ioctl/index
> mseal
> + rseq
>
> Security-related interfaces
> ===========================
> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,129 @@
> +=====================
> +Restartable Sequences
> +=====================
> +
> +Restartable Sequences allow to register a per thread userspace memory area
> +to be used as an ABI between kernel and user-space for three purposes:
> +
> + * user-space restartable sequences
> +
> + * quick access to read the current CPU number, node ID from user-space
Also reading the "concurrency ID" (mm_cid).
> +
> + * scheduler time slice extensions
> +
> +Restartable sequences (per-cpu atomics)
> +---------------------------------------
> +
> +Restartables sequences allow user-space to perform update operations on
> +per-cpu data without requiring heavy-weight atomic operations. The actual
> +ABI is unfortunately only available in the code and selftests.
Note that I've written a man page, available here:
https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2
which describes the ABI.
> +
> +Quick access to CPU number, node ID
> +-----------------------------------
> +
> +Allows to implement per CPU data efficiently. Documentation is in code and
> +selftests. :(
At what level should we document this here ? Would it be OK to show examples
that rely on librseq helpers ?
> +
> +Scheduler time slice extensions
> +-------------------------------
> +
Note: I suspect we'll also want to add this section to the rseq(2) man page.
> +This allows a thread to request a time slice extension when it enters a
> +critical section to avoid contention on a resource when the thread is
> +scheduled out inside of the critical section.
> +
> +The prerequisites for this functionality are:
> +
> + * Enabled in Kconfig
> +
> + * Enabled at boot time (default is enabled)
> +
> + * A rseq user space pointer has been registered for the thread
> +
> +The thread has to enable the functionality via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> +
> +prctl() returns 0 on success and otherwise with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL Functionality not available or invalid function arguments.
> + Note: arg4 and arg5 must be zero
> +ENOTSUPP Functionality was disabled on the kernel command line
> +ENXIO Available, but no rseq user struct registered
> +========= ==============================================================
> +
> +The state can be also queried via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
> +
> +prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
> +disabled. Otherwise it returns with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL Functionality not available or invalid function arguments.
> + Note: arg3 and arg4 and arg5 must be zero
> +========= ==============================================================
> +
> +The availability and status is also exposed via the rseq ABI struct flags
> +field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
> +``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user
> +space and only for informational purposes.
Do those flags have a meaning within the struct rseq_cs @flags field as
well, or just within the struct rseq flags field ?
> +
> +If the mechanism was enabled via prctl(), the thread can request a time
> +slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
> +rseq slice_ctrl field. If the thread is interrupted and the interrupt
> +results in a reschedule request in the kernel, then the kernel can grant a
> +time slice extension and return to user space instead of scheduling
> +out.
> +
> +The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
> +and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
> +field. If there is a reschedule of the thread after granting the extension,
> +the kernel clears the granted bit to indicate that to user space.
> +
> +If the request bit is still set when the leaving the critical section, user
> +space can clear it and continue.
> +
> +If the granted bit is set, then user space has to invoke rseq_slice_yield()
> +when leaving the critical section to relinquish the CPU. The kernel
> +enforces this by arming a timer to prevent misbehaving user space from
> +abusing this mechanism.
> +
> +If both the request bit and the granted bit are false when leaving the
> +critical section, then this indicates that a grant was revoked and no
> +further action is required by user space.
> +
> +The required code flow is as follows::
> +
> + rseq->slice_ctrl = REQUEST;
> + critical_section();
> + if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> + if (rseq->slice_ctrl & GRANTED)
> + rseq_slice_yield();
> + }
> +
> +local_test_and_clear_bit() has to be local CPU atomic to prevent the
> +obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
> +without LOCK prefix. On architectures, which do not provide lightweight CPU
> +local atomics this needs to be implemented with regular atomic operations.
> +
> +Setting REQUEST has no atomicity requirements as there is no concurrency
> +vs. the GRANTED bit.
> +
> +Checking the GRANTED has no atomicity requirements as there is obviously a
> +race which cannot be avoided at all::
> +
> + if (rseq->slice_ctrl & GRANTED)
> + -> Interrupt results in schedule and grant revocation
> + rseq_slice_yield();
> +
> +So there is no point in pretending that this might be solved by an atomic
> +operation.
See my cover letter comments about the algorithm above.
Thanks,
Mathieu
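For reference, the flow quoted above can be written out as a small user-space sketch. It covers only the documented request/release sequence, not any alternative raised in the cover letter discussion. Assumptions: the masks mirror RSEQ_SLICE_EXT_REQUEST and RSEQ_SLICE_EXT_GRANTED from the patch, the control word is passed in as a plain pointer instead of being located in the registered struct rseq, the test-and-clear uses a regular compiler atomic (the fallback the documentation allows on architectures without cheap CPU-local atomics), and slice_yield_stub() is a hypothetical stand-in for rseq_slice_yield() that only counts calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SLICE_REQUEST (1u << 0)	/* mirrors RSEQ_SLICE_EXT_REQUEST */
#define SLICE_GRANTED (1u << 1)	/* mirrors RSEQ_SLICE_EXT_GRANTED */

/* Hypothetical stand-in for rseq_slice_yield(); counts invocations. */
static unsigned int yield_calls;

static inline void slice_yield_stub(void)
{
	yield_calls++;
}

/* Clear REQUEST atomically and report whether it was still set. */
static inline bool test_and_clear_request(uint32_t *ctrl)
{
	assert(ctrl);
	return __atomic_fetch_and(ctrl, ~SLICE_REQUEST,
				  __ATOMIC_RELAXED) & SLICE_REQUEST;
}

/* Entering the critical section: a plain store of REQUEST suffices. */
static inline void slice_request(uint32_t *ctrl)
{
	assert(ctrl);
	__atomic_store_n(ctrl, SLICE_REQUEST, __ATOMIC_RELAXED);
}

/* Leaving the critical section: yield only if a grant is still active. */
static inline void slice_release(uint32_t *ctrl)
{
	if (!test_and_clear_request(ctrl)) {
		/* Racy by design: the grant may be revoked right after this load. */
		if (__atomic_load_n(ctrl, __ATOMIC_RELAXED) & SLICE_GRANTED)
			slice_yield_stub();
	}
}
```

The three outcomes on release map onto the documented cases: REQUEST still set (nothing happened, just clear it), GRANTED set (yield), and both clear (grant revoked, nothing to do).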
> +
> +The kernel enforces flag consistency and terminates the thread with SIGSEGV
> +if it detects a violation.
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -71,12 +71,35 @@ struct rseq_ids {
> };
>
> /**
> + * union rseq_slice_state - Status information for rseq time slice extension
> + * @state: Compound to access the overall state
> + * @enabled: Time slice extension is enabled for the task
> + * @granted: Time slice extension was granted to the task
> + */
> +union rseq_slice_state {
> + u16 state;
> + struct {
> + u8 enabled;
> + u8 granted;
> + };
> +};
> +
> +/**
> + * struct rseq_slice - Status information for rseq time slice extension
> + * @state: Time slice extension state
> + */
> +struct rseq_slice {
> + union rseq_slice_state state;
> +};
> +
> +/**
> * struct rseq_data - Storage for all rseq related data
> * @usrptr: Pointer to the registered user space RSEQ memory
> * @len: Length of the RSEQ region
> * @sig: Signature of critial section abort IPs
> * @event: Storage for event management
> * @ids: Storage for cached CPU ID and MM CID
> + * @slice: Storage for time slice extension data
> */
> struct rseq_data {
> struct rseq __user *usrptr;
> @@ -84,6 +107,9 @@ struct rseq_data {
> u32 sig;
> struct rseq_event event;
> struct rseq_ids ids;
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> + struct rseq_slice slice;
> +#endif
> };
>
> #else /* CONFIG_RSEQ */
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
> };
>
> enum rseq_cs_flags_bit {
> + /* Historical and unsupported bits */
> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
> + /* (3) Intentional gap to put new bits into a seperate byte */
> +
> + /* User read only feature flags */
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
> };
>
> enum rseq_cs_flags {
> @@ -35,6 +41,22 @@ enum rseq_cs_flags {
> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
> +
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
> + (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
> + (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
> +};
> +
> +enum rseq_slice_bits {
> + /* Time slice extension ABI bits */
> + RSEQ_SLICE_EXT_REQUEST_BIT = 0,
> + RSEQ_SLICE_EXT_GRANTED_BIT = 1,
> +};
> +
> +enum rseq_slice_masks {
> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
> };
>
> /*
> @@ -142,6 +164,12 @@ struct rseq {
> __u32 mm_cid;
>
> /*
> + * Time slice extension control word. CPU local atomic updates from
> + * kernel and user space.
> + */
> + __u32 slice_ctrl;
> +
> + /*
> * Flexible array member at end of structure, after last feature field.
> */
> char end[];
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>
> If unsure, say N.
>
> +config RSEQ_SLICE_EXTENSION
> + bool "Enable rseq based time slice extension mechanism"
> + depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
> + help
> + Allows userspace to request a limited time slice extension when
> + returning from an interrupt to user space via the RSEQ shared
> + data ABI. If granted, that allows to complete a critical section,
> + so that other threads are not stuck on a conflicted resource,
> + while the task is scheduled out.
> +
> + If unsure, say N.
> +
> config DEBUG_RSEQ
> default n
> bool "Enable debugging of rseq() system call" if EXPERT
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
> */
> SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
> {
> + u32 rseqfl = 0;
> +
> if (flags & RSEQ_FLAG_UNREGISTER) {
> if (flags & ~RSEQ_FLAG_UNREGISTER)
> return -EINVAL;
> @@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> if (put_user_masked_u64(0UL, &rseq->rseq_cs))
> return -EFAULT;
>
> + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
> + rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> +
> + if (put_user_masked_u32(rseqfl, &rseq->flags))
> + return -EFAULT;
> +
> /*
> * Activate the registration by setting the rseq area address, length
> * and signature in the task struct.
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 2025-09-11 11:41, Mathieu Desnoyers wrote:
> On 2025-09-08 18:59, Thomas Gleixner wrote:
[...]
>
>> +
>> +The kernel enforces flag consistency and terminates the thread with
>> SIGSEGV
>> +if it detects a violation.
>> --- a/include/linux/rseq_types.h
>> +++ b/include/linux/rseq_types.h
>> @@ -71,12 +71,35 @@ struct rseq_ids {
>> };
>> /**
>> + * union rseq_slice_state - Status information for rseq time slice
>> extension
>> + * @state: Compound to access the overall state
>> + * @enabled: Time slice extension is enabled for the task
>> + * @granted: Time slice extension was granted to the task
>> + */
>> +union rseq_slice_state {
>> + u16 state;
>> + struct {
>> + u8 enabled;
>> + u8 granted;
>> + };
>> +};
>> +
>> +/**
>> + * struct rseq_slice - Status information for rseq time slice extension
>> + * @state: Time slice extension state
>> + */
>> +struct rseq_slice {
>> + union rseq_slice_state state;
>> +};
>> +
>> +/**
>> * struct rseq_data - Storage for all rseq related data
>> * @usrptr: Pointer to the registered user space RSEQ memory
>> * @len: Length of the RSEQ region
>> * @sig: Signature of critial section abort IPs
>> * @event: Storage for event management
>> * @ids: Storage for cached CPU ID and MM CID
>> + * @slice: Storage for time slice extension data
>> */
>> struct rseq_data {
>> struct rseq __user *usrptr;
>> @@ -84,6 +107,9 @@ struct rseq_data {
>> u32 sig;
>> struct rseq_event event;
>> struct rseq_ids ids;
>> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
>> + struct rseq_slice slice;
>> +#endif
Note: we could move this #ifdef so that it surrounds the definitions
of both union rseq_slice_state and struct rseq_slice, and emit an
empty struct rseq_slice in the #else case, rather than placing the
#ifdef here.
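For illustration, a minimal sketch of that layout, written as standalone C so it builds outside the kernel. The u8/u16/u32 typedefs and the trimmed struct rseq_data_sketch are stand-ins invented for this example, and the empty struct relies on the GNU C extension the kernel already builds with:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the kernel's fixed-width types (assumption for this sketch). */
typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;

#ifdef CONFIG_RSEQ_SLICE_EXTENSION
union rseq_slice_state {
	u16		state;
	struct {
		u8	enabled;
		u8	granted;
	};
};

struct rseq_slice {
	union rseq_slice_state	state;
};
#else
/* Empty struct: a GNU C extension, it occupies no storage. */
struct rseq_slice { };
#endif

/* Trimmed, hypothetical stand-in for struct rseq_data. */
struct rseq_data_sketch {
	u32			sig;
	struct rseq_slice	slice;	/* no #ifdef needed at the use site */
};

#ifndef CONFIG_RSEQ_SLICE_EXTENSION
static_assert(sizeof(struct rseq_data_sketch) == sizeof(u32),
	      "empty slice member must not grow the struct");
#endif
```

With the config option off, the member is still declared but costs nothing, so users of the struct do not need their own #ifdef.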
Thanks,
Mathieu
>> };
>> #else /* CONFIG_RSEQ */
>> --- a/include/uapi/linux/rseq.h
>> +++ b/include/uapi/linux/rseq.h
>> @@ -23,9 +23,15 @@ enum rseq_flags {
>> };
>> enum rseq_cs_flags_bit {
>> + /* Historical and unsupported bits */
>> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
>> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
>> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
>> + /* (3) Intentional gap to put new bits into a seperate byte */
>> +
>> + /* User read only feature flags */
>> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
>> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
>> };
>> enum rseq_cs_flags {
>> @@ -35,6 +41,22 @@ enum rseq_cs_flags {
>> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
>> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
>> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
>> +
>> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
>> + (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
>> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
>> + (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
>> +};
>> +
>> +enum rseq_slice_bits {
>> + /* Time slice extension ABI bits */
>> + RSEQ_SLICE_EXT_REQUEST_BIT = 0,
>> + RSEQ_SLICE_EXT_GRANTED_BIT = 1,
>> +};
>> +
>> +enum rseq_slice_masks {
>> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
>> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>> };
>> /*
>> @@ -142,6 +164,12 @@ struct rseq {
>> __u32 mm_cid;
>> /*
>> + * Time slice extension control word. CPU local atomic updates from
>> + * kernel and user space.
>> + */
>> + __u32 slice_ctrl;
>> +
>> + /*
>> * Flexible array member at end of structure, after last feature
>> field.
>> */
>> char end[];
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>> If unsure, say N.
>> +config RSEQ_SLICE_EXTENSION
>> + bool "Enable rseq based time slice extension mechanism"
>> + depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY &&
>> HAVE_GENERIC_TIF_BITS
>> + help
>> + Allows userspace to request a limited time slice extension
>> when
>> + returning from an interrupt to user space via the RSEQ shared
>> + data ABI. If granted, that allows to complete a critical section,
>> + so that other threads are not stuck on a conflicted resource,
>> + while the task is scheduled out.
>> +
>> + If unsure, say N.
>> +
>> config DEBUG_RSEQ
>> default n
>> bool "Enable debugging of rseq() system call" if EXPERT
>> --- a/kernel/rseq.c
>> +++ b/kernel/rseq.c
>> @@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
>> */
>> SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
>> int, flags, u32, sig)
>> {
>> + u32 rseqfl = 0;
>> +
>> if (flags & RSEQ_FLAG_UNREGISTER) {
>> if (flags & ~RSEQ_FLAG_UNREGISTER)
>> return -EINVAL;
>> @@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>> if (put_user_masked_u64(0UL, &rseq->rseq_cs))
>> return -EFAULT;
>> + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
>> + rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>> +
>> + if (put_user_masked_u32(rseqfl, &rseq->flags))
>> + return -EFAULT;
>> +
>> /*
>> * Activate the registration by setting the rseq area address,
>> length
>> * and signature in the task struct.
>>
>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Hi Thomas,
On 9/8/25 3:59 PM, Thomas Gleixner wrote:
> Aside of a Kconfig knob add the following items:
>
> ---
> Documentation/userspace-api/index.rst | 1
> Documentation/userspace-api/rseq.rst | 129 ++++++++++++++++++++++++++++++++++
> include/linux/rseq_types.h | 26 ++++++
> include/uapi/linux/rseq.h | 28 +++++++
> init/Kconfig | 12 +++
> kernel/rseq.c | 8 ++
> 6 files changed, 204 insertions(+)
>
> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,129 @@
> +=====================
> +Restartable Sequences
> +=====================
> +
> +Restartable Sequences allow to register a per thread userspace memory area
> +to be used as an ABI between kernel and user-space for three purposes:
userspace or user-space or user space -- be consistent, please.
(above 2 times, and more below)
FWIW, "userspace" overwhelmingly wins in the kernel source tree.
On the $internet it looks like "user space" wins (quick look).
> +
> + * user-space restartable sequences
> +
> + * quick access to read the current CPU number, node ID from user-space
> +
> + * scheduler time slice extensions
> +
> +Restartable sequences (per-cpu atomics)
> +---------------------------------------
> +
> +Restartables sequences allow user-space to perform update operations on
> +per-cpu data without requiring heavy-weight atomic operations. The actual
just "heavyweight"
> +ABI is unfortunately only available in the code and selftests.
> +
> +Quick access to CPU number, node ID
> +-----------------------------------
> +
> +Allows to implement per CPU data efficiently. Documentation is in code and
> +selftests. :(
> +
> +Scheduler time slice extensions
> +-------------------------------
> +
> +This allows a thread to request a time slice extension when it enters a
> +critical section to avoid contention on a resource when the thread is
> +scheduled out inside of the critical section.
> +
> +The prerequisites for this functionality are:
> +
> + * Enabled in Kconfig
> +
> + * Enabled at boot time (default is enabled)
> +
> + * A rseq user space pointer has been registered for the thread
^^^^^^^^^^
> +
> +The thread has to enable the functionality via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> +
> +prctl() returns 0 on success and otherwise with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL Functionality not available or invalid function arguments.
> + Note: arg4 and arg5 must be zero
> +ENOTSUPP Functionality was disabled on the kernel command line
> +ENXIO Available, but no rseq user struct registered
> +========= ==============================================================
> +
> +The state can be also queried via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
> +
> +prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
> +disabled. Otherwise it returns with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL Functionality not available or invalid function arguments.
> + Note: arg3 and arg4 and arg5 must be zero
> +========= ==============================================================
> +
> +The availability and status is also exposed via the rseq ABI struct flags
> +field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
> +``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user
read-only for
> +space and only for informational purposes.
userspace ?
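As a rough sketch of how the two informational bits could be tested from a snapshot of the flags field: the bit numbers below are copied from the proposed enum rseq_cs_flags_bit, while the macro and helper names are made up for this example and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as proposed in the patch's enum rseq_cs_flags_bit. */
#define SLICE_EXT_AVAILABLE_BIT	4
#define SLICE_EXT_ENABLED_BIT	5

/* Hypothetical local names for the corresponding masks. */
#define SLICE_EXT_AVAILABLE	(1U << SLICE_EXT_AVAILABLE_BIT)
#define SLICE_EXT_ENABLED	(1U << SLICE_EXT_ENABLED_BIT)

static_assert(SLICE_EXT_AVAILABLE == 0x10 && SLICE_EXT_ENABLED == 0x20,
	      "feature bits sit above the three historical bits and the gap");

/* flags: a value read from the flags member of the registered struct rseq. */
static bool slice_ext_available(uint32_t flags)
{
	return (flags & SLICE_EXT_AVAILABLE) != 0;
}

static bool slice_ext_enabled(uint32_t flags)
{
	return (flags & SLICE_EXT_ENABLED) != 0;
}
```

Both helpers only read the value, which matches the read-only, informational nature of the bits described above.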
> +
> +If the mechanism was enabled via prctl(), the thread can request a time
> +slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
> +rseq slice_ctrl field. If the thread is interrupted and the interrupt
> +results in a reschedule request in the kernel, then the kernel can grant a
> +time slice extension and return to user space instead of scheduling
^^^^^^^^^^
> +out.
> +
> +The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
> +and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
> +field. If there is a reschedule of the thread after granting the extension,
> +the kernel clears the granted bit to indicate that to user space.
?
> +
> +If the request bit is still set when the leaving the critical section, user
> +space can clear it and continue.
?
> +
> +If the granted bit is set, then user space has to invoke rseq_slice_yield()
?
> +when leaving the critical section to relinquish the CPU. The kernel
> +enforces this by arming a timer to prevent misbehaving user space from
OK, I think that you like "user space". :)
> +abusing this mechanism.
> +
> +If both the request bit and the granted bit are false when leaving the
> +critical section, then this indicates that a grant was revoked and no
> +further action is required by user space.
> +
> +The required code flow is as follows::
> +
> + rseq->slice_ctrl = REQUEST;
> + critical_section();
> + if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> + if (rseq->slice_ctrl & GRANTED)
> + rseq_slice_yield();
> + }
> +
> +local_test_and_clear_bit() has to be local CPU atomic to prevent the
> +obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
> +without LOCK prefix. On architectures, which do not provide lightweight CPU
no comma after "architectures"
> +local atomics this needs to be implemented with regular atomic operations.
> +
> +Setting REQUEST has no atomicity requirements as there is no concurrency
> +vs. the GRANTED bit.
> +
> +Checking the GRANTED has no atomicity requirements as there is obviously a
> +race which cannot be avoided at all::
> +
> + if (rseq->slice_ctrl & GRANTED)
> + -> Interrupt results in schedule and grant revocation
> + rseq_slice_yield();
> +
> +So there is no point in pretending that this might be solved by an atomic
> +operation.
> +
> +The kernel enforces flag consistency and terminates the thread with SIGSEGV
> +if it detects a violation.
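To make the documented code flow concrete, here is a minimal user space sketch using the portable fallback mentioned above (a regular C11 atomic RMW rather than a CPU-local BTRL). The static slice_ctrl variable stands in for rseq::slice_ctrl, and rseq_slice_yield_stub() is a placeholder for rseq_slice_yield(); both are assumptions of this example, as are the helper names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values follow enum rseq_slice_masks in the patch. */
#define SLICE_EXT_REQUEST	(1U << 0)
#define SLICE_EXT_GRANTED	(1U << 1)

static_assert((SLICE_EXT_REQUEST & SLICE_EXT_GRANTED) == 0,
	      "request and granted must be distinct bits");

/* Hypothetical stand-in for the slice_ctrl member of struct rseq. */
static _Atomic uint32_t slice_ctrl;

/* Counts yields so the sketch is observable; not part of the ABI. */
static unsigned int yield_count;

/* Placeholder: the real rseq_slice_yield() relinquishes the CPU. */
static void rseq_slice_yield_stub(void)
{
	yield_count++;
}

/* Fallback for local_test_and_clear_bit(): a full atomic RMW. */
static bool test_and_clear_request(void)
{
	uint32_t old = atomic_fetch_and_explicit(&slice_ctrl,
						 ~SLICE_EXT_REQUEST,
						 memory_order_relaxed);
	return (old & SLICE_EXT_REQUEST) != 0;
}

/* Before the critical section: a plain store is sufficient. */
static void slice_request(void)
{
	atomic_store_explicit(&slice_ctrl, SLICE_EXT_REQUEST,
			      memory_order_relaxed);
}

/* After the critical section: clear the request or yield on a grant. */
static void slice_release(void)
{
	if (!test_and_clear_request()) {
		if (atomic_load_explicit(&slice_ctrl, memory_order_relaxed) &
		    SLICE_EXT_GRANTED)
			rseq_slice_yield_stub();
	}
}
```

A caller would bracket its critical section with slice_request() and slice_release(). If neither bit is set on release, the grant was revoked and nothing further happens, as described above.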
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
> };
>
> enum rseq_cs_flags_bit {
> + /* Historical and unsupported bits */
> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
> + /* (3) Intentional gap to put new bits into a seperate byte */
separate
("There is a rat in separate." -- old clue: sep'arat'e)
> +
> + /* User read only feature flags */
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
> };
>
> enum rseq_cs_flags {
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>
> If unsure, say N.
>
> +config RSEQ_SLICE_EXTENSION
> + bool "Enable rseq based time slice extension mechanism"
rseq-based
> + depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
> + help
> + Allows userspace to request a limited time slice extension when
Use tab + 2 spaces above instead of N spaces.
> + returning from an interrupt to user space via the RSEQ shared
> + data ABI. If granted, that allows to complete a critical section,
> + so that other threads are not stuck on a conflicted resource,
> + while the task is scheduled out.
--
~Randy