Aside from a Kconfig knob, add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- An rseq state struct to hold the kernel state of this mechanism
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
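Note: the request/release flow documented in rseq.rst below can be
approximated in plain C for experimentation. This is only a sketch under
stated assumptions: slice_ctrl is a local stand-in for the slice_ctrl
member of the registered struct rseq, rseq_slice_yield_stub() and the
fake_kernel_*() helpers are hypothetical test doubles, and a full C11
atomic RMW is used in place of a CPU-local atomic (the fallback the
documentation describes for architectures without lightweight local
atomics). Only the two mask values are taken from this patch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values as added to include/uapi/linux/rseq.h by this patch */
#define RSEQ_SLICE_EXT_REQUEST	(1U << 0)
#define RSEQ_SLICE_EXT_GRANTED	(1U << 1)

/* Hypothetical stand-in for rseq->slice_ctrl of the registered area */
static _Atomic uint32_t slice_ctrl;

/* Hypothetical stub: counts invocations instead of yielding the CPU */
static unsigned int yield_calls;

static void rseq_slice_yield_stub(void)
{
	yield_calls++;
}

/* Entering the critical section: a plain store is sufficient */
static void slice_ext_request(void)
{
	atomic_store_explicit(&slice_ctrl, RSEQ_SLICE_EXT_REQUEST,
			      memory_order_relaxed);
}

/*
 * Leaving the critical section. Returns true when a granted extension
 * had to be given back via the yield stub.
 */
static bool slice_ext_release(void)
{
	/* Test and clear REQUEST in one RMW to avoid racing an interrupt */
	uint32_t old = atomic_fetch_and_explicit(&slice_ctrl,
						 ~RSEQ_SLICE_EXT_REQUEST,
						 memory_order_relaxed);

	/* REQUEST was still set: the kernel never acted on it */
	if (old & RSEQ_SLICE_EXT_REQUEST)
		return false;

	/* REQUEST was consumed: yield only if the grant is still in place */
	if (atomic_load_explicit(&slice_ctrl, memory_order_relaxed) &
	    RSEQ_SLICE_EXT_GRANTED) {
		rseq_slice_yield_stub();
		return true;
	}

	/* Neither bit set: the grant was revoked, nothing to do */
	return false;
}

/* Test doubles mimicking the kernel side of the protocol */
static void fake_kernel_grant(void)
{
	atomic_store_explicit(&slice_ctrl, RSEQ_SLICE_EXT_GRANTED,
			      memory_order_relaxed);
}

static void fake_kernel_revoke(void)
{
	atomic_store_explicit(&slice_ctrl, 0, memory_order_relaxed);
}
```

Calling slice_ext_request(), then one of the fake_kernel_*() helpers, then
slice_ext_release() walks through the three documented outcomes: request
untouched, extension granted, and grant revoked.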
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 129 ++++++++++++++++++++++++++++++++++
include/linux/rseq_types.h | 26 ++++++
include/uapi/linux/rseq.h | 28 +++++++
init/Kconfig | 12 +++
kernel/rseq.c | 8 ++
6 files changed, 204 insertions(+)
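
Note: the kernel-internal union rseq_slice_state added to rseq_types.h
overlays two byte-sized flags with one 16-bit compound word. A user space
replica, with the kernel's u8/u16 spelled as stdint types and two
hypothetical helpers that are not part of this patch, might illustrate why
the compound member is convenient:

```c
#include <stdbool.h>
#include <stdint.h>

/* Replica of the union from this patch, using stdint types */
union rseq_slice_state {
	uint16_t		state;
	struct {
		uint8_t		enabled;
		uint8_t		granted;
	};
};

/* Hypothetical helper: one store clears both flags at once */
static void slice_state_reset(union rseq_slice_state *s)
{
	s->state = 0;
}

/* Hypothetical helper: one load tells whether any flag is set */
static bool slice_state_any(const union rseq_slice_state *s)
{
	return s->state != 0;
}
```

Both helpers are independent of byte order because they only compare the
compound word against zero.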
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
+ rseq
Security-related interfaces
===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,129 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread user space memory area
+to be used as an ABI between kernel and user space for three purposes:
+
+ * user-space restartable sequences
+
+ * quick access to the current CPU number and node ID from user space
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow user space to perform update operations on
+per-CPU data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only documented in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+This allows implementing per-CPU data efficiently. Documentation is
+currently only available in the code and selftests.
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq user space pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success. On failure, one of these errors is reported:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can also be queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed in the flags field of the rseq
+ABI struct via ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
+rseq slice_ctrl field. If the thread is interrupted and the interrupt
+results in a reschedule request in the kernel, then the kernel can grant a
+time slice extension and return to user space instead of scheduling
+out.
+
+The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
+and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
+field. If there is a reschedule of the thread after granting the extension,
+the kernel clears the granted bit to indicate that to user space.
+
+If the request bit is still set when leaving the critical section, user
+space can clear it and continue.
+
+If the granted bit is set, then user space has to invoke rseq_slice_yield()
+when leaving the critical section to relinquish the CPU. The kernel
+enforces this by arming a timer to prevent misbehaving user space from
+abusing this mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by user space.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl = REQUEST;
+ critical_section();
+ if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
+ if (rseq->slice_ctrl & GRANTED)
+ rseq_slice_yield();
+ }
+
+local_test_and_clear_bit() has to be local CPU atomic to prevent the
+obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
+without LOCK prefix. On architectures which do not provide lightweight CPU
+local atomics, this needs to be implemented with regular atomic operations.
+
+Setting REQUEST has no atomicity requirements as there is no concurrency
+vs. the GRANTED bit.
+
+Checking the GRANTED bit has no atomicity requirements as there is a
+race which cannot be avoided at all::
+
+ if (rseq->slice_ctrl & GRANTED)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -71,12 +71,35 @@ struct rseq_ids {
};
/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+ u16 state;
+ struct {
+ u8 enabled;
+ u8 granted;
+ };
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+ union rseq_slice_state state;
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
* @sig: Signature of critial section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@@ -84,6 +107,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ struct rseq_slice slice;
+#endif
};
#else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@@ -35,6 +41,22 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
+};
+
+enum rseq_slice_bits {
+ /* Time slice extension ABI bits */
+ RSEQ_SLICE_EXT_REQUEST_BIT = 0,
+ RSEQ_SLICE_EXT_GRANTED_BIT = 1,
+};
+
+enum rseq_slice_masks {
+ RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
+ RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
};
/*
@@ -142,6 +164,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control word. CPU local atomic updates from
+ * kernel and user space.
+ */
+ __u32 slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
If unsure, say N.
+config RSEQ_SLICE_EXTENSION
+ bool "Enable rseq-based time slice extension mechanism"
+ depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ help
+ Allows user space to request a limited time slice extension when
+ returning from an interrupt to user space via the RSEQ shared
+ data ABI. If granted, this allows the task to complete a critical
+ section so that other threads are not stuck on a contended resource
+ while the task is scheduled out.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (put_user_masked_u64(0UL, &rseq->rseq_cs))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
+ if (put_user_masked_u32(rseqfl, &rseq->flags))
+ return -EFAULT;
+
/*
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
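
Note: with the hunk above, the kernel publishes the availability bit in the
flags member of struct rseq at registration time, so user space can test it
without a syscall. A minimal sketch of such a check, assuming the caller
passes in a snapshot of that flags word; the helper names are hypothetical
and only the two bit positions come from this patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as added to enum rseq_cs_flags_bit by this patch */
#define RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE	(1U << 4)
#define RSEQ_CS_FLAG_SLICE_EXT_ENABLED		(1U << 5)

/* Hypothetical helper: true if the kernel was built with the feature */
static bool slice_ext_available(uint32_t rseq_flags)
{
	return (rseq_flags & RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE) != 0;
}

/* Hypothetical helper: true if the feature is enabled for the thread */
static bool slice_ext_enabled(uint32_t rseq_flags)
{
	return (rseq_flags & RSEQ_CS_FLAG_SLICE_EXT_ENABLED) != 0;
}
```

Since the bits are documented as read-only and informational for user
space, the helpers only read the snapshot and never write back.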
Hi Thomas,

On 9/8/25 3:59 PM, Thomas Gleixner wrote:
> Aside of a Kconfig knob add the following items:
>
> ---
> Documentation/userspace-api/index.rst | 1
> Documentation/userspace-api/rseq.rst | 129 ++++++++++++++++++++++++++++++++++
> include/linux/rseq_types.h | 26 ++++++
> include/uapi/linux/rseq.h | 28 +++++++
> init/Kconfig | 12 +++
> kernel/rseq.c | 8 ++
> 6 files changed, 204 insertions(+)
>
> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,129 @@
> +=====================
> +Restartable Sequences
> +=====================
> +
> +Restartable Sequences allow to register a per thread userspace memory area
> +to be used as an ABI between kernel and user-space for three purposes:

userspace or user-space or user space -- be consistent, please.
(above 2 times, and more below)

FWIW, "userspace" overwhelmingly wins in the kernel source tree.
On the $internet it looks like "user space" wins (quick look).

> +
> + * user-space restartable sequences
> +
> + * quick access to read the current CPU number, node ID from user-space
> +
> + * scheduler time slice extensions
> +
> +Restartable sequences (per-cpu atomics)
> +---------------------------------------
> +
> +Restartables sequences allow user-space to perform update operations on
> +per-cpu data without requiring heavy-weight atomic operations. The actual

just heavyweight

> +ABI is unfortunately only available in the code and selftests.
> +
> +Quick access to CPU number, node ID
> +-----------------------------------
> +
> +Allows to implement per CPU data efficiently. Documentation is in code and
> +selftests. :(
> +
> +Scheduler time slice extensions
> +-------------------------------
> +
> +This allows a thread to request a time slice extension when it enters a
> +critical section to avoid contention on a resource when the thread is
> +scheduled out inside of the critical section.
> +
> +The prerequisites for this functionality are:
> +
> + * Enabled in Kconfig
> +
> + * Enabled at boot time (default is enabled)
> +
> + * A rseq user space pointer has been registered for the thread
           ^^^^^^^^^^
> +
> +The thread has to enable the functionality via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> +
> +prctl() returns 0 on success and otherwise with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL Functionality not available or invalid function arguments.
> + Note: arg4 and arg5 must be zero
> +ENOTSUPP Functionality was disabled on the kernel command line
> +ENXIO Available, but no rseq user struct registered
> +========= ==============================================================
> +
> +The state can be also queried via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
> +
> +prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
> +disabled. Otherwise it returns with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL Functionality not available or invalid function arguments.
> + Note: arg3 and arg4 and arg5 must be zero
> +========= ==============================================================
> +
> +The availability and status is also exposed via the rseq ABI struct flags
> +field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
> +``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user

read-only for

> +space and only for informational purposes.

userspace ?

> +
> +If the mechanism was enabled via prctl(), the thread can request a time
> +slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
> +rseq slice_ctrl field. If the thread is interrupted and the interrupt
> +results in a reschedule request in the kernel, then the kernel can grant a
> +time slice extension and return to user space instead of scheduling
                                      ^^^^^^^^^^
> +out.
> +
> +The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
> +and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
> +field. If there is a reschedule of the thread after granting the extension,
> +the kernel clears the granted bit to indicate that to user space.

?

> +
> +If the request bit is still set when the leaving the critical section, user
> +space can clear it and continue.

?

> +
> +If the granted bit is set, then user space has to invoke rseq_slice_yield()

?

> +when leaving the critical section to relinquish the CPU. The kernel
> +enforces this by arming a timer to prevent misbehaving user space from

OK, I think that you like "user space". :)

> +abusing this mechanism.
> +
> +If both the request bit and the granted bit are false when leaving the
> +critical section, then this indicates that a grant was revoked and no
> +further action is required by user space.
> +
> +The required code flow is as follows::
> +
> + rseq->slice_ctrl = REQUEST;
> + critical_section();
> + if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> + if (rseq->slice_ctrl & GRANTED)
> + rseq_slice_yield();
> + }
> +
> +local_test_and_clear_bit() has to be local CPU atomic to prevent the
> +obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
> +without LOCK prefix. On architectures, which do not provide lightweight CPU
                               no comma ^
> +local atomics this needs to be implemented with regular atomic operations.
> +
> +Setting REQUEST has no atomicity requirements as there is no concurrency
> +vs. the GRANTED bit.
> +
> +Checking the GRANTED has no atomicity requirements as there is obviously a
> +race which cannot be avoided at all::
> +
> + if (rseq->slice_ctrl & GRANTED)
> + -> Interrupt results in schedule and grant revocation
> + rseq_slice_yield();
> +
> +So there is no point in pretending that this might be solved by an atomic
> +operation.
> +
> +The kernel enforces flag consistency and terminates the thread with SIGSEGV
> +if it detects a violation.

> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
> };
>
> enum rseq_cs_flags_bit {
> + /* Historical and unsupported bits */
> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
> + /* (3) Intentional gap to put new bits into a seperate byte */

separate
("There is a rat in separate." -- old clue)
'arat'

> +
> + /* User read only feature flags */
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
> };
>
> enum rseq_cs_flags {

> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>
> If unsure, say N.
>
> +config RSEQ_SLICE_EXTENSION
> + bool "Enable rseq based time slice extension mechanism"

rseq-based

> + depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
> + help
> + Allows userspace to request a limited time slice extension when

Use tab + 2 spaces above instead of N spaces.

> + returning from an interrupt to user space via the RSEQ shared
> + data ABI. If granted, that allows to complete a critical section,
> + so that other threads are not stuck on a conflicted resource,
> + while the task is scheduled out.

--
~Randy