Aside from a Kconfig knob, add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this mechanism
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
V3: Fix more typos and expressions - Randy
V2: Fix Kconfig indentation, fix typos and expressions - Randy
Make the control fields a struct and remove the atomicity requirement - Mathieu
---
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 118 ++++++++++++++++++++++++++++++++++
include/linux/rseq_types.h | 28 +++++++-
include/uapi/linux/rseq.h | 38 ++++++++++
init/Kconfig | 12 +++
kernel/rseq.c | 7 ++
6 files changed, 203 insertions(+), 1 deletion(-)
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
+ rseq
Security-related interfaces
===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,118 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to the current CPU number and node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows implementing per-CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section, to avoid contention on a resource if the thread is
+scheduled out inside the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success. Otherwise it fails with one of the following
+error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
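+A minimal sketch of enabling the mechanism with error handling, illustrative
+only and assuming the prctl() constants introduced here are available in the
+headers::
+
+	#include <sys/prctl.h>
+	#include <errno.h>
+
+	static int slice_ext_enable(void)
+	{
+		if (!prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+			   PR_RSEQ_SLICE_EXT_ENABLE, 0, 0))
+			return 0;
+
+		/* EINVAL, ENOTSUPP or ENXIO as per the table above */
+		return -errno;
+	}
+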
+The state can also be queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` if the mechanism is enabled or 0
+if it is disabled. Otherwise it fails with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+          Note: arg3, arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed in the flags field of the rseq
+ABI struct via ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and purely informational.
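+
+For instance, user space can check availability and enablement without a
+syscall (illustrative only)::
+
+	bool available = rseq->flags & RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+	bool enabled   = rseq->flags & RSEQ_CS_FLAG_SLICE_EXT_ENABLED;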
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl.request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::reqeust and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace has to invoke rseq_slice_yield()
+when leaving the critical section to relinquish the CPU. The kernel
+enforces this by arming a timer to prevent misbehaving userspace from
+abusing this mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl.request = 1;
+ critical_section();
+ if (rseq->slice_ctrl.granted)
+ rseq_slice_yield();
+
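+A slightly more complete sketch, which only uses the mechanism when it has
+been enabled, might look like this (illustrative only)::
+
+	if (rseq->flags & RSEQ_CS_FLAG_SLICE_EXT_ENABLED) {
+		rseq->slice_ctrl.request = 1;
+		critical_section();
+		if (rseq->slice_ctrl.granted) {
+			rseq_slice_yield();
+		} else {
+			/* Clear a still pending request; harmless if revoked */
+			rseq->slice_ctrl.request = 0;
+		}
+	} else {
+		critical_section();
+	}
+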
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided at all::
+
+ if (rseq->slice_ctrl & GRANTED)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -73,12 +73,35 @@ struct rseq_ids {
};
/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+ u16 state;
+ struct {
+ u8 enabled;
+ u8 granted;
+ };
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+ union rseq_slice_state state;
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
- * @sig: Signature of critial section abort IPs
+ * @sig: Signature of critical section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@@ -86,6 +109,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ struct rseq_slice slice;
+#endif
};
#else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@@ -35,6 +41,11 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
};
/*
@@ -53,6 +64,27 @@ struct rseq_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * struct rseq_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -142,6 +174,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1913,6 +1913,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_SLICE_EXTENSION
+ bool "Enable rseq-based time slice extension mechanism"
+ depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ help
+ Allows userspace to request a limited time slice extension when
+ returning from an interrupt to user space via the RSEQ shared
+	  data ABI. If granted, this allows the task to complete a critical
+	  section, so that other threads are not stuck on a contended
+	  resource while the task is scheduled out.
+
+ If unsure, say N.
+
config RSEQ_STATS
default n
bool "Enable lightweight statistics of restartable sequences" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (!access_ok(rseq, rseq_len))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
scoped_user_write_access(rseq, efault) {
/*
* If the rseq_cs pointer is non-NULL on registration, clear it to
@@ -449,11 +454,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
* clearing the fields. Don't bother reading it, just reset it.
*/
unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ unsafe_put_user(rseqfl, &rseq->flags, efault);
/* Initialize IDs in user space */
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
unsafe_put_user(0U, &rseq->node_id, efault);
unsafe_put_user(0U, &rseq->mm_cid, efault);
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
}
/*
On Wed, 29 Oct 2025 14:22:14 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:
> +/**
> + * rseq_slice_ctrl - Time slice extension control structure
> + * @all: Compound value
> + * @request: Request for a time slice extension
> + * @granted: Granted time slice extension
> + *
> + * @request is set by user space and can be cleared by user space or kernel
> + * space. @granted is set and cleared by the kernel and must only be read
> + * by user space.
> + */
> +struct rseq_slice_ctrl {
> + union {
> + __u32 all;
> + struct {
> + __u8 request;
> + __u8 granted;
> + __u16 __reserved;
> + };
> + };
> +};
> +
> /*
> * struct rseq is aligned on 4 * 8 bytes to ensure it is always
> * contained within a single cache-line.
> @@ -142,6 +174,12 @@ struct rseq {
> __u32 mm_cid;
>
> /*
> + * Time slice extension control structure. CPU local updates from
> + * kernel and user space.
> + */
> + struct rseq_slice_ctrl slice_ctrl;
> +
BTW, Google is interested in expanding this feature for VMs, as a VM kernel
spinlock also happens to be a user space spinlock from the host's point of
view. The KVM folks would rather have this implemented via the normal user
space method than do anything specific to the KVM internal code. Or at
least, keep it as non-intrusive as possible.
I talked with Mathieu and the KVM folks on how it could use the rseq
method, and it was suggested that qemu would set up a shared memory region
between the qemu thread and the virtual CPU and possibly submit a driver
that would expose this memory region. This could hook to a paravirt
spinlock that would set the bit stating the system is in a critical section
and clear it when all spin locks are released. If the vCPU was granted an
extra time slice, then it would call a hypercall that would do the yield.
When I mentioned this to Mathieu, he was against sharing the qemu thread's
rseq with the guest VM, as that would expose much more than what is needed
to the guest, especially since it needs to be a writable memory location.
What could be done is that another memory range is mapped between the qemu
thread and the vCPU memory, and the rseq would have a pointer to that memory.
To implement that, the slice_ctrl would need to be a pointer, where the
kernel would need to do another indirection to follow that pointer to
another location within the thread's memory.
Now I do believe that the return back to guest goes through a different
path. So this doesn't actually need to use rseq. But it would require a way
for the qemu thread to pass the memory to the kernel. I'm guessing that the
return to guest logic could share the code with the return to user logic
with just passing a struct rseq_slice_ctrl pointer to a function?
I'm bringing this up so that this use case is considered when implementing
the extended time slice, as I believe this would be a more common case than
the user space spin lock would be.
Thanks,
-- Steve
On 2025-10-29 09:22, Thomas Gleixner wrote:
[...]
> +
> +The thread has to enable the functionality via prctl(2)::
> +
> +	prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> +	      PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

Enabling specifically for each thread requires hooking into thread
creation, and is not a good fit for enabling this from executable or
library constructor function.

What is the use-case for enabling it only for a few threads within
a process rather than for the entire process ?

> +
> +The kernel indicates the grant by clearing rseq::slice_ctrl::reqeust and

reqeust -> request

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Fri, Oct 31 2025 at 15:31, Mathieu Desnoyers wrote:
> On 2025-10-29 09:22, Thomas Gleixner wrote:
> [...]
>> +
>> +The thread has to enable the functionality via prctl(2)::
>> +
>> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>
> Enabling specifically for each thread requires hooking into thread
> creation, and is not a good fit for enabling this from executable or
> library constructor function.
Where is the problem? It's not rocket science to handle that in user
space.
> What is the use-case for enabling it only for a few threads within
> a process rather than for the entire process ?
My general approach to all of this is to reduce overhead by default and
to provide the ability of fine grained control.
Using time slice extensions requires special care and a use case which
justifies the extra work to be done. So those people really can be asked
to do the extra work of enabling it, no?
I really don't get your attitude of enabling everything by default and
thereby inflicting the maximum amount of overhead on everything.
I've just wasted weeks to cure the fallout of that approach and it's
still unsatisfying because the whole CID management crud and related
overhead is there unconditionally with exactly zero users on any
distro. The special use cases of the uncompilable gurgle tcmalloc and
the esoteric librseq are not a justification at all to inflict that on
everyone.
Sadly nobody noticed when this got merged and now with RSEQ being widely
used by glibc it's even harder to turn the clock back. I'm still tempted
to break this half thought out ABI and make CID opt-in and default to
CID = CPUID if not activated.
Seriously the kernel is there to manage resources and provide resource
control, but it's not there to accommodate the laziness of user space
programmers and to proliferate the 'I envision this to be widely used'
wishful thinking mindset.
That said I'm not completely against making this per process, but then
it has to be enabled on the main thread _before_ it spawns threads and
rejected otherwise.
That said I just went down the obvious road of making it opt-in and
therefore low overhead and flexible by default. That is correct, simple
and straight forward. No?
Thanks,
tglx
On 2025-10-31 16:58, Thomas Gleixner wrote:
> On Fri, Oct 31 2025 at 15:31, Mathieu Desnoyers wrote:
>> On 2025-10-29 09:22, Thomas Gleixner wrote:
>> [...]
>>> +
>>> +The thread has to enable the functionality via prctl(2)::
>>> +
>>> +	prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>>> +	      PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>>
>> Enabling specifically for each thread requires hooking into thread
>> creation, and is not a good fit for enabling this from executable or
>> library constructor function.
>
> Where is the problem? It's not rocket science to handle that in user
> space.

Overhead at thread creation is a metric that is closely followed
by glibc developers. If we want a fine-grained per-thread control over
the slice extension mechanism, it would be good if we can think of a way
to allow userspace to enable it through either clone3 or rseq so
we don't add another round-trip to the kernel at thread creation.
This could be done either as an addition to this prctl, or as a
replacement if we don't want to add two ways to do the same thing.

AFAIU executable startup is not in the same ballpark performance wise
as thread creation, so adding the overhead of an additional system call
there is less frowned upon. This is why I am asking whether per-thread
granularity is needed.

>
>> What is the use-case for enabling it only for a few threads within
>> a process rather than for the entire process ?
>
> My general approach to all of this is to reduce overhead by default and
> to provide the ability of fine grained control.

That is a sound approach, I agree.

>
> Using time slice extensions requires special care and a use case which
> justifies the extra work to be done. So those people really can be asked
> to do the extra work of enabling it, no?

I don't mind that it needs to be enabled explicitly at all. What I am
asking here is what is the best way to express this enablement ABI.

Here I'm just trying to put myself into the shoes of userspace library
developers to see whether the proposed ABI is a good fit, or if due to
resource conflicts with other pieces of the ecosystem or because of
overhead reasons it will be unusable by the core userspace libraries
like libc and limited to be used by niche users.

>
> I really don't get your attitude of enabling everything by default and
> thereby inflicting the maximum amount of overhead on everything.
>
> I've just wasted weeks to cure the fallout of that approach and it's
> still unsatisfying because the whole CID management crud and related
> overhead is there unconditionally with exactly zero users on any
> distro. The special use cases of the uncompilable gurgle tcmalloc and
> the esoteric librseq are not a justification at all to inflict that on
> everyone.
>
> Sadly nobody noticed when this got merged and now with RSEQ being widely
> used by glibc it's even harder to turn the clock back. I'm still tempted
> to break this half thought out ABI and make CID opt-in and default to
> CID = CPUID if not activated.

That's a good idea. Making mm_cid use cpu_id by default would not break
anything in terms of hard limits. Sure it's not close to 0 anymore, but
no application should misbehave because of this.

Then there is the question of how it should be enabled. For mm_cid, it
really only makes sense per-process. One possibility here would be to
introduce an "rseq features" enablement prctl that affects the entire
process. It would have to be done while the process is single threaded.
This could gate both mm_cid and time slice extension.

>
> Seriously the kernel is there to manage resources and provide resource
> control, but it's not there to accommodate the laziness of user space
> programmers and to proliferate the 'I envision this to be widely used'
> wishful thinking mindset.

Gating the mm_cid with a default to cpu_id is fine with me.

>
> That said I'm not completely against making this per process, but then
> it has to be enabled on the main thread _before_ it spawns threads and
> rejected otherwise.

Agreed. And it would be nice if we can achieve rseq feature enablement
in a way that is relatively common for all rseq features (e.g. through a
single prctl option, applying per-process, requiring single-threaded
state).

>
> That said I just went down the obvious road of making it opt-in and
> therefore low overhead and flexible by default. That is correct, simple
> and straight forward. No?

As I pointed out above, I'm simply trying to find the way to express
this feature enablement in a way that's the best fit for the userspace
ecosystem as well, without that being too much trouble on the kernel
side.

I note your wish to gate the mm_cid with a similar enablement ABI, and
I'm OK with this unless an existing mm_cid user considers this a
significant ABI break. As maintainer of librseq I'm OK with this, we
should ask the tcmalloc maintainers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Mathieu Desnoyers:

> On 2025-10-31 16:58, Thomas Gleixner wrote:
>> On Fri, Oct 31 2025 at 15:31, Mathieu Desnoyers wrote:
>>> On 2025-10-29 09:22, Thomas Gleixner wrote:
>>> [...]
>>>> +
>>>> +The thread has to enable the functionality via prctl(2)::
>>>> +
>>>> +	prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>>>> +	      PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>>>
>>> Enabling specifically for each thread requires hooking into thread
>>> creation, and is not a good fit for enabling this from executable or
>>> library constructor function.
>>
>> Where is the problem? It's not rocket science to handle that in user
>> space.
>
> Overhead at thread creation is a metric that is closely followed
> by glibc developers. If we want a fine-grained per-thread control over
> the slice extension mechanism, it would be good if we can think of a way
> to allow userspace to enable it through either clone3 or rseq so
> we don't add another round-trip to the kernel at thread creation.
> This could be done either as an addition to this prctl, or as a
> replacement if we don't want to add two ways to do the same thing.

I think this is a bit exaggerated. 8-)

I'm more concerned about this: If it's a separate system call like the
quoted prctl, we'll likely have cases where the program launches and
this feature automatically gets enabled for the main thread by glibc.
Then the application installs a seccomp filter that doesn't allow the
prctl, and calls pthread_create. At this point we either end up with a
partially enabled feature (depending on which thread the code runs), or
we have to fail the pthread_create call. Neither option is great.

So something enabled by rseq flags seems better to me. Maybe
default-enable and disable with a non-zero flag if backwards
compatibility is sufficient?

As far as I understand it, this series has performance improvements that
more than offset the slice extension cost?

Thanks,
Florian
On Fri, Oct 31 2025 at 21:58, Thomas Gleixner wrote:
> On Fri, Oct 31 2025 at 15:31, Mathieu Desnoyers wrote:
> That said I'm not completely against making this per process, but then
> it has to be enabled on the main thread _before_ it spawns threads and
> rejected otherwise.
It's actually not so trivial because contrary to CID, which is per MM,
this is per process, and as a newly created thread must register RSEQ
memory first, there needs to be some 'inherited enablement on thread
creation' marker, which then needs to be taken into account when the new
thread registers its RSEQ memory with the kernel.
And no, we are not going to make this unconditionally enabled when RSEQ
is registered. That's just wrong as that 'oh so tiny overhead' of user
space access accumulates nicely in high frequency scheduling scenarios
as you can see from the numbers provided with the rseq and cid cleanups.
So while it's doable the real question is whether this is worth the
trouble and extra state handling all over the place. I doubt it is and
keeping the kernel simple is definitely not the wrong approach.
Thanks,
tglx
> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Aside of a Kconfig knob add the following items:
> 
> - Two flag bits for the rseq user space ABI, which allow user space to
> query the availability and enablement without a syscall.
> 
> - A new member to the user space ABI struct rseq, which is going to be
> used to communicate request and grant between kernel and user space.
> 
> - A rseq state struct to hold the kernel state of this
> 
> - Documentation of the new mechanism
> 
[…]
> +
> +If both the request bit and the granted bit are false when leaving the
> +critical section, then this indicates that a grant was revoked and no
> +further action is required by userspace.
> +
> +The required code flow is as follows::
> +
> +	rseq->slice_ctrl.request = 1;
> +	critical_section();
> +	if (rseq->slice_ctrl.granted)
> +		rseq_slice_yield();
> +
> +As all of this is strictly CPU local, there are no atomicity requirements.
> +Checking the granted state is racy, but that cannot be avoided at all::
> +
> +	if (rseq->slice_ctrl & GRANTED)

Could this be?

	if (rseq->slice_ctrl.granted)

> +	       -> Interrupt results in schedule and grant revocation
> +	rseq_slice_yield();
> +

-Prakash
On Thu, Oct 30 2025 at 22:01, Prakash Sangappa wrote:
>> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Aside of a Kconfig knob add the following items:
>>
>> - Two flag bits for the rseq user space ABI, which allow user space to
>> query the availability and enablement without a syscall.
>>
>> - A new member to the user space ABI struct rseq, which is going to be
>> used to communicate request and grant between kernel and user space.
>>
>> - A rseq state struct to hold the kernel state of this
>>
>> - Documentation of the new mechanism
>>
> […]
>> +
>> +If both the request bit and the granted bit are false when leaving the
>> +critical section, then this indicates that a grant was revoked and no
>> +further action is required by userspace.
>> +
>> +The required code flow is as follows::
>> +
>> +	rseq->slice_ctrl.request = 1;
>> +	critical_section();
>> +	if (rseq->slice_ctrl.granted)
>> +		rseq_slice_yield();
>> +
>> +As all of this is strictly CPU local, there are no atomicity requirements.
>> +Checking the granted state is racy, but that cannot be avoided at all::
>> +
>> +	if (rseq->slice_ctrl & GRANTED)
> Could this be?
> 	if (rseq->slice_ctrl.granted)

Yes.