[v6] rseq: Implement time slice extension mechanism

[patch V6 10/11] entry: Hook up rseq time slice extension

Posted by Thomas Gleixner 1 month, 3 weeks ago

Wire the grant decision function up in exit_to_user_mode_loop()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/entry/common.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,14 @@ void __weak arch_do_signal_or_restart(st
 #define EXIT_TO_USER_MODE_WORK_LOOP	(EXIT_TO_USER_MODE_WORK)
 #endif
 
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY	(EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
 static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
 							      unsigned long ti_work)
 {
@@ -28,8 +36,10 @@ static __always_inline unsigned long __e
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
-			schedule();
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+			if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+				schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);

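For readers following the mask arithmetic, here is a minimal userspace sketch of how the two TIF_SLICE_EXT_DENY variants differ. It is not kernel code. The bit positions, the reduced EXIT_TO_USER_MODE_WORK stand-in, the SLICE_EXT_* names and slice_ext_denied() are assumptions made for illustration, and the real rseq_grant_slice_extension() presumably checks more than this one argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions; the real values are architecture specific. */
#define _TIF_NEED_RESCHED	(1UL << 0)
#define _TIF_NEED_RESCHED_LAZY	(1UL << 1)
#define _TIF_SIGPENDING		(1UL << 2)
#define _TIF_UPROBE		(1UL << 3)

/* Stand-in for EXIT_TO_USER_MODE_WORK: only a subset of the real mask. */
#define EXIT_TO_USER_MODE_WORK \
	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_SIGPENDING | _TIF_UPROBE)

/* Both variants spelled out so they can be compared in a single build. */
#define SLICE_EXT_SCHED_RT	(_TIF_NEED_RESCHED_LAZY)
#define SLICE_EXT_SCHED_NONRT	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
#define SLICE_EXT_DENY_RT	(EXIT_TO_USER_MODE_WORK & ~SLICE_EXT_SCHED_RT)
#define SLICE_EXT_DENY_NONRT	(EXIT_TO_USER_MODE_WORK & ~SLICE_EXT_SCHED_NONRT)

/* On RT a non-lazy reschedule request stays in the deny mask. */
static_assert((SLICE_EXT_DENY_RT & _TIF_NEED_RESCHED) != 0,
	      "RT must deny on _TIF_NEED_RESCHED");
static_assert((SLICE_EXT_DENY_NONRT & _TIF_NEED_RESCHED) == 0,
	      "non-RT must not deny on _TIF_NEED_RESCHED");

/* True when the pending work bits rule out a time slice extension. */
static bool slice_ext_denied(unsigned long ti_work, bool preempt_rt)
{
	unsigned long deny = preempt_rt ? SLICE_EXT_DENY_RT : SLICE_EXT_DENY_NONRT;

	return (ti_work & deny) != 0;
}
```

With these stand-in values, slice_ext_denied(_TIF_NEED_RESCHED, true) is true while slice_ext_denied(_TIF_NEED_RESCHED, false) is false. A lazy reschedule request alone is not a denial reason in either configuration, and any other pending work bit denies the extension in both.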
Re: [patch V6 10/11] entry: Hook up rseq time slice extension

Posted by Mathieu Desnoyers 1 month, 3 weeks ago

On 2025-12-15 13:24, Thomas Gleixner wrote:
> Wire the grant decision function up in exit_to_user_mode_loop()
> 
[...]
>   
> +/* TIF bits, which prevent a time slice extension. */
> +#ifdef CONFIG_PREEMPT_RT
> +# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED_LAZY)
> +#else
> +# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)

It would be relevant to explain the difference between RT and non-RT
in the commit message.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [patch V6 10/11] entry: Hook up rseq time slice extension

Posted by Peter Zijlstra 1 month, 3 weeks ago

On Tue, Dec 16, 2025 at 10:37:24AM -0500, Mathieu Desnoyers wrote:
> On 2025-12-15 13:24, Thomas Gleixner wrote:
> > Wire the grant decision function up in exit_to_user_mode_loop()
> > 
> [...]
> > +/* TIF bits, which prevent a time slice extension. */
> > +#ifdef CONFIG_PREEMPT_RT
> > +# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED_LAZY)
> > +#else
> > +# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
> 
> It would be relevant to explain the difference between RT and non-RT
> in the commit message.

So if you include TIF_NEED_RESCHED the extension period directly affects
the minimum scheduler delay like:

  min(extension_period, min_sched_delay)

because this is strictly a from-userspace thing. That is, it is
equivalent to the in-kernel preemption/IRQ disabled regions -- with
the exception of the scheduler critical sections themselves.

As I've argued many times -- I don't see a fundamental reason to not do
this for RT -- but perhaps further reduce the magic number such that its
impact cannot be observed on a 'good' machine.

But yes, if/when we do this on RT it needs the promise to aggressively
decrease the magic number any time it can actually be measured to impact
performance.

cyclictest should probably get a mode where it (ab)uses the feature to
failure before we do this.

Anyway, I don't mind excluding RT for now, but it *does* deserve a
comment.

Re: [patch V6 10/11] entry: Hook up rseq time slice extension

Posted by Thomas Gleixner 4 weeks ago

On Fri, Dec 19 2025 at 12:07, Peter Zijlstra wrote:
> On Tue, Dec 16, 2025 at 10:37:24AM -0500, Mathieu Desnoyers wrote:
>> On 2025-12-15 13:24, Thomas Gleixner wrote:
>> > Wire the grant decision function up in exit_to_user_mode_loop()
>> > 
>> [...]
>> > +/* TIF bits, which prevent a time slice extension. */
>> > +#ifdef CONFIG_PREEMPT_RT
>> > +# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED_LAZY)
>> > +#else
>> > +# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
>> 
>> It would be relevant to explain the difference between RT and non-RT
>> in the commit message.
>
> So if you include TIF_NEED_RESCHED the extension period directly affects
> the minimum scheduler delay like:
>
>   min(extension_period, min_sched_delay)
>
> because this is strictly a from-userspace thing. That is, it is
> equivalent to the in-kernel preemption/IRQ disabled regions -- with
> the exception of the scheduler critical sections themselves.
>
> As I've argued many times -- I don't see a fundamental reason to not do
> this for RT -- but perhaps further reduce the magic number such that its
> impact cannot be observed on a 'good' machine.
>
> But yes, if/when we do this on RT it needs the promise to aggressively
> decrease the magic number any time it can actually be measured to impact
> performance.
>
> cyclictest should probably get a mode where it (ab)uses the feature to
> failure before we do this.
>
> Anyway, I don't mind excluding RT for now, but it *does* deserve a
> comment.

I know you argued about this many times, but I still maintain my point
of view that TIF_PREEMPT and TIF_PREEMPT_LAZY are fundamentally different:

     TIF_PREEMPT_LAZY allows a non-RT task to keep running until it
     reaches return to user

     TIF_PREEMPT enforces preemption at the next possible preemption
     point

My main concern is this scenario:

   sched_other_task()
        request_slice_extension()

   ---> interrupt
        RT task is woken up

        return_to_user()
           grant_extension()
           ...

which means the RT task is delayed until the OTHER task relinquishes the
CPU voluntarily or via timeout.

That might be desired _if_ both tasks are using the same lock, but in
case of fully independent tasks it's not necessarily a good idea. If a
RT application uses locks in the RT tasks, then obviously latency is not
so much of a concern, but for optimized RT applications the side effect
of other processes getting a free pass to increase latency is troublesome.

So I prefer to keep the current semantics for RT. This can be revisited
of course when a proper evaluation has been done, but IMO there are too
many moving parts in a RT system to make this actually work correctly
under all circumstances.

I'll add proper comments to that effect.

Thanks,

        tglx

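The delay in the scenario above can be written down as a one-line bound. This is a sketch only: rt_wakeup_delay_ns() and its parameters are hypothetical names, it assumes the extension was granted even though the RT wakeup set a non-lazy reschedule request, and it ignores the cost of the context switch itself.

```c
#include <stdint.h>

/*
 * Delay seen by the woken RT task after the extension has been granted.
 * The extension ends at whichever comes first: the SCHED_OTHER task
 * giving up the CPU voluntarily, or the extension timeout expiring.
 */
static uint64_t rt_wakeup_delay_ns(uint64_t yield_after_ns, uint64_t timeout_ns)
{
	return yield_after_ns < timeout_ns ? yield_after_ns : timeout_ns;
}
```

A well-behaved task that yields early keeps the delay short, but the RT task has no control over that. The only bound it can rely on is the timeout.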
Re: [patch V6 10/11] entry: Hook up rseq time slice extension

Posted by Peter Zijlstra 3 weeks, 1 day ago

On Sun, Jan 11, 2026 at 12:01:31PM +0100, Thomas Gleixner wrote:

> I know you argued about this many times, but I still maintain my point
> of view that TIF_PREEMPT and TIF_PREEMPT_LAZY are fundamentally different:
> 
>      TIF_PREEMPT_LAZY allows a non-RT task to keep running until it
>      reaches return to user
> 
>      TIF_PREEMPT enforces preemption at the next possible preemption
>      point

This is only true for lazy preemption; and that is not the only possible
model.

> My main concern is this scenario:
> 
>    sched_other_task()
>         request_slice_extension()
> 
>    ---> interrupt
>         RT task is woken up
> 
>         return_to_user()
>            grant_extension()
>            ...
> 
> which means the RT task is delayed until the OTHER task relinquishes the
> CPU voluntarily or via timeout.

Which is exactly the same as if there were a kernel preempt_disable()
region.

> So I prefer to keep the current semantics for RT. This can be revisited
> of course when a proper evaluation has been done, but IMO there are too
> many moving parts in a RT system to make this actually work correctly
> under all circumstances.
> 
> I'll add proper comments to that effect.

I've added:

+/*
+ * Since rseq slice ext has a direct correlation to the worst case
+ * scheduling latency (schedule is delayed after all), only have it affect
+ * LAZY reschedules on PREEMPT_RT for now.
+ *
+ * However, since this delay is only applicable to userspace, a value
+ * for rseq_slice_extension_nsec that is strictly less than the worst case
+ * kernel space preempt_disable() region, should mean the scheduling latency
+ * is not affected, even for !LAZY.
+ *
+ * However, since this value depends on the hardware at hand, it cannot be
+ * pre-determined in any sensible way. Hence punt on this problem for now.
+ */

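The argument in the comment above can be sketched as a comparison of two windows during which a reschedule is held off. The helper names and the simplified model are assumptions for illustration: the sketch treats the worst-case kernel preempt_disable() region and the userspace slice extension as independent windows that do not stack, which a real system may not honor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Larger of the two windows during which a reschedule can be held off. */
static uint64_t worst_case_delay_ns(uint64_t preempt_off_ns, uint64_t slice_ext_ns)
{
	return slice_ext_ns > preempt_off_ns ? slice_ext_ns : preempt_off_ns;
}

/*
 * True when the extension is strictly shorter than the worst-case kernel
 * preempt-off region, so that in this model it does not raise the bound.
 */
static bool slice_ext_is_hidden(uint64_t preempt_off_ns, uint64_t slice_ext_ns)
{
	return slice_ext_ns < preempt_off_ns;
}
```

The difficulty named in the comment is that preempt_off_ns is a property of the hardware and workload at hand, so there is no single extension value for which slice_ext_is_hidden() holds on every machine.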
[tip: sched/core] entry: Hook up rseq time slice extension

Posted by tip-bot2 for Thomas Gleixner 2 weeks, 3 days ago

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     3c78aaec19b0621bf952756670c8b066a55202fe
Gitweb:        https://git.kernel.org/tip/3c78aaec19b0621bf952756670c8b066a55202fe
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 15 Dec 2025 17:52:31 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:19 +01:00

entry: Hook up rseq time slice extension

Wire the grant decision function up in exit_to_user_mode_loop()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.258157362@linutronix.de
---
 kernel/entry/common.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 5c792b3..9ef63e4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,27 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 #define EXIT_TO_USER_MODE_WORK_LOOP	(EXIT_TO_USER_MODE_WORK)
 #endif
 
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+/*
+ * Since rseq slice ext has a direct correlation to the worst case
+ * scheduling latency (schedule is delayed after all), only have it affect
+ * LAZY reschedules on PREEMPT_RT for now.
+ *
+ * However, since this delay is only applicable to userspace, a value
+ * for rseq_slice_extension_nsec that is strictly less than the worst case
+ * kernel space preempt_disable() region, should mean the scheduling latency
+ * is not affected, even for !LAZY.
+ *
+ * However, since this value depends on the hardware at hand, it cannot be
+ * pre-determined in any sensible way. Hence punt on this problem for now.
+ */
+# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY	(EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
 static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
 							      unsigned long ti_work)
 {
@@ -28,8 +49,10 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
-			schedule();
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+			if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+				schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);