Wire the grant decision function up in exit_to_user_mode_loop()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
kernel/entry/common.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,14 @@ void __weak arch_do_signal_or_restart(st
#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
#endif
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY (EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -28,8 +36,10 @@ static __always_inline unsigned long __e
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
-			schedule();
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+			if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+				schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
On 2025-12-15 13:24, Thomas Gleixner wrote:
> Wire the grant decision function up in exit_to_user_mode_loop()
>
[...]
> +/* TIF bits, which prevent a time slice extension. */
> +#ifdef CONFIG_PREEMPT_RT
> +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
> +#else
> +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)

It would be relevant to explain the difference between RT and non-RT
in the commit message.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Tue, Dec 16, 2025 at 10:37:24AM -0500, Mathieu Desnoyers wrote:
> On 2025-12-15 13:24, Thomas Gleixner wrote:
> > Wire the grant decision function up in exit_to_user_mode_loop()
> >
> [...]
> > +/* TIF bits, which prevent a time slice extension. */
> > +#ifdef CONFIG_PREEMPT_RT
> > +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
> > +#else
> > +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
>
> It would be relevant to explain the difference between RT and non-RT
> in the commit message.

So if you include TIF_NEED_RESCHED the extension period directly affects
the minimum scheduler delay like:

  min(extension_period, min_sched_delay)

because this is strictly a from-userspace thing. That is, it is
equivalent to the in-kernel preemption/IRQ disabled regions -- with
exception of the scheduler critical sections itself.

As I've argued many times -- I don't see a fundamental reason to not do
this for RT -- but perhaps further reduce the magic number such that its
impact cannot be observed on a 'good' machine.

But yes, if/when we do this on RT it needs the promise to aggressively
decrease the magic number any time it can actually be measured to impact
performance.

cyclictest should probably get a mode where it (ab)uses the feature to
failure before we do this.

Anyway, I don't mind excluding RT for now, but it *does* deserve a
comment.
On Fri, Dec 19 2025 at 12:07, Peter Zijlstra wrote:
> On Tue, Dec 16, 2025 at 10:37:24AM -0500, Mathieu Desnoyers wrote:
>> On 2025-12-15 13:24, Thomas Gleixner wrote:
>> > Wire the grant decision function up in exit_to_user_mode_loop()
>> >
>> [...]
>> > +/* TIF bits, which prevent a time slice extension. */
>> > +#ifdef CONFIG_PREEMPT_RT
>> > +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
>> > +#else
>> > +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
>>
>> It would be relevant to explain the difference between RT and non-RT
>> in the commit message.
>
> So if you include TIF_NEED_RESCHED the extension period directly affects
> the minimum scheduler delay like:
>
> min(extension_period, min_sched_delay)
>
> because this is strictly a from-userspace thing. That is, it is
> equivalent to the in-kernel preemption/IRQ disabled regions -- with
> exception of the scheduler critical sections itself.
>
> As I've argued many times -- I don't see a fundamental reason to not do
> this for RT -- but perhaps further reduce the magic number such that its
> impact cannot be observed on a 'good' machine.
>
> But yes, if/when we do this on RT it needs the promise to aggressively
> decrease the magic number any time it can actually be measured to impact
> performance.
>
> cyclictest should probably get a mode where it (ab)uses the feature to
> failure before we do this.
>
> Anyway, I don't mind excluding RT for now, but it *does* deserve a
> comment.

I know you argued about this many times, but I still maintain my point
of view that TIF_PREEMPT and TIF_PREEMPT_LAZY are fundamentally different:

   TIF_PREEMPT_LAZY grants a non-RT task to complete until it reaches
   return to user

   TIF_PREEMPT enforces preemption at the next possible preemption
   point

My main concern is this scenario:

   sched_other_task()
     request_slice_extension()

 ---> interrupt
        RT task is woken up

     return_to_user()
       grant_extension()
   ...

which means the RT task is delayed until the OTHER task relinquishes the
CPU voluntarily or via timeout.

That might be desired _if_ both tasks are using the same lock, but in
case of fully independent tasks it's not necessarily a good idea. If a
RT application uses locks in the RT tasks, then obviously latency is not
so much of a concern, but for optimized RT applications the side effect
of other processes getting a free pass to increase latency is troublesome.

So I prefer to keep the current semantics for RT. This can be revisited
of course when a proper evaluation has been done, but IMO there are too
many moving parts in a RT system to make this actually work correctly
under all circumstances.

I'll add proper comments to that effect.

Thanks,
tglx
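
As a rough illustration of the scenario above, the following userspace sketch models how the two mask definitions from the patch change the outcome. Everything prefixed with DEMO_ or demo_ is hypothetical: the bit values, the `requested` flag, and the assumption that the grant function refuses whenever its argument is non-zero are not taken from the kernel sources. The sketch further assumes that waking an RT task sets the non-lazy reschedule bit.

```c
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this illustration. */
#define DEMO_TIF_NEED_RESCHED		(1UL << 0)
#define DEMO_TIF_NEED_RESCHED_LAZY	(1UL << 1)
#define DEMO_TIF_SIGPENDING		(1UL << 2)
#define DEMO_TIF_UPROBE			(1UL << 3)

/* Stand-in for EXIT_TO_USER_MODE_WORK: every bit the exit loop handles. */
#define DEMO_EXIT_WORK	(DEMO_TIF_NEED_RESCHED | DEMO_TIF_NEED_RESCHED_LAZY | \
			 DEMO_TIF_SIGPENDING | DEMO_TIF_UPROBE)

/* Reschedule bits which an extension is allowed to defer. */
static unsigned long demo_slice_ext_sched(bool preempt_rt)
{
	if (preempt_rt)
		return DEMO_TIF_NEED_RESCHED_LAZY;
	return DEMO_TIF_NEED_RESCHED | DEMO_TIF_NEED_RESCHED_LAZY;
}

/* All remaining work bits deny the extension (mirrors TIF_SLICE_EXT_DENY). */
static unsigned long demo_slice_ext_deny(bool preempt_rt)
{
	return DEMO_EXIT_WORK & ~demo_slice_ext_sched(preempt_rt);
}

/*
 * Assumed model of the grant decision: grant only if the task asked for
 * an extension and no denying work bit is pending.
 */
static bool demo_grant(bool requested, bool deny_work_pending)
{
	return requested && !deny_work_pending;
}

/* Returns true when the modelled exit loop would call schedule(). */
static bool demo_would_schedule(bool preempt_rt, bool requested,
				unsigned long ti_work)
{
	if (!(ti_work & (DEMO_TIF_NEED_RESCHED | DEMO_TIF_NEED_RESCHED_LAZY)))
		return false;
	return !demo_grant(requested, ti_work & demo_slice_ext_deny(preempt_rt));
}
```

In this model, a task which requested an extension and then sees only the non-lazy bit is scheduled out immediately with preempt_rt set, but keeps running without it. That is the difference between the two #ifdef branches.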
On Sun, Jan 11, 2026 at 12:01:31PM +0100, Thomas Gleixner wrote:

> I know you argued about this many times, but I still maintain my point
> of view that TIF_PREEMPT and TIF_PREEMPT_LAZY are fundamentally different:
>
>    TIF_PREEMPT_LAZY grants a non-RT task to complete until it reaches
>    return to user
>
>    TIF_PREEMPT enforces preemption at the next possible preemption
>    point

This is only true for lazy preemption; and that is not the only possible
model.

> My main concern is this scenario:
>
>    sched_other_task()
>      request_slice_extension()
>
>  ---> interrupt
>         RT task is woken up
>
>      return_to_user()
>        grant_extension()
>    ...
>
> which means the RT task is delayed until the OTHER task relinquishes the
> CPU voluntarily or via timeout.

Which is exactly the same as if there were a kernel preempt_disable()
region.

> So I prefer to keep the current semantics for RT. This can be revisited
> of course when a proper evaluation has been done, but IMO there are too
> many moving parts in a RT system to make this actually work correctly
> under all circumstances.
>
> I'll add proper comments to that effect.

I've added:

+/*
+ * Since rseq slice ext has a direct correlation to the worst case
+ * scheduling latency (schedule is delayed after all), only have it affect
+ * LAZY reschedules on PREEMPT_RT for now.
+ *
+ * However, since this delay is only applicable to userspace, a value
+ * for rseq_slice_extension_nsec that is strictly less than the worst case
+ * kernel space preempt_disable() region, should mean the scheduling latency
+ * is not affected, even for !LAZY.
+ *
+ * However, since this value depends on the hardware at hand, it cannot be
+ * pre-determined in any sensible way. Hence punt on this problem for now.
+ */

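
The reasoning in that comment can be put into numbers. A minimal sketch, assuming the worst-case delay seen by a newly woken task is bounded by the longer of the longest kernel preempt_disable() region and the slice extension. The demo_ function names are made up for this illustration, and the model ignores every other source of latency.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed model: the worst-case scheduling delay is the larger of the
 * longest non-preemptible kernel region and the slice extension period.
 */
static uint64_t demo_worst_case_delay_ns(uint64_t kernel_nopreempt_ns,
					 uint64_t slice_ext_ns)
{
	return slice_ext_ns > kernel_nopreempt_ns ? slice_ext_ns
						  : kernel_nopreempt_ns;
}

/* True when the extension raises the bound above the kernel-only value. */
static bool demo_ext_affects_latency(uint64_t kernel_nopreempt_ns,
				     uint64_t slice_ext_ns)
{
	return demo_worst_case_delay_ns(kernel_nopreempt_ns, slice_ext_ns) >
	       kernel_nopreempt_ns;
}
```

With an invented 50000 ns worst-case kernel region, a 30000 ns extension leaves the bound at 50000 ns, while an 80000 ns extension raises it to 80000 ns. The kernel-side figure differs per machine, which is why the comment says a safe value cannot be fixed in advance.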
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 3c78aaec19b0621bf952756670c8b066a55202fe
Gitweb: https://git.kernel.org/tip/3c78aaec19b0621bf952756670c8b066a55202fe
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:31 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:19 +01:00
entry: Hook up rseq time slice extension
Wire the grant decision function up in exit_to_user_mode_loop()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.258157362@linutronix.de
---
kernel/entry/common.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 5c792b3..9ef63e4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,27 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
#endif
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+/*
+ * Since rseq slice ext has a direct correlation to the worst case
+ * scheduling latency (schedule is delayed after all), only have it affect
+ * LAZY reschedules on PREEMPT_RT for now.
+ *
+ * However, since this delay is only applicable to userspace, a value
+ * for rseq_slice_extension_nsec that is strictly less than the worst case
+ * kernel space preempt_disable() region, should mean the scheduling latency
+ * is not affected, even for !LAZY.
+ *
+ * However, since this value depends on the hardware at hand, it cannot be
+ * pre-determined in any sensible way. Hence punt on this problem for now.
+ */
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY (EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -28,8 +49,10 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
-			schedule();
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+			if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+				schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);