Use a bit in the rseq flags to indicate whether the thread got rescheduled
after the CPU time extension was granted. The user thread can check this
flag before calling sched_yield() to yield the CPU.
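
For illustration, one plausible reading of the intended user-space flow is
sketched below. It assumes the RSEQ_CS_FLAG_DELAY_RESCHED flag introduced
earlier in this series, the RSEQ_CS_FLAG_RESCHEDULED flag added here, a
hypothetical critical_section() doing the application work, and that the
caller has a pointer to its registered struct rseq (e.g. via glibc's
__rseq_offset). Plain read-modify-write of rseq->flags is shown for
brevity; the atomicity requirements of these updates are discussed later
in this thread.

  #include <sched.h>
  #include <linux/rseq.h>	/* struct rseq, RSEQ_CS_FLAG_* */

  /* Hypothetical application work done under the preemption delay request. */
  extern void critical_section(void);

  void critical_section_with_delay(volatile struct rseq *rs)
  {
  	/* Clear a stale RESCHEDULED indication, then request the delay. */
  	rs->flags &= ~RSEQ_CS_FLAG_RESCHEDULED;
  	rs->flags |= RSEQ_CS_FLAG_DELAY_RESCHED;

  	critical_section();

  	rs->flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;

  	/*
  	 * If the thread was already rescheduled during the granted extra
  	 * time, the extension has been consumed and yielding is pointless;
  	 * otherwise relinquish the CPU now.
  	 */
  	if (!(rs->flags & RSEQ_CS_FLAG_RESCHEDULED))
  		sched_yield();
  }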
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
include/linux/sched.h | 2 ++
include/uapi/linux/rseq.h | 10 ++++++++++
kernel/rseq.c | 13 +++++++++++++
kernel/sched/core.c | 5 ++---
4 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5d2819afd481..5df055f2dd9e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2258,6 +2258,7 @@ unsigned long sched_cpu_util(int cpu);
extern bool __rseq_delay_resched(void);
extern void rseq_delay_resched_arm_timer(void);
extern void rseq_delay_resched_tick(void);
+extern void rseq_delay_resched_clear(struct task_struct *tsk);
static inline bool rseq_delay_set_need_resched(void)
{
if (current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED) {
@@ -2271,6 +2272,7 @@ static inline bool __rseq_delay_resched(void) { return false; }
static inline void rseq_delay_resched_arm_timer(void) { }
static inline void rseq_delay_resched_tick(void) { }
static inline bool rseq_delay_set_need_resched(void) { return false; }
+static inline void rseq_delay_resched_clear(struct task_struct *tsk) { }
#endif
#ifdef CONFIG_SCHED_CORE
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 25fc636b17d5..f4813d931387 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -27,6 +27,7 @@ enum rseq_cs_flags_bit {
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
RSEQ_CS_FLAG_DELAY_RESCHED_BIT = 3,
+ RSEQ_CS_FLAG_RESCHEDULED_BIT = 4,
};
enum rseq_cs_flags {
@@ -38,6 +39,9 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
RSEQ_CS_FLAG_DELAY_RESCHED =
(1U << RSEQ_CS_FLAG_DELAY_RESCHED_BIT),
+ RSEQ_CS_FLAG_RESCHEDULED =
+ (1U << RSEQ_CS_FLAG_RESCHEDULED_BIT),
+
};
/*
@@ -135,6 +139,12 @@ struct rseq {
* Request by user thread to delay preemption. With use
* of a timer, kernel grants extra cpu time upto 30us for this
* thread before being rescheduled.
+ * - RSEQ_CS_FLAG_RESCHEDULED
+ * Set by the kernel if the thread was rescheduled within the extra
+ * time granted due to a RSEQ_CS_FLAG_DELAY_RESCHED request. This bit
+ * is checked by the thread before calling sched_yield() to yield the
+ * CPU. The user thread clears this bit when setting
+ * RSEQ_CS_FLAG_DELAY_RESCHED to request a preemption delay.
*/
__u32 flags;
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 8b6af4e12142..6331b653b402 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -480,6 +480,19 @@ void rseq_delay_resched_tick(void)
if (current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
set_tsk_need_resched(current);
}
+
+void rseq_delay_resched_clear(struct task_struct *tsk)
+{
+ u32 flags;
+
+ if (tsk->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED) {
+ tsk->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
+ if (copy_from_user_nofault(&flags, &tsk->rseq->flags, sizeof(flags)))
+ return;
+ flags |= RSEQ_CS_FLAG_RESCHEDULED;
+ copy_to_user_nofault(&tsk->rseq->flags, &flags, sizeof(flags));
+ }
+}
#endif /* CONFIG_RSEQ_RESCHED_DELAY */
#ifdef CONFIG_DEBUG_RSEQ
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e75ecbb2c1f7..ba1e4f6981cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6752,9 +6752,8 @@ static void __sched notrace __schedule(int sched_mode)
picked:
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
- if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
- prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
- prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
+ if(IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
+ rseq_delay_resched_clear(prev);
rq->last_seen_need_resched_ns = 0;
is_switch = prev != next;
--
2.43.5
2.43.5

On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:

Indicate this to whom? Can you please write descriptive subject lines
which summarize the change in a way that is comprehensible?

> +void rseq_delay_resched_clear(struct task_struct *tsk)
> +{
> +	u32 flags;
> +
> +	if (tsk->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED) {
> +		tsk->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
> +		if (copy_from_user_nofault(&flags, &tsk->rseq->flags, sizeof(flags)))
> +			return;
> +		flags |= RSEQ_CS_FLAG_RESCHEDULED;
> +		copy_to_user_nofault(&tsk->rseq->flags, &flags, sizeof(flags));
> +	}
> +}
>  #endif /* CONFIG_RSEQ_RESCHED_DELAY */
>
>  #ifdef CONFIG_DEBUG_RSEQ
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e75ecbb2c1f7..ba1e4f6981cd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6752,9 +6752,8 @@ static void __sched notrace __schedule(int sched_mode)
>  picked:
>  	clear_tsk_need_resched(prev);
>  	clear_preempt_need_resched();
> -	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
> -	    prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
> -		prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
> +	if(IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
> +		rseq_delay_resched_clear(prev);

Yet another unconditional function call for the sake of something which
is only used by special applications. This is the scheduler hotpath and
not a dumping ground for random functionality, which is even completely
redundant. Why redundant?

The kernel already handles in rseq that a task was scheduled out:

    schedule()
      prepare_task_switch()
        rseq_preempt()

rseq_preempt() sets RSEQ_EVENT_PREEMPT_BIT and TIF_NOTIFY_RESUME, which
causes exit to userspace to invoke __rseq_handle_notify_resume(). That's
the obvious place to handle this instead of inflicting it into the
scheduler hotpath.

No?

Thanks,

        tglx
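
For illustration only (this is not code from the thread): one way the clear
could be hooked into the notify-resume path, reusing rseq_delay_resched_clear()
from the patch above; the helper name below is hypothetical and the follow-up
further down explains why this alone does not fully solve the problem.

  /*
   * Sketch: assumes rseq_delay_resched_clear() from the patch above and
   * RSEQ_EVENT_PREEMPT_BIT as set by rseq_preempt(); would be called from
   * __rseq_handle_notify_resume().
   */
  static void rseq_delay_resched_notify_resume(struct task_struct *t)
  {
  	/*
  	 * rseq_preempt() recorded that the task was scheduled out and
  	 * raised TIF_NOTIFY_RESUME. Any previously granted extension is
  	 * moot, so reset the kernel state and set the RESCHEDULED flag
  	 * for user space.
  	 */
  	if (test_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask))
  		rseq_delay_resched_clear(t);
  }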
> On Aug 7, 2025, at 6:06 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
>
> Indicate this to whom? Can you please write descriptive subject lines
> which summarize the change in a way that is comprehensible?
>
>> +void rseq_delay_resched_clear(struct task_struct *tsk)
>> +{
>> +	u32 flags;
>> +
>> +	if (tsk->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED) {
>> +		tsk->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
>> +		if (copy_from_user_nofault(&flags, &tsk->rseq->flags, sizeof(flags)))
>> +			return;
>> +		flags |= RSEQ_CS_FLAG_RESCHEDULED;
>> +		copy_to_user_nofault(&tsk->rseq->flags, &flags, sizeof(flags));
>> +	}
>> +}
>>  #endif /* CONFIG_RSEQ_RESCHED_DELAY */
>>
>>  #ifdef CONFIG_DEBUG_RSEQ
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index e75ecbb2c1f7..ba1e4f6981cd 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6752,9 +6752,8 @@ static void __sched notrace __schedule(int sched_mode)
>>  picked:
>>  	clear_tsk_need_resched(prev);
>>  	clear_preempt_need_resched();
>> -	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
>> -	    prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
>> -		prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
>> +	if(IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> +		rseq_delay_resched_clear(prev);
>
> Yet another unconditional function call for the sake of something which
> is only used by special applications. This is the scheduler hotpath and
> not a dumping ground for random functionality, which is even completely
> redundant. Why redundant?
>
> The kernel already handles in rseq that a task was scheduled out:
>
>     schedule()
>       prepare_task_switch()
>         rseq_preempt()
>
> rseq_preempt() sets RSEQ_EVENT_PREEMPT_BIT and TIF_NOTIFY_RESUME, which
> causes exit to userspace to invoke __rseq_handle_notify_resume(). That's
> the obvious place to handle this instead of inflicting it into the
> scheduler hotpath.
>
> No?

Sure, I will look at moving the rseq_delay_resched_clear() call to
__rseq_handle_notify_resume().

-Prakash

>
> Thanks,
>
>        tglx
On Thu, Aug 07 2025 at 16:15, Prakash Sangappa wrote:
>> On Aug 7, 2025, at 6:06 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> rseq_preempt() sets RSEQ_EVENT_PREEMPT_BIT and TIF_NOTIFY_RESUME, which
>> causes exit to userspace to invoke __rseq_handle_notify_resume(). That's
>> the obvious place to handle this instead of inflicting it into the
>> scheduler hotpath.
>>
>> No?
>
> Sure, I will look at moving the rseq_delay_resched_clear() call to
> __rseq_handle_notify_resume().

I looked deeper into it and it does not completely solve the problem.

The approach of having a request bit and then a disconnected rescheduled
bit is not working. You need a proper contract between kernel and
userspace and you have to enforce it. You gracefully avoided providing an
actual ABI description and a user space test case for this....

You need two bits in rseq::flags: REQUEST and GRANTED

The flow is:

    local_set_bit(REQUEST, &rseq->flags);

    critical_section();

    if (!local_test_and_clear_bit(REQUEST, &rseq->flags)) {
        if (local_test_bit(GRANTED, &rseq->flags))
            sched_yield();
    }

local_set_bit() could be a simple

    rseq->flags |= REQUEST;

operation when and only when there is no other usage of rseq->flags than
this extension muck. Otherwise the resulting RMW would race against the
kernel updating flags.

If that's not guaranteed, and it won't be because flags might be used for
other things later, then local_set_bit() must use an instruction which is
atomic on the local CPU vs. interrupts, e.g. ORB on X86. There is no LOCK
prefix required as there is no cross-CPU concurrency possible due to rseq
being strictly thread local. The only way to avoid that is to provide a
distinct new rseq field for this, but that's a different debate to be had.

local_test_and_clear_bit() on the other hand _must_ always be thread-local
atomic to prevent the obvious RMW race. On X86 this is a simple BTR
without LOCK prefix. Only ALPHA, LONGARCH, MIPS, POWERPC and X86 provide
such local operations, on everything else you need to fall back to a full
atomic.

local_test_bit() has no atomicity requirements as there is obviously a
race which cannot be avoided:

    if (local_test_bit(GRANTED))
        -> sched_yield();

If the interrupt hits between the test and the actual syscall entry, then
the kernel might reschedule and clear the grant. And no,
local_test_and_clear(GRANTED) does not help either, because if that
evaluates to true, then the syscall has to be issued anyway to reset the
kernel state, for the price of a brief period of inconsistent state
between kernel and user space, which is not an option at all.

The kernel side does in the NEED_RESCHED check:

    if (!tsk->state)
        return false;

    if (tsk->state == GRANTED) {
        tsk->rseq->flags &= ~GRANTED;
        tsk->state = ENABLED;
        return false;
    }

    if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
        return false;

    if (!(tsk->rseq->flags & REQUEST))
        return false;

    tsk->rseq->flags &= ~REQUEST;
    tsk->rseq->flags |= GRANTED;
    tsk->state = GRANTED;
    return true;

and sched_yield() does:

    if (tsk->state == GRANTED) {
        tsk->rseq->flags &= ~GRANTED;
        set_need_resched();
    }

This obviously needs some sanity checks, whether user space violated the
contract, but they are cheap because the operations on the user space
flags are RMW and the value is already loaded into a register.
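
As a self-contained illustration (not code from this mail), the user-space
side of that contract could be written as below. The local_* helpers are
assumed to be the CPU-local atomic bit operations described above (x86
implementations appear in the selftest further down, and a portable
builtin-based variant is sketched at the end of the thread); the bit
numbers and critical_section() are assumptions matching the combo patch
later in this thread.

  #include <sched.h>
  #include <stdbool.h>

  /* Bit numbers as defined in the combo patch below (uapi rseq flags). */
  #define RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT	3
  #define RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT	4

  /* CPU-local atomic helpers; see the requirements discussed above. */
  bool local_test_and_clear_bit(unsigned int bit, volatile unsigned int *addr);
  void local_set_bit(unsigned int bit, volatile unsigned int *addr);
  bool local_test_bit(unsigned int bit, const volatile unsigned int *addr);

  /* Hypothetical application work protected by the extension. */
  extern void critical_section(void);

  void run_critical_section(volatile unsigned int *rseq_flags)
  {
  	local_set_bit(RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, rseq_flags);

  	critical_section();

  	/* REQUEST still set: the kernel never consumed it, just continue. */
  	if (local_test_and_clear_bit(RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, rseq_flags))
  		return;

  	/*
  	 * REQUEST was consumed by the kernel. If the grant is still active,
  	 * relinquish the CPU; if GRANTED is clear as well, the grant was
  	 * already revoked and nothing is left to do.
  	 */
  	if (local_test_bit(RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT, rseq_flags))
  		sched_yield();
  }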
Now taking the rseq event_mask into account the NEED_RESCHED side needs additionally: if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL) return false; + if (tsk->rseq_event_mask) + return false; if (tsk->rseq->flags & REQUEST) return false; Because if that is set then the task was preempted, migrated, had a signal delivered and then the slice extension is moot. The sched_yield() abuse wants some sanity checking. The simplest way to achieve that is to create SYSCALL_WORK for it. When granted: set_task_syscall_work(t, SYSCALL_RSEQ_SLICE); On reset clear_task_syscall_work(t, SYSCALL_RSEQ_SLICE); Plus the corresponding syscall work function, which sets NEED_RESCHED and clears the kernel and user space state. Along with state checks and a check whether syscallnr == sched_yield. If not, kill the beast. You _CANNOT_ rely on user space behaving correctly, you need to enforce it and inconsistent state is not an option. Look how strict the RSEQ critical section code or the futex code is about that. There is no room for assumptions. It's still required to hook into __rseq_handle_notify_resume() to reset the grant when event_mask is not empty. This handles the case where the task is scheduled out in exit work e.g. through a cond_resched() or blocking on a mutex. Then the subsequent iteration in the loop won't have NEED_RESCHED set, but the GRANTED state is still there and makes no sense anymore. In that case TIF_NOTIFY_RESUME is set, which ends up calling into the rseq code. TBH, my interest to stare at yet another variant of undocumented hacks, which inflict pointless overhead into the hotpaths, is very close to zero. As I've already analyzed this in depth, I sat down for half a day and converted the analysis into code. See combo patch below. I still need to address a few details and write change logs for the 17 patches, which introduce this gradually and in a reviewable way. I'll send that out in the next days. What's interesting is that the selftest does not expose a big advantage vs. the rescheduled case. # Success 1491820 # Yielded 123792 # Raced 0 # Scheduled 27 but that might depend on the actual scheduling interference pattern. The Success number might be misleading as the kernel might still have rescheduled without touching the user space bits, but enforcing an update for that is just extra pointless overhead. I wasn't able to trigger the sched_yield() race yet, but that's obviously a question of interrupt and scheduling patterns as well. Thanks, tglx --- --- a/include/linux/irq-entry-common.h +++ b/include/linux/irq-entry-common.h @@ -2,11 +2,12 @@ #ifndef __LINUX_IRQENTRYCOMMON_H #define __LINUX_IRQENTRYCOMMON_H +#include <linux/context_tracking.h> +#include <linux/kmsan.h> +#include <linux/rseq.h> #include <linux/static_call_types.h> #include <linux/syscalls.h> -#include <linux/context_tracking.h> #include <linux/tick.h> -#include <linux/kmsan.h> #include <linux/unwind_deferred.h> #include <asm/entry-common.h> @@ -67,6 +68,7 @@ static __always_inline bool arch_in_rcu_ /** * enter_from_user_mode - Establish state when coming from user mode + * @regs: Pointer to currents pt_regs * * Syscall/interrupt entry disables interrupts, but user mode is traced as * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. 
@@ -195,15 +197,13 @@ static __always_inline void arch_exit_to */ void arch_do_signal_or_restart(struct pt_regs *regs); -/** - * exit_to_user_mode_loop - do any pending work before leaving to user space - */ -unsigned long exit_to_user_mode_loop(struct pt_regs *regs, - unsigned long ti_work); +/* Handle pending TIF work */ +unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work, bool from_irq); /** * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required * @regs: Pointer to pt_regs on entry stack + * @from_irq: Exiting to user space from an interrupt * * 1) check that interrupts are disabled * 2) call tick_nohz_user_enter_prepare() @@ -211,7 +211,7 @@ unsigned long exit_to_user_mode_loop(str * EXIT_TO_USER_MODE_WORK are set * 4) check that interrupts are still disabled */ -static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs) +static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs, bool from_irq) { unsigned long ti_work; @@ -222,16 +222,28 @@ static __always_inline void exit_to_user ti_work = read_thread_flags(); if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) - ti_work = exit_to_user_mode_loop(regs, ti_work); + ti_work = exit_to_user_mode_loop(regs, ti_work, from_irq); arch_exit_to_user_mode_prepare(regs, ti_work); + rseq_exit_to_user_mode(); + /* Ensure that kernel state is sane for a return to userspace */ kmap_assert_nomap(); lockdep_assert_irqs_disabled(); lockdep_sys_exit(); } +static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs) +{ + exit_to_user_mode_prepare(regs, false); +} + +static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs) +{ + exit_to_user_mode_prepare(regs, true); +} + /** * exit_to_user_mode - Fixup state when exiting to user mode * @@ -354,6 +366,7 @@ irqentry_state_t noinstr irqentry_enter( * Conditional reschedule with additional sanity checks. */ void raw_irqentry_exit_cond_resched(void); + #ifdef CONFIG_PREEMPT_DYNAMIC #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL) #define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -4,6 +4,7 @@ #ifdef CONFIG_RSEQ +#include <linux/jump_label.h> #include <linux/preempt.h> #include <linux/sched.h> @@ -61,6 +62,20 @@ static inline void rseq_migrate(struct t rseq_set_notify_resume(t); } +static __always_inline void rseq_slice_extension_timer(void); + +static __always_inline void rseq_exit_to_user_mode(void) +{ + rseq_slice_extension_timer(); + /* + * Clear the event mask so it does not contain stale bits when + * coming back from user space. + */ + current->rseq_event_mask = 0; +} + +static inline void rseq_slice_fork(struct task_struct *t, bool inherit); + /* * If parent process has a registered restartable sequences area, the * child inherits. Unregister rseq for a clone with CLONE_VM set. 
@@ -72,11 +87,13 @@ static inline void rseq_fork(struct task t->rseq_len = 0; t->rseq_sig = 0; t->rseq_event_mask = 0; + rseq_slice_fork(t, false); } else { t->rseq = current->rseq; t->rseq_len = current->rseq_len; t->rseq_sig = current->rseq_sig; t->rseq_event_mask = current->rseq_event_mask; + rseq_slice_fork(t, true); } } @@ -86,46 +103,127 @@ static inline void rseq_execve(struct ta t->rseq_len = 0; t->rseq_sig = 0; t->rseq_event_mask = 0; + rseq_slice_fork(t, false); } -#else +#else /* CONFIG_RSEQ */ +static inline void rseq_set_notify_resume(struct task_struct *t) { } +static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { } +static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { } +static inline void rseq_preempt(struct task_struct *t) { } +static inline void rseq_migrate(struct task_struct *t) { } +static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { } +static inline void rseq_execve(struct task_struct *t) { } +static inline void rseq_exit_to_user_mode(void) { } +#endif /* !CONFIG_RSEQ */ -static inline void rseq_set_notify_resume(struct task_struct *t) -{ -} -static inline void rseq_handle_notify_resume(struct ksignal *ksig, - struct pt_regs *regs) -{ -} -static inline void rseq_signal_deliver(struct ksignal *ksig, - struct pt_regs *regs) +#ifdef CONFIG_RSEQ_SLICE_EXTENSION +/* + * Constants for task::rseq_slice_extension: + * + * ENABLED is set when the task enables it via prctl() + * GRANTED is set when the kernel grants an extension on interrupt return + * to user space. Implies ENABLED + */ +#define RSEQ_SLICE_EXTENSION_ENABLED 0x1 +#define RSEQ_SLICE_EXTENSION_GRANTED 0x2 + +DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key); + +static inline bool rseq_slice_extension_enabled(void) { + return static_branch_likely(&rseq_slice_extension_key); } -static inline void rseq_preempt(struct task_struct *t) + +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3); +bool rseq_syscall_enter_work(long syscall); +void __rseq_slice_extension_timer(void); +bool __rseq_grant_slice_extension(unsigned int slext); + +#ifdef CONFIG_PREEMPT_RT +#define RSEQ_TIF_DENY (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | _TIF_NEED_RESCHED) +#else +#define RSEQ_TIF_DENY (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL) +#endif + +static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work) { + unsigned int slext; + + if (!rseq_slice_extension_enabled()) + return false; + + slext = current->rseq_slice_extension; + if (likely(!slext)) + return false; + + /* + * Two quick check conditions where a grant is not possible: + * 1) Signal is pending, which means the task will return + * to the signal handler and not to the interrupted code + * + * 2) On RT, when NEED_RESCHED is set. RT grants only when + * NEED_RESCHED_LAZY is set. + * + * In both cases __rseq_grant_slice_extension() has to be invoked + * when the extension was already granted to clear it. 
+ */ + if (ti_work & RSEQ_TIF_DENY && !(slext & RSEQ_SLICE_EXTENSION_GRANTED)) + return false; + return __rseq_grant_slice_extension(slext); +} + +static inline bool rseq_slice_extension_resched(void) +{ + if (!rseq_slice_extension_enabled()) + return false; + + if (unlikely(current->rseq_slice_extension & RSEQ_SLICE_EXTENSION_GRANTED)) { + set_tsk_need_resched(current); + return true; + } + return false; } -static inline void rseq_migrate(struct task_struct *t) + +static __always_inline void rseq_slice_extension_timer(void) { + if (!rseq_slice_extension_enabled()) + return; + + if (unlikely(current->rseq_slice_extension & RSEQ_SLICE_EXTENSION_GRANTED)) + __rseq_slice_extension_timer(); } -static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) + +static inline void rseq_slice_fork(struct task_struct *t, bool inherit) { + if (inherit) + t->rseq_slice_extension = current->rseq_slice_extension; + else + t->rseq_slice_extension = 0; } -static inline void rseq_execve(struct task_struct *t) + +static inline void rseq_slice_extension_disable(void) { + current->rseq_slice_extension = 0; } -#endif - -#ifdef CONFIG_DEBUG_RSEQ - -void rseq_syscall(struct pt_regs *regs); - -#else - -static inline void rseq_syscall(struct pt_regs *regs) +#else /* CONFIG_RSEQ_SLICE_EXTENSION */ +static inline bool rseq_slice_extension_enabled(void) { return false; } +static inline bool rseq_slice_extension_resched(void) { return false; } +static inline bool rseq_syscall_enter_work(long syscall) { return false; } +static __always_inline void rseq_slice_extension_timer(void) { } +static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3) { + return -EINVAL; } +static inline void rseq_slice_fork(struct task_struct *t, bool inherit) { } +static inline void rseq_slice_extension_disable(void) { } +#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ -#endif +#ifdef CONFIG_DEBUG_RSEQ +void rseq_debug_syscall_exit(struct pt_regs *regs); +#else /* CONFIG_DEBUG_RSEQ */ +static inline void rseq_debug_syscall_exit(struct pt_regs *regs) { } +#endif /* !CONFIG_DEBUG_RSEQ */ #endif /* _LINUX_RSEQ_H */ --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -8,24 +8,36 @@ * Mathieu Desnoyers <mathieu.desnoyers@efficios.com> */ +#include <linux/prctl.h> +#include <linux/ratelimit.h> +#include <linux/rseq.h> #include <linux/sched.h> -#include <linux/uaccess.h> #include <linux/syscalls.h> -#include <linux/rseq.h> +#include <linux/sysctl.h> #include <linux/types.h> -#include <linux/ratelimit.h> +#include <linux/uaccess.h> + #include <asm/ptrace.h> +#include "sched/hrtick.h" + #define CREATE_TRACE_POINTS #include <trace/events/rseq.h> /* The original rseq structure size (including padding) is 32 bytes. 
*/ #define ORIG_RSEQ_SIZE 32 -#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \ - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \ +#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \ + RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \ RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE) +#ifdef CONFIG_RSEQ_SLICE_EXTENSION +#define RSEQ_CS_VALID_FLAGS (RSEQ_CS_FLAG_SLICE_EXT_REQUEST | \ + RSEQ_CS_FLAG_SLICE_EXT_GRANTED) +#else +#define RSEQ_CS_VALID_FLAGS (0) +#endif + #ifdef CONFIG_DEBUG_RSEQ static struct rseq *rseq_kernel_fields(struct task_struct *t) { @@ -313,12 +325,12 @@ static bool rseq_warn_flags(const char * { u32 test_flags; - if (!flags) + if (!(flags & ~RSEQ_CS_VALID_FLAGS)) return false; test_flags = flags & RSEQ_CS_NO_RESTART_FLAGS; if (test_flags) pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, str); - test_flags = flags & ~RSEQ_CS_NO_RESTART_FLAGS; + test_flags = flags & ~(RSEQ_CS_NO_RESTART_FLAGS | RSEQ_CS_VALID_FLAGS); if (test_flags) pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str); return true; @@ -410,6 +422,8 @@ static int rseq_ip_fixup(struct pt_regs return 0; } +static inline bool rseq_reset_slice_extension(struct task_struct *t); + /* * This resume handler must always be executed between any of: * - preemption, @@ -430,11 +444,16 @@ void __rseq_handle_notify_resume(struct return; /* - * regs is NULL if and only if the caller is in a syscall path. Skip - * fixup and leave rseq_cs as is so that rseq_sycall() will detect and - * kill a misbehaving userspace on debug kernels. + * If invoked from hypervisors or IO-URING @regs is a NULL pointer, + * so fixup cannot be done. If the syscall which led to this + * invocation was invoked inside a critical section, then it can + * only be detected on a debug kernel in rseq_debug_syscall_exit(), + * which will detect and kill a misbehaving userspace. */ if (regs) { + if (!rseq_reset_slice_extension(t)) + goto error; + ret = rseq_ip_fixup(regs); if (unlikely(ret < 0)) goto error; @@ -454,7 +473,7 @@ void __rseq_handle_notify_resume(struct * Terminate the process if a syscall is issued within a restartable * sequence. 
*/ -void rseq_syscall(struct pt_regs *regs) +void rseq_debug_syscall_exit(struct pt_regs *regs) { unsigned long ip = instruction_pointer(regs); struct task_struct *t = current; @@ -490,6 +509,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user ret = rseq_reset_rseq_cpu_node_id(current); if (ret) return ret; + rseq_slice_extension_disable(); current->rseq = NULL; current->rseq_sig = 0; current->rseq_len = 0; @@ -571,3 +591,189 @@ SYSCALL_DEFINE4(rseq, struct rseq __user return 0; } + +#ifdef CONFIG_RSEQ_SLICE_EXTENSION +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key); + +static unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC; + +#ifdef CONFIG_SYSCTL +static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC; +static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC; + +static const struct ctl_table rseq_slice_ext_sysctl[] = { + { + .procname = "rseq_slice_extension_nsec", + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_douintvec_minmax, + .extra1 = (unsigned int *)&rseq_slice_ext_nsecs_min, + .extra2 = (unsigned int *)&rseq_slice_ext_nsecs_max, + }, +}; + +static int __init rseq_sysctl_init(void) +{ + register_sysctl("kernel", rseq_slice_ext_sysctl); + return 0; +} +device_initcall(rseq_sysctl_init); +#endif /* !CONFIG_SYSCTL */ + +static inline bool rseq_clear_slice_granted(struct task_struct *curr, u32 rflags) +{ + /* Check whether user space violated the contract */ + if (rflags & RSEQ_CS_FLAG_SLICE_EXT_REQUEST) + return false; + if (!(rflags & RSEQ_CS_FLAG_SLICE_EXT_GRANTED)) + return false; + + rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_GRANTED; + return !put_user(rflags, &curr->rseq->flags); +} + +static bool __rseq_reset_slice_extension(struct task_struct *curr) +{ + u32 rflags; + + if (get_user(rflags, &curr->rseq->flags)) + return false; + return rseq_clear_slice_granted(curr, rflags); +} + +static inline bool rseq_reset_slice_extension(struct task_struct *curr) +{ + if (!rseq_slice_extension_enabled()) + return true; + + if (likely(!(curr->rseq_slice_extension & RSEQ_SLICE_EXTENSION_GRANTED))) + return true; + if (likely(!curr->rseq_event_mask)) + return true; + + clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE); + curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED; + + return __rseq_reset_slice_extension(curr); +} + +/* + * Invoked from syscall entry if a time slice extension was granted and the + * kernel did not clear it before user space left the critical section. + */ +bool rseq_syscall_enter_work(long syscall) +{ + struct task_struct *curr = current; + unsigned int slext = curr->rseq_slice_extension; + + clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE); + curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED; + + /* + * Kernel internal state inconsistency. SYSCALL_RSEQ_SLICE can only + * be set when state is GRANTED! + */ + if (WARN_ON_ONCE(slext != RSEQ_SLICE_EXTENSION_GRANTED)) + return false; + + set_tsk_need_resched(curr); + + if (unlikely(!__rseq_reset_slice_extension(curr) || syscall != __NR_sched_yield)) + force_sigsegv(0); + + /* Abort syscall to reschedule immediately */ + return true; +} + +bool __rseq_grant_slice_extension(unsigned int slext) +{ + struct task_struct *curr = current; + u32 rflags; + + if (unlikely(get_user(rflags, &curr->rseq->flags))) + goto die; + + /* + * Happens when exit_to_user_mode_loop() loops and has + * TIF_NEED_RESCHED* set again. Clear the grant and schedule. 
+ */ + if (unlikely(slext == RSEQ_SLICE_EXTENSION_GRANTED)) { + curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED; + clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE); + if (!rseq_clear_slice_granted(curr, rflags)) + goto die; + return false; + } + + /* User space set the flag. That's a violation of the contract. */ + if (unlikely(rflags & RSEQ_CS_FLAG_SLICE_EXT_GRANTED)) + goto die; + + /* User space is not interrested. */ + if (likely(!(rflags & RSEQ_CS_FLAG_SLICE_EXT_REQUEST))) + return false; + + /* + * Don't bother if the rseq event mask has bits pending. The task + * was preempted. + */ + if (curr->rseq_event_mask) + return false; + + /* Grant the request and update user space */ + rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_REQUEST; + rflags |= RSEQ_CS_FLAG_SLICE_EXT_GRANTED; + if (unlikely(put_user(rflags, &curr->rseq->flags))) + goto die; + + curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_GRANTED; + set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE); + clear_tsk_need_resched(curr); + return true; +die: + force_sig(SIGSEGV); + return false; +} + +void __rseq_slice_extension_timer(void) +{ + hrtick_extend_timeslice(rseq_slice_ext_nsecs); +} + +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3) +{ + switch (arg2) { + case PR_RSEQ_SLICE_EXTENSION_GET: + if (arg3) + return -EINVAL; + return current->rseq_slice_extension ? PR_RSEQ_SLICE_EXT_ENABLE : 0; + + case PR_RSEQ_SLICE_EXTENSION_SET: + if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE) + return -EINVAL; + if (!rseq_slice_extension_enabled() || !current->rseq) + return -ENOTSUPP; + current->rseq_slice_extension = (arg3 & PR_RSEQ_SLICE_EXT_ENABLE) ? + RSEQ_SLICE_EXTENSION_ENABLED : 0; + return 0; + + default: + return -EINVAL; + } +} + +static int __init rseq_slice_cmdline(char *str) +{ + bool on; + + if (kstrtobool(str, &on)) + return -EINVAL; + + if (!on) + static_branch_disable(&rseq_slice_extension_key); + return 0; +} +__setup("rseq_slice_ext=", rseq_slice_cmdline); +#else /* CONFIG_RSEQ_SLICE_EXTENSION */ +static inline bool rseq_reset_slice_extension(struct task_struct *t) { return true; } +#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -3,11 +3,11 @@ #define __LINUX_ENTRYCOMMON_H #include <linux/irq-entry-common.h> +#include <linux/livepatch.h> #include <linux/ptrace.h> +#include <linux/resume_user_mode.h> #include <linux/seccomp.h> #include <linux/sched.h> -#include <linux/livepatch.h> -#include <linux/resume_user_mode.h> #include <asm/entry-common.h> #include <asm/syscall.h> @@ -36,6 +36,7 @@ SYSCALL_WORK_SYSCALL_EMU | \ SYSCALL_WORK_SYSCALL_AUDIT | \ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ + SYSCALL_WORK_SYSCALL_RSEQ_SLICE | \ ARCH_SYSCALL_WORK_ENTER) #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ SYSCALL_WORK_SYSCALL_TRACE | \ @@ -61,8 +62,7 @@ */ void syscall_enter_from_user_mode_prepare(struct pt_regs *regs); -long syscall_trace_enter(struct pt_regs *regs, long syscall, - unsigned long work); +long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work); /** * syscall_enter_from_user_mode_work - Check and handle work before invoking @@ -162,7 +162,7 @@ static __always_inline void syscall_exit local_irq_enable(); } - rseq_syscall(regs); + rseq_debug_syscall_exit(regs); /* * Do one-time syscall specific work. 
If these work items are @@ -172,7 +172,7 @@ static __always_inline void syscall_exit if (unlikely(work & SYSCALL_WORK_EXIT)) syscall_exit_work(regs, work); local_irq_disable_exit_to_user(); - exit_to_user_mode_prepare(regs); + syscall_exit_to_user_mode_prepare(regs); } /** --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -15,9 +15,10 @@ void __weak arch_do_signal_or_restart(st * exit_to_user_mode_loop - do any pending work before leaving to user space * @regs: Pointer to pt_regs on entry stack * @ti_work: TIF work flags as read by the caller + * @from_irq: Exiting to user space from an interrupt */ -__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs, - unsigned long ti_work) +__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work, + bool from_irq) { /* * Before returning to user space ensure that all pending work @@ -27,8 +28,15 @@ void __weak arch_do_signal_or_restart(st local_irq_enable_exit_to_user(ti_work); - if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) - schedule(); + /* + * FIXME: This should actually take the execution time + * of the rest of the loop into account and refuse + * the extension if there is other work to do. + */ + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) { + if (!from_irq || !rseq_grant_slice_extension(ti_work)) + schedule(); + } if (ti_work & _TIF_UPROBE) uprobe_notify_resume(regs); @@ -70,7 +78,7 @@ noinstr void irqentry_enter_from_user_mo noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) { instrumentation_begin(); - exit_to_user_mode_prepare(regs); + irqentry_exit_to_user_mode_prepare(regs); instrumentation_end(); exit_to_user_mode(); } --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst @@ -21,6 +21,7 @@ System calls ebpf/index ioctl/index mseal + rseq Security-related interfaces =========================== --- /dev/null +++ b/Documentation/userspace-api/rseq.rst @@ -0,0 +1,92 @@ +===================== +Restartable Sequences +===================== + +Restartable Sequences allow to register a per thread userspace memory area +to be used as an ABI between kernel and user-space for three purposes: + + * user-space restartable sequences + + * quick access to read the current CPU number, node ID from user-space + + * scheduler time slice extensions + +Restartable sequences (per-cpu atomics) +--------------------------------------- + +Restartables sequences allow user-space to perform update operations on +per-cpu data without requiring heavy-weight atomic operations. The actual +ABI is unfortunately only available in the code and selftests. + +Quick access to CPU number, node ID +----------------------------------- + +Allows to implement per CPU data efficiently. Documentation is in code and +selftests. :( + +Scheduler time slice extensions +------------------------------- + +This allows a thread to request a time slice extension when it enters a +critical section to avoid contention on a resource when the thread is +scheduled out inside of the critical section. 
+ +The prerequisites for this functionality are: + + * Enabled in Kconfig + + * Enabled at boot time (default is enabled) + + * A rseq user space pointer has been registered for the thread + +The thread has to enable the functionality via prctl(2):: + + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET, + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0); + +If granted the thread can request a time slice extension by setting the +``RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT`` in the rseq::flags field. If the +thread is interrupted and the interrupt results in a reschedule request in +the kernel, then the kernel can grant a time slice extension and return to +user space instead of scheduling out. The kernel indicates the grant by +clearing ``RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT`` and setting +``RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT`` in the rseq::flags field. If there +is a reschedule of the thread after granting the extension, the kernel +clears the granted bit to indicate that to user space. + +If the request bit is still set when the leaving the critical section, user +space can clear it and continue. + +If the granted bit is set, then user space has to invoke sched_yield() when +leaving the critical section to relinquish the CPU. The kernel enforces +this by arming a timer to prevent misbehaving user space from abusing this +mechanism. + +If both the request bit and the granted bit are false when leaving the +critical section, then this indicates that a grant was revoked and no +further action is required by user space. + +The required code flow is as follows:: + + local_set_bit(REQUEST, &rseq->flags); + critical_section(); + if (!local_test_and_clear_bit(REQUEST, &rseq->flags)) { + if (local_test_bit(GRANTED, &rseq->flags)) + sched_yield(); + } + +The local bit operations on the flags, except for local_test_bit() have to +be atomically versus the local CPU to prevent the obvious RMW race versus +an interrupt. On X86 this can be achieved with ORB and BTRL without LOCK +prefix. On architectures, which do not provide lightweight CPU local +atomics this needs to be implemented with regular atomic operations. + +local_test_bit() has no atomicity requirements as there is obviously a +race which cannot be avoided at all:: + + if (local_test_bit(GRANTED)) + -> Interrupt results in schedule and grant revocation + sched_yield(); + +The kernel enforces flag consistency and terminates the thread with SIGSEGV +if it detects a violation. --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1408,7 +1408,10 @@ struct task_struct { * RmW on rseq_event_mask must be performed atomically * with respect to preemption. 
*/ - unsigned long rseq_event_mask; + unsigned long rseq_event_mask; +#ifdef CONFIG_RSEQ_SLICE_EXTENSION + unsigned int rseq_slice_extension; +#endif # ifdef CONFIG_DEBUG_RSEQ /* * This is a place holder to save a copy of the rseq fields for --- a/include/uapi/linux/rseq.h +++ b/include/uapi/linux/rseq.h @@ -26,6 +26,8 @@ enum rseq_cs_flags_bit { RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0, RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1, RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2, + RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT = 3, + RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT = 4, }; enum rseq_cs_flags { @@ -35,6 +37,10 @@ enum rseq_cs_flags { (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT), RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), + RSEQ_CS_FLAG_SLICE_EXT_REQUEST = + (1U << RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT), + RSEQ_CS_FLAG_SLICE_EXT_GRANTED = + (1U << RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT), }; /* --- a/init/Kconfig +++ b/init/Kconfig @@ -1883,6 +1883,18 @@ config RSEQ If unsure, say Y. +config RSEQ_SLICE_EXTENSION + bool "Enable rseq based time slice extension mechanism" + depends on RSEQ && SCHED_HRTICK + help + Allows userspace to request a limited time slice extension when + returning from an interrupt to user space via the RSEQ shared + data ABI. If granted, that allows to complete a critical section, + so that other threads are not stuck on a conflicted resource, + while the task is scheduled out. + + If unsure, say N. + config DEBUG_RSEQ default n bool "Enable debugging of rseq() system call" if EXPERT --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -376,4 +376,14 @@ struct prctl_mm_map { # define PR_FUTEX_HASH_SET_SLOTS 1 # define PR_FUTEX_HASH_GET_SLOTS 2 +/* RSEQ time slice extensions */ +#define PR_RSEQ_SLICE_EXTENSION 79 +# define PR_RSEQ_SLICE_EXTENSION_GET 1 +# define PR_RSEQ_SLICE_EXTENSION_SET 2 +/* + * Bits for RSEQ_SLICE_EXTENSION_GET/SET + * PR_RSEQ_SLICE_EXT_ENABLE: Enable + */ +# define PR_RSEQ_SLICE_EXT_ENABLE 0x01 + #endif /* _LINUX_PRCTL_H */ --- a/kernel/sys.c +++ b/kernel/sys.c @@ -53,6 +53,7 @@ #include <linux/time_namespace.h> #include <linux/binfmts.h> #include <linux/futex.h> +#include <linux/rseq.h> #include <linux/sched.h> #include <linux/sched/autogroup.h> @@ -2805,6 +2806,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi case PR_FUTEX_HASH: error = futex_hash_prctl(arg2, arg3, arg4); break; + case PR_RSEQ_SLICE_EXTENSION: + if (arg4 || arg5) + return -EINVAL; + error = rseq_slice_extension_prctl(arg2, arg3); + break; default: trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5); error = -EINVAL; --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -90,6 +90,7 @@ #include "stats.h" #include "autogroup.h" +#include "hrtick.h" #include "pelt.h" #include "smp.h" @@ -873,6 +874,10 @@ static enum hrtimer_restart hrtick(struc WARN_ON_ONCE(cpu_of(rq) != smp_processor_id()); + // CHECKME: Is this correct? + if (rseq_slice_extension_resched()) + return HRTIMER_NORESTART; + rq_lock(rq, &rf); update_rq_clock(rq); rq->donor->sched_class->task_tick(rq, rq->curr, 1); @@ -902,6 +907,14 @@ static void __hrtick_start(void *arg) rq_unlock(rq, &rf); } +void hrtick_extend_timeslice(ktime_t nsecs) +{ + struct rq *rq = this_rq(); + + guard(rq_lock_irqsave)(rq); + hrtimer_start(&rq->hrtick_timer, nsecs, HRTIMER_MODE_REL_PINNED_HARD); +} + /* * Called to set the hrtick timer state. 
* --- /dev/null +++ b/kernel/sched/hrtick.h @@ -0,0 +1,11 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _KERNEL_SCHED_HRTICK_H +#define _KERNEL_SCHED_HRTICK_H + +/* + * Scheduler internal method to support time slice extensions, + * shared with rseq. + */ +void hrtick_extend_timeslice(ktime_t nsecs); + +#endif /* _KERNEL_SCHED_HRTICK_H */ --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -46,15 +46,17 @@ enum syscall_work_bit { SYSCALL_WORK_BIT_SYSCALL_AUDIT, SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH, SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP, + SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE, }; -#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP) -#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT) -#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE) -#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU) -#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT) -#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH) -#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP) +#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP) +#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT) +#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE) +#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU) +#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT) +#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH) +#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP) +#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE) #endif #include <asm/thread_info.h> --- a/kernel/entry/syscall-common.c +++ b/kernel/entry/syscall-common.c @@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s } } -long syscall_trace_enter(struct pt_regs *regs, long syscall, - unsigned long work) +long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work) { long ret = 0; @@ -32,6 +31,16 @@ long syscall_trace_enter(struct pt_regs return -1L; } + /* + * User space got a time slice extension granted and relinquishes + * the CPU. Abort the syscall right away. If it's not sched_yield() + * rseq_syscall_enter_work() sends a SIGSEGV. 
+ */ + if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE) { + if (rseq_syscall_enter_work(syscall)) + return -1L; + } + /* Handle ptrace */ if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) { ret = ptrace_report_syscall_entry(regs); --- a/tools/testing/selftests/rseq/.gitignore +++ b/tools/testing/selftests/rseq/.gitignore @@ -10,3 +10,4 @@ param_test_mm_cid param_test_mm_cid_benchmark param_test_mm_cid_compare_twice syscall_errors_test +slice_test --- a/tools/testing/selftests/rseq/Makefile +++ b/tools/testing/selftests/rseq/Makefile @@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \ param_test_benchmark param_test_compare_twice param_test_mm_cid \ param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \ - syscall_errors_test + syscall_errors_test slice_test TEST_GEN_PROGS_EXTENDED = librseq.so @@ -59,3 +59,6 @@ include ../lib.mk $(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \ rseq.h rseq-*.h $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@ + +$(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h + $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@ --- /dev/null +++ b/tools/testing/selftests/rseq/slice_test.c @@ -0,0 +1,205 @@ +// SPDX-License-Identifier: LGPL-2.1 +#define _GNU_SOURCE +#include <assert.h> +#include <pthread.h> +#include <sched.h> +#include <signal.h> +#include <stdio.h> +#include <string.h> + +#include <linux/prctl.h> +#include <sys/prctl.h> +#include <sys/time.h> + +#include "rseq.h" + +#include "../kselftest_harness.h" + +#define BITS_PER_INT 32 +#define BITS_PER_BYTE 8 + +#ifndef PR_RSEQ_SLICE_EXTENSION +# define PR_RSEQ_SLICE_EXTENSION 79 +# define PR_RSEQ_SLICE_EXTENSION_GET 1 +# define PR_RSEQ_SLICE_EXTENSION_SET 2 +# define PR_RSEQ_SLICE_EXT_ENABLE 0x01 +#endif + +#ifndef RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT +# define RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT 3 +# define RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT 4 +#endif + +#ifndef asm_inline +# define asm_inline asm __inline +#endif + +#if defined(__x86_64__) || defined(__i386__) + +static __always_inline bool local_test_and_clear_bit(unsigned int bit, + volatile unsigned int *addr) +{ + bool res; + + asm inline volatile("btrl %[__bit], %[__addr]\n" + : [__addr] "+m" (*addr), "=@cc" "c" (res) + : [__bit] "Ir" (bit) + : "memory"); + return res; +} + +static __always_inline void local_set_bit(unsigned int bit, volatile unsigned int *addr) +{ + volatile char *caddr = (void *)(addr) + (bit / BITS_PER_BYTE); + + asm inline volatile("orb %b[__bit],%[__addr]\n" + : [__addr] "+m" (*caddr) + : [__bit] "iq" (1U << (bit & (BITS_PER_BYTE - 1))) + : "memory"); +} + +static __always_inline bool local_test_bit(unsigned int bit, const volatile unsigned int *addr) +{ + return !!(addr[bit / BITS_PER_INT] & ((1U << (bit & (BITS_PER_INT - 1))))); +} + +#else +# error unsupported target +#endif + +#define NSEC_PER_SEC 1000000000L +#define NSEC_PER_USEC 1000L + +struct noise_params { + int noise_nsecs; + int sleep_nsecs; + int run; +}; + +FIXTURE(slice_ext) +{ + pthread_t noise_thread; + struct noise_params noise_params; +}; + +FIXTURE_VARIANT(slice_ext) +{ + int64_t total_nsecs; + int slice_nsecs; + int noise_nsecs; + int sleep_nsecs; +}; + +FIXTURE_VARIANT_ADD(slice_ext, n2_2_50) +{ + .total_nsecs = 5 * NSEC_PER_SEC, + .slice_nsecs = 2 * NSEC_PER_USEC, + .noise_nsecs = 2 * NSEC_PER_USEC, + .sleep_nsecs = 50 * NSEC_PER_USEC, +}; + +static inline bool elapsed(struct timespec *start, struct timespec *now, + int64_t 
span) +{ + int64_t delta = now->tv_sec - start->tv_sec; + + delta *= NSEC_PER_SEC; + delta += now->tv_nsec - start->tv_nsec; + return delta >= span; +} + +static void *noise_thread(void *arg) +{ + struct noise_params *p = arg; + + while (RSEQ_READ_ONCE(p->run)) { + struct timespec ts_start, ts_now; + + clock_gettime(CLOCK_MONOTONIC, &ts_start); + do { + clock_gettime(CLOCK_MONOTONIC, &ts_now); + } while (!elapsed(&ts_start, &ts_now, p->noise_nsecs)); + + ts_start.tv_sec = 0; + ts_start.tv_nsec = p->sleep_nsecs; + clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL); + } + return NULL; +} + +FIXTURE_SETUP(slice_ext) +{ + cpu_set_t affinity; + + ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0); + + /* Pin it on a single CPU. Avoid CPU 0 */ + for (int i = 1; i < CPU_SETSIZE; i++) { + if (!CPU_ISSET(i, &affinity)) + continue; + + CPU_ZERO(&affinity); + CPU_SET(i, &affinity); + ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0); + break; + } + + ASSERT_EQ(rseq_register_current_thread(), 0); + + ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET, + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0); + + self->noise_params.noise_nsecs = variant->noise_nsecs; + self->noise_params.sleep_nsecs = variant->sleep_nsecs; + self->noise_params.run = 1; + + ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0); +} + +FIXTURE_TEARDOWN(slice_ext) +{ + self->noise_params.run = 0; + pthread_join(self->noise_thread, NULL); +} + +TEST_F(slice_ext, slice_test) +{ + unsigned long success = 0, yielded = 0, raced = 0, scheduled = 0; + struct rseq_abi *rs = rseq_get_abi(); + struct timespec ts_start, ts_now; + + ASSERT_NE(rs, NULL); + + clock_gettime(CLOCK_MONOTONIC, &ts_start); + do { + struct timespec ts_cs; + + clock_gettime(CLOCK_MONOTONIC, &ts_cs); + + local_set_bit(RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, &rs->flags); + do { + clock_gettime(CLOCK_MONOTONIC, &ts_now); + } while (!elapsed(&ts_cs, &ts_now, variant->noise_nsecs)); + + if (!local_test_and_clear_bit(RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, &rs->flags)) { + if (local_test_bit(RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT, &rs->flags)) { + yielded++; + if (!sched_yield()) + raced++; + } else { + scheduled++; + } + } else { + success++; + } + + clock_gettime(CLOCK_MONOTONIC, &ts_now); + } while (!elapsed(&ts_start, &ts_now, variant->total_nsecs)); + + printf("# Success %12ld\n", success); + printf("# Yielded %12ld\n", yielded); + printf("# Raced %12ld\n", raced); + printf("# Scheduled %12ld\n", scheduled); +} + +TEST_HARNESS_MAIN
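
As a usage note, enabling the feature per thread with the prctl() interface
from the combo patch above could look like the following snippet. It assumes
the calling thread has already registered its rseq area (otherwise the prctl
fails with ENOTSUPP), and that the kernel was built with
CONFIG_RSEQ_SLICE_EXTENSION and not booted with the feature disabled via the
"rseq_slice_ext=" parameter. The helper name is illustrative.

  #include <stdio.h>
  #include <sys/prctl.h>

  /* Values from the combo patch above; system headers may not ship them yet. */
  #ifndef PR_RSEQ_SLICE_EXTENSION
  # define PR_RSEQ_SLICE_EXTENSION	79
  # define PR_RSEQ_SLICE_EXTENSION_SET	2
  # define PR_RSEQ_SLICE_EXT_ENABLE	0x01
  #endif

  /* Enable slice extensions for the calling thread; returns 0 on success. */
  static int enable_slice_extension(void)
  {
  	if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
  		  PR_RSEQ_SLICE_EXT_ENABLE, 0, 0)) {
  		perror("PR_RSEQ_SLICE_EXTENSION");
  		return -1;
  	}
  	return 0;
  }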
> On Aug 11, 2025, at 2:45 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Aug 07 2025 at 16:15, Prakash Sangappa wrote:
>>> On Aug 7, 2025, at 6:06 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> rseq_preempt() sets RSEQ_EVENT_PREEMPT_BIT and TIF_NOTIFY_RESUME, which
>>> causes exit to userspace to invoke __rseq_handle_notify_resume(). That's
>>> the obvious place to handle this instead of inflicting it into the
>>> scheduler hotpath.
>>>
>>> No?
>>
>> Sure, I will look at moving the rseq_delay_resched_clear() call to
>> __rseq_handle_notify_resume().
>
> I looked deeper into it and it does not completely solve the problem.

Thanks for taking a deeper look.

> +bool rseq_syscall_enter_work(long syscall)
> +{
> +	struct task_struct *curr = current;
> +	unsigned int slext = curr->rseq_slice_extension;
> +
> +	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
> +
> +	/*
> +	 * Kernel internal state inconsistency. SYSCALL_RSEQ_SLICE can only
> +	 * be set when state is GRANTED!
> +	 */
> +	if (WARN_ON_ONCE(slext != RSEQ_SLICE_EXTENSION_GRANTED))
> +		return false;
> +
> +	set_tsk_need_resched(curr);
> +
> +	if (unlikely(!__rseq_reset_slice_extension(curr) || syscall != __NR_sched_yield))
> +		force_sigsegv(0);
> +
> +	/* Abort syscall to reschedule immediately */
> +	return true;
> +}

Is it ok to fail the sched_yield(2) syscall? The man page says
sched_yield(2) always succeeds (returns 0).

Also, is it necessary to force kill the process here with SIGSEGV if some
other system call was made? Ideally the process should not be making any
system call while it is in the critical section and using a time slice
extension, other than sched_yield(2) to relinquish the CPU. However, an
application process could have a signal handler that gets invoked while
in the critical section, which can potentially make some system call that
is not sched_yield(2).

Thanks,
-Prakash
On Thu, Aug 14 2025 at 07:18, Prakash Sangappa wrote:
>> On Aug 11, 2025, at 2:45 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Is it ok to fail the sched_yield(2) syscall? The man page says
> sched_yield(2) always succeeds (returns 0).

I used it because it's simple. In practice we need a new syscall w/o side
effects.

> Also, is it necessary to force kill the process here with SIGSEGV if
> some other system call was made?

Yes, because we do not trust user space and any violation of the contract
has consequences. Any kernel facility which interacts in such a way with
user space has to be defensive by design. Assuming that user space is
neither stupid nor malicious is naive at best and has been a source of
big gaping holes forever.

> Ideally the process should not be making any system call while it is in
> the critical section and using a time slice extension, other than
> sched_yield(2) to relinquish the CPU. However, an application process
> could have a signal handler that gets invoked while in the critical
> section, which can potentially make some system call that is not
> sched_yield(2).

The timeslice extension is canceled when a signal is pending, so nothing
bad happens. The kernel already revoked it, similar to how rseq aborts the
critical section on signal delivery.

If it doesn't work with the POC, that may be. With the stuff I'm polishing
now it works because I tested it :)

Thanks,

        tglx
I spent some time on the review. I tried to test it but for some reason
userland always segfaults. This is not subject to your changes because
param_test (from tools/testing/selftests/rseq) also segfaults. Also on a
Debian v6.12. So this must be something else and maybe glibc related.

On 2025-08-11 11:45:11 [+0200], Thomas Gleixner wrote:
…
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -4,6 +4,7 @@
>
>  #ifdef CONFIG_RSEQ
>
> +#include <linux/jump_label.h>
>  #include <linux/preempt.h>
>  #include <linux/sched.h>
>
> @@ -61,6 +62,20 @@ static inline void rseq_migrate(struct t
>  	rseq_set_notify_resume(t);
>  }
>
> +static __always_inline void rseq_slice_extension_timer(void);
> +
> +static __always_inline void rseq_exit_to_user_mode(void)

Is this __always_inline required?

> +{
> +	rseq_slice_extension_timer();
> +	/*
> +	 * Clear the event mask so it does not contain stale bits when
> +	 * coming back from user space.
> +	 */
> +	current->rseq_event_mask = 0;
> +}
> +
> +static inline void rseq_slice_fork(struct task_struct *t, bool inherit);
> +
>  /*
>   * If parent process has a registered restartable sequences area, the
>   * child inherits. Unregister rseq for a clone with CLONE_VM set.
> @@ -86,46 +103,127 @@ static inline void rseq_execve(struct ta
…
> -#else
> -
> -static inline void rseq_syscall(struct pt_regs *regs)
> +#else /* CONFIG_RSEQ_SLICE_EXTENSION */
> +static inline bool rseq_slice_extension_enabled(void) { return false; }
> +static inline bool rseq_slice_extension_resched(void) { return false; }
> +static inline bool rseq_syscall_enter_work(long syscall) { return false; }
> +static __always_inline void rseq_slice_extension_timer(void) { }

Why is this one so special and needs __always_inline while the others
are fine with inline?

> +static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
>  {
> +	return -EINVAL;
>  }
…
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -571,3 +591,189 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
…
> +static bool __rseq_reset_slice_extension(struct task_struct *curr)
> +{
> +	u32 rflags;
> +
> +	if (get_user(rflags, &curr->rseq->flags))
> +		return false;
> +	return rseq_clear_slice_granted(curr, rflags);
> +}
> +
> +static inline bool rseq_reset_slice_extension(struct task_struct *curr)
> +{
> +	if (!rseq_slice_extension_enabled())
> +		return true;
> +
> +	if (likely(!(curr->rseq_slice_extension & RSEQ_SLICE_EXTENSION_GRANTED)))
> +		return true;

We shouldn't get preempted because this would require an interrupt. But
we could receive a signal which would bring us here, right?
If an extension was not granted but userland enabled it by setting
RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, shouldn't we clear
RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, indicating that we scheduled?
Or do we keep things as they are because the signal handler is subject to
the same kind of extensions? The signal handler has a list of functions
which are signal safe and that might end up in a syscall.

> +	if (likely(!curr->rseq_event_mask))
> +		return true;

Why don't you need to clear SYSCALL_RSEQ_SLICE if !rseq_event_mask ?

> +
> +	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
> +
> +	return __rseq_reset_slice_extension(curr);
> +}
> +
> +/*
> + * Invoked from syscall entry if a time slice extension was granted and the
> + * kernel did not clear it before user space left the critical section.
> + */
> +bool rseq_syscall_enter_work(long syscall)
> +{
> +	struct task_struct *curr = current;
> +	unsigned int slext = curr->rseq_slice_extension;
> +
> +	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
> +
> +	/*
> +	 * Kernel internal state inconsistency. SYSCALL_RSEQ_SLICE can only
> +	 * be set when state is GRANTED!
> +	 */
> +	if (WARN_ON_ONCE(slext != RSEQ_SLICE_EXTENSION_GRANTED))
> +		return false;
> +
> +	set_tsk_need_resched(curr);
> +
> +	if (unlikely(!__rseq_reset_slice_extension(curr) || syscall != __NR_sched_yield))
> +		force_sigsegv(0);
> +
> +	/* Abort syscall to reschedule immediately */

If the syscall is the sched_yield() as expected then you still abort it.
You avoid the "scheduling" request from do_sched_yield() (and everything
the syscall does) and perform your schedule request due to the
NEED_RESCHED flag above in exit_to_user_mode_loop().
This explains why sched_yield(2) returns a return code != 0 even though
the man page and the kernel function say it always returns 0. errno will
be set in userland and the syscall tracer will bypass sched_yield in its
trace.

> +	return true;
> +}
> +
> +bool __rseq_grant_slice_extension(unsigned int slext)
> +{
> +	struct task_struct *curr = current;
> +	u32 rflags;
> +
> +	if (unlikely(get_user(rflags, &curr->rseq->flags)))
> +		goto die;
> +
> +	/*
> +	 * Happens when exit_to_user_mode_loop() loops and has
> +	 * TIF_NEED_RESCHED* set again. Clear the grant and schedule.
> +	 */

Not only that. Also if userland does not finish its critical section
before a subsequent scheduling request happens.

> +	if (unlikely(slext == RSEQ_SLICE_EXTENSION_GRANTED)) {
> +		curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
> +		clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> +		if (!rseq_clear_slice_granted(curr, rflags))
> +			goto die;
> +		return false;
> +	}
> +
> +	/* User space set the flag. That's a violation of the contract. */
> +	if (unlikely(rflags & RSEQ_CS_FLAG_SLICE_EXT_GRANTED))
> +		goto die;
> +
> +	/* User space is not interrested. */
> +	if (likely(!(rflags & RSEQ_CS_FLAG_SLICE_EXT_REQUEST)))
> +		return false;
> +
> +	/*
> +	 * Don't bother if the rseq event mask has bits pending. The task
> +	 * was preempted.
> +	 */
> +	if (curr->rseq_event_mask)
> +		return false;
> +
> +	/* Grant the request and update user space */
> +	rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_REQUEST;
> +	rflags |= RSEQ_CS_FLAG_SLICE_EXT_GRANTED;
> +	if (unlikely(put_user(rflags, &curr->rseq->flags)))
> +		goto die;
> +
> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_GRANTED;
> +	set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> +	clear_tsk_need_resched(curr);

If you keep doing this also for NEED_RESCHED then you should clear the
preemption counter via
clear_preempt_need_resched();

otherwise you could stumble upon a spinlock_t on your way out and visit
the scheduler anyway.

> +	return true;
> +die:
> +	force_sig(SIGSEGV);
> +	return false;
> +}
…
> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,92 @@
…
> +If the request bit is still set when the leaving the critical section, user
> +space can clear it and continue.
> +
> +If the granted bit is set, then user space has to invoke sched_yield() when

sched_yield(2)

> +leaving the critical section to relinquish the CPU. The kernel enforces
> +this by arming a timer to prevent misbehaving user space from abusing this
> +mechanism.
> +

Enforcing is one thing. The documentation should mention that you must
not invoke any syscalls other than sched_yield() after setting
RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT or you get a segfault thrown at you.
Your testcase does clock_gettime(). This works as long as the syscall
can be handled via vDSO.
…
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1883,6 +1883,18 @@ config RSEQ
>
>  	  If unsure, say Y.
>
> +config RSEQ_SLICE_EXTENSION
> +	bool "Enable rseq based time slice extension mechanism"
> +	depends on RSEQ && SCHED_HRTICK
> +	help
> +	  Allows userspace to request a limited time slice extension when

an expanded tab

> +	  returning from an interrupt to user space via the RSEQ shared
> +	  data ABI. If granted, that allows to complete a critical section,
> +	  so that other threads are not stuck on a conflicted resource,
> +	  while the task is scheduled out.
> +
> +	  If unsure, say N.
> +
>  config DEBUG_RSEQ
>  	default n
>  	bool "Enable debugging of rseq() system call" if EXPERT
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -873,6 +874,10 @@ static enum hrtimer_restart hrtick(struc
>
>  	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
>
> +	// CHECKME: Is this correct?
> +	if (rseq_slice_extension_resched())
> +		return HRTIMER_NORESTART;
> +

You shouldn't need to return early with HRTIMER_NORESTART in hrtick().
If the extension is not yet granted then rseq_slice_extension_resched()
returns false and the task_tick() below does the usual thing, setting
RESCHED_LAZY. This will be cleared on return to userland granting an
extension, arming the timer again.
If this fires for the second time then let the sched_class->task_tick do
the usual and set RESCHED_LAZY. Given that we return from IRQ,
exit_to_user_mode_loop() will clear the grant and go to schedule().

>  	rq_lock(rq, &rf);
>  	update_rq_clock(rq);
>  	rq->donor->sched_class->task_tick(rq, rq->curr, 1);
> @@ -902,6 +907,14 @@ static void __hrtick_start(void *arg)
>  	rq_unlock(rq, &rf);
>  }
>
> +void hrtick_extend_timeslice(ktime_t nsecs)
> +{
> +	struct rq *rq = this_rq();
> +
> +	guard(rq_lock_irqsave)(rq);
> +	hrtimer_start(&rq->hrtick_timer, nsecs, HRTIMER_MODE_REL_PINNED_HARD);

You arm the timer after granting an extension. So it ran for some time,
got a scheduling request and now you extend it and keep the timer to
honour it. If the user does yield before the timer fires then schedule()
should clear the timer. I *think* you need to update __schedule() because
it has

| 	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
| 		hrtick_clear(rq);

and HRTICK is disabled by default

| grep -i hrtick --color /sys/kernel/debug/sched/features
| PLACE_LAG … NO_HRTICK NO_HRTICK_DL …

> +}
> +
>  /*
>   * Called to set the hrtick timer state.
>   *
…
> --- /dev/null
> +++ b/tools/testing/selftests/rseq/slice_test.c
> @@ -0,0 +1,205 @@
…
> +#if defined(__x86_64__) || defined(__i386__)
> +
> +static __always_inline bool local_test_and_clear_bit(unsigned int bit,
> +						      volatile unsigned int *addr)
> +{
> +	bool res;
> +
> +	asm inline volatile("btrl %[__bit], %[__addr]\n"
> +			    : [__addr] "+m" (*addr), "=@cc" "c" (res)
> +			    : [__bit] "Ir" (bit)
> +			    : "memory");
> +	return res;
> +}
> +
> +static __always_inline void local_set_bit(unsigned int bit, volatile unsigned int *addr)
> +{
> +	volatile char *caddr = (void *)(addr) + (bit / BITS_PER_BYTE);
> +
> +	asm inline volatile("orb %b[__bit],%[__addr]\n"
> +			    : [__addr] "+m" (*caddr)
> +			    : [__bit] "iq" (1U << (bit & (BITS_PER_BYTE - 1)))
> +			    : "memory");
> +}

gcc has __atomic_fetch_and() and __atomic_fetch_or() provided as
built-ins. There is atomic_fetch_and_explicit() and
atomic_fetch_or_explicit() provided by <stdatomic.h>. Mostly the same
magic.

If you use this like

| static inline int test_and_clear_bit(unsigned long *ptr, unsigned int bit)
| {
| 	return __atomic_fetch_and(ptr, ~(1 << bit), __ATOMIC_RELAXED) & (1 << bit);
| }

gcc will emit btr. Sadly the lock prefix will be there, too. On the
plus side you would have logic for every architecture.

…

Sebastian
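[For reference, the portable variant sketched above, expanded into a small
self-contained program. The unsigned int flavour, the set-side helper and
the demo in main() are illustrative additions, not code from the thread or
from the selftest.]

/* build: gcc -O2 -o bitops_demo bitops_demo.c */
#include <stdio.h>

/*
 * Portable counterparts to the x86-only helpers quoted from slice_test.c,
 * built on the __atomic builtins. On x86 the fetch-and pattern compiles to
 * a LOCK'ed btr, i.e. this trades the missing LOCK prefix for portability.
 */
static inline int test_and_clear_bit(unsigned int *ptr, unsigned int bit)
{
	return __atomic_fetch_and(ptr, ~(1U << bit), __ATOMIC_RELAXED) & (1U << bit);
}

static inline void set_bit(unsigned int *ptr, unsigned int bit)
{
	__atomic_fetch_or(ptr, 1U << bit, __ATOMIC_RELAXED);
}

int main(void)
{
	unsigned int flags = 0;

	set_bit(&flags, 3);
	printf("first clear sees bit 3 set: %d\n", !!test_and_clear_bit(&flags, 3));
	printf("second clear sees bit 3 set: %d\n", !!test_and_clear_bit(&flags, 3));
	return 0;
}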
On Wed, Aug 13 2025 at 18:19, bigeasy@linutronix.de wrote:
> I spent some time on the review. I tried to test it but for some reason
> userland always segfaults. This is not subject to your changes because
> param_test (from tools/testing/selftests/rseq) also segfaults. Also on a
> Debian v6.12. So this must be something else and maybe glibc related.

Hrm. I did not run the rseq tests. I only used the test I wrote, but
that works and the underlying glibc uses rseq too, but I might have
screwed up there. As I said it's a POC. I'm about to send out the
polished version, which survives the selftests nicely :)

> On 2025-08-11 11:45:11 [+0200], Thomas Gleixner wrote:
>> +static __always_inline void rseq_slice_extension_timer(void);
>> +
>> +static __always_inline void rseq_exit_to_user_mode(void)
>
> Is this __always_inline required?

To prevent stupid compilers from putting it out of line.

>> +static inline bool rseq_slice_extension_enabled(void) { return false; }
>> +static inline bool rseq_slice_extension_resched(void) { return false; }
>> +static inline bool rseq_syscall_enter_work(long syscall) { return false; }
>> +static __always_inline void rseq_slice_extension_timer(void) { }
>
> Why is this one so special and needs __always_inline while the others
> are fine with inline?

Copy and pasta :)

>> +static inline bool rseq_reset_slice_extension(struct task_struct *curr)
>> +{
>> +	if (!rseq_slice_extension_enabled())
>> +		return true;
>> +
>> +	if (likely(!(curr->rseq_slice_extension & RSEQ_SLICE_EXTENSION_GRANTED)))
>> +		return true;
>
> We shouldn't get preempted because this would require an interrupt. But
> we could receive a signal which would bring us here, right?

Signal or another round through the decision function, when
NEED_RESCHED was raised again.

> If an extension was not granted but userland enabled it by setting
> RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, shouldn't we clear
> RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, indicating that we scheduled?

No. The user space flow is:

	set(REQUEST);
	critical_section();
	if (!test_and_clear_bit(REQUEST)) {
		if (test_bit(GRANTED))
			sched_yield();
	}

This is not meant as a 'we scheduled' indicator, which is useless after
the fact. That's what critical sections are for.

> Or do we keep things as they are because the signal handler is subject to
> the same kind of extensions? The signal handler has a list of functions
> which are signal safe and that might end up in a syscall.
>
>> +	if (likely(!curr->rseq_event_mask))
>> +		return true;
>
> Why don't you need to clear SYSCALL_RSEQ_SLICE if !rseq_event_mask ?

The problem is that rseq_handle_notify_resume() is invoked
unconditionally when TIF_NOTIFY_RESUME is set, which can be set by other
functionalities too. So when nothing happened (no migration, signal,
preemption) then there is no point in revoking it, no?

My rework addresses that:

  https://lore.kernel.org/lkml/20250813155941.014821755@linutronix.de/

That's just preparatory work for this time slice muck :)

>> +
>> +	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
>> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
>> +
>> +	return __rseq_reset_slice_extension(curr);
>> +}
>> +
>> +/*
>> + * Invoked from syscall entry if a time slice extension was granted and the
>> + * kernel did not clear it before user space left the critical section.
>> + */
>> +bool rseq_syscall_enter_work(long syscall)
>> +{
>> +	struct task_struct *curr = current;
>> +	unsigned int slext = curr->rseq_slice_extension;
>> +
>> +	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
>> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
>> +
>> +	/*
>> +	 * Kernel internal state inconsistency. SYSCALL_RSEQ_SLICE can only
>> +	 * be set when state is GRANTED!
>> +	 */
>> +	if (WARN_ON_ONCE(slext != RSEQ_SLICE_EXTENSION_GRANTED))
>> +		return false;
>> +
>> +	set_tsk_need_resched(curr);
>> +
>> +	if (unlikely(!__rseq_reset_slice_extension(curr) || syscall != __NR_sched_yield))
>> +		force_sigsegv(0);
>> +
>> +	/* Abort syscall to reschedule immediately */
>
> If the syscall is the sched_yield() as expected then you still abort it.
> You avoid the "scheduling" request from do_sched_yield() (and everything
> the syscall does) and perform your schedule request due to the
> NEED_RESCHED flag above in exit_to_user_mode_loop().
> This explains why sched_yield(2) returns a return code != 0 even though
> the man page and the kernel function say it always returns 0. errno will
> be set in userland and the syscall tracer will bypass sched_yield in its
> trace.

I took the liberty to optimize it that way. It's also useful to see
whether this raced against the kernel:

	if (test_bit(GRANTED))
		-> Interrupt
		sched_yield();

That race can't be avoided. If the kernel wins, then sched_yield()
returns 0.

We might change that later, but this makes a lot of sense conceptually.
Ideally we have a dedicated mechanism instead of relying on
sched_yield(), but that's bikeshed painting territory.

>> +	return true;
>> +}
>> +
>> +bool __rseq_grant_slice_extension(unsigned int slext)
>> +{
>> +	struct task_struct *curr = current;
>> +	u32 rflags;
>> +
>> +	if (unlikely(get_user(rflags, &curr->rseq->flags)))
>> +		goto die;
>> +
>> +	/*
>> +	 * Happens when exit_to_user_mode_loop() loops and has
>> +	 * TIF_NEED_RESCHED* set again. Clear the grant and schedule.
>> +	 */
>
> Not only that. Also if userland does not finish its critical section
> before a subsequent scheduling request happens.

Correct.

>> +	if (unlikely(slext == RSEQ_SLICE_EXTENSION_GRANTED)) {
>> +		curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
>> +		clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
>> +		if (!rseq_clear_slice_granted(curr, rflags))
>> +			goto die;
>> +		return false;
>> +	}
>> +
>> +	/* User space set the flag. That's a violation of the contract. */
>> +	if (unlikely(rflags & RSEQ_CS_FLAG_SLICE_EXT_GRANTED))
>> +		goto die;
>> +
>> +	/* User space is not interrested. */
>> +	if (likely(!(rflags & RSEQ_CS_FLAG_SLICE_EXT_REQUEST)))
>> +		return false;
>> +
>> +	/*
>> +	 * Don't bother if the rseq event mask has bits pending. The task
>> +	 * was preempted.
>> +	 */
>> +	if (curr->rseq_event_mask)
>> +		return false;
>> +
>> +	/* Grant the request and update user space */
>> +	rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_REQUEST;
>> +	rflags |= RSEQ_CS_FLAG_SLICE_EXT_GRANTED;
>> +	if (unlikely(put_user(rflags, &curr->rseq->flags)))
>> +		goto die;
>> +
>> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_GRANTED;
>> +	set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
>> +	clear_tsk_need_resched(curr);
>
> If you keep doing this also for NEED_RESCHED then you should clear the
> preemption counter via
> clear_preempt_need_resched();
>
> otherwise you could stumble upon a spinlock_t on your way out and visit
> the scheduler anyway.

Hmm. Good point.

> Enforcing is one thing. The documentation should mention that you must
> not invoke any syscalls other than sched_yield() after setting
> RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT or you get a segfault thrown at you.
> Your testcase does clock_gettime(). This works as long as the syscall
> can be handled via vDSO.

Of course :)

>> +	// CHECKME: Is this correct?
>> +	if (rseq_slice_extension_resched())
>> +		return HRTIMER_NORESTART;
>> +
>
> You shouldn't need to return early with HRTIMER_NORESTART in hrtick().
> If the extension is not yet granted then rseq_slice_extension_resched()
> returns false and the task_tick() below does the usual thing, setting
> RESCHED_LAZY. This will be cleared on return to userland granting an
> extension, arming the timer again.
> If this fires for the second time then let the sched_class->task_tick do
> the usual and set RESCHED_LAZY. Given that we return from IRQ,
> exit_to_user_mode_loop() will clear the grant and go to schedule().

No. This is all wrong and I implemented a dedicated timer for this as
the abuse of HRTICK is daft and depending on the scheduler state (HRTICK
can be disabled) this might cause hard-to-diagnose subtle surprises.

>> +void hrtick_extend_timeslice(ktime_t nsecs)
>> +{
>> +	struct rq *rq = this_rq();
>> +
>> +	guard(rq_lock_irqsave)(rq);
>> +	hrtimer_start(&rq->hrtick_timer, nsecs, HRTIMER_MODE_REL_PINNED_HARD);
>
> You arm the timer after granting an extension. So it ran for some time,
> got a scheduling request and now you extend it and keep the timer to
> honour it. If the user does yield before the timer fires then schedule()
> should clear the timer. I *think* you need to update __schedule() because
> it has
>
> | 	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
> | 		hrtick_clear(rq);
>
> and HRTICK is disabled by default
>
> | grep -i hrtick --color /sys/kernel/debug/sched/features
> | PLACE_LAG … NO_HRTICK NO_HRTICK_DL …

See new code :)

> gcc has __atomic_fetch_and() and __atomic_fetch_or() provided as
> built-ins. There is atomic_fetch_and_explicit() and
> atomic_fetch_or_explicit() provided by <stdatomic.h>. Mostly the same
> magic.
>
> If you use this like
>
> | static inline int test_and_clear_bit(unsigned long *ptr, unsigned int bit)
> | {
> | 	return __atomic_fetch_and(ptr, ~(1 << bit), __ATOMIC_RELAXED) & (1 << bit);
> | }
>
> gcc will emit btr. Sadly the lock prefix will be there, too. On the
> plus side you would have logic for every architecture.

I know, but the whole point is to avoid the LOCK prefix because it's not
necessary in this context and slows things down. The only requirement is
CPU-local atomicity vs. an interrupt/exception/NMI or whatever the CPU
uses to mess things up. You need LOCK if you have cross-CPU concurrency,
which is not the case here. The LOCK is very measurable when you use
this pattern with a high frequency and that's what the people who long
for this do :)

Thanks,

tglx
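[To make the user-space flow sketched above concrete, a rough,
self-contained illustration. The bit numbers, the helper names and the use
of C11 atomics are assumptions for this sketch only; the real bit positions
come from the series' uapi header, and the selftest uses LOCK-free local
bit helpers instead.]

#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

/* Assumed bit numbers for the sketch only -- not the real uapi values */
#define SLICE_EXT_REQUEST_BIT	5
#define SLICE_EXT_GRANTED_BIT	6

static int test_and_clear_bit(_Atomic unsigned int *p, unsigned int bit)
{
	return atomic_fetch_and_explicit(p, ~(1U << bit),
					 memory_order_relaxed) & (1U << bit);
}

/*
 * flags would point at the registered struct rseq's flags word in real
 * code; critical_section() is the work that must not be preempted midway.
 */
static void protected_section(_Atomic unsigned int *flags,
			      void (*critical_section)(void))
{
	/* Ask for a slice extension before entering the critical section */
	atomic_fetch_or_explicit(flags, 1U << SLICE_EXT_REQUEST_BIT,
				 memory_order_relaxed);

	critical_section();

	/*
	 * If the request bit is still set, the kernel did not act on it:
	 * clear it and continue. Otherwise check whether it was converted
	 * into a grant, which has to be paid back with sched_yield().
	 */
	if (!test_and_clear_bit(flags, SLICE_EXT_REQUEST_BIT)) {
		if (atomic_load_explicit(flags, memory_order_relaxed) &
		    (1U << SLICE_EXT_GRANTED_BIT))
			sched_yield();
	}
}

static void demo_work(void) { /* placeholder critical section */ }

int main(void)
{
	/* Stand-in for rseq->flags; a real program uses the registered area */
	static _Atomic unsigned int fake_rseq_flags;

	protected_section(&fake_rseq_flags, demo_work);
	puts("done");
	return 0;
}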
On 2025-08-13 18:56:16 [+0200], Thomas Gleixner wrote:
> On Wed, Aug 13 2025 at 18:19, bigeasy@linutronix.de wrote:
> > I spent some time on the review. I tried to test it but for some reason
> > userland always segfaults. This is not subject to your changes because
> > param_test (from tools/testing/selftests/rseq) also segfaults. Also on a
> > Debian v6.12. So this must be something else and maybe glibc related.
>
> Hrm. I did not run the rseq tests. I only used the test I wrote, but
> that works and the underlying glibc uses rseq too, but I might have
> screwed up there. As I said it's a POC. I'm about to send out the
> polished version, which survives the selftests nicely :)

It was not your code. Everything exploded here. Am I right to assume
that you tested in a recent/current Debian Trixie environment? My guess
is that glibc or gcc got out of sync.

> > gcc has __atomic_fetch_and() and __atomic_fetch_or() provided as
> > built-ins. There is atomic_fetch_and_explicit() and
> > atomic_fetch_or_explicit() provided by <stdatomic.h>. Mostly the same
> > magic.
> >
> > If you use this like
> >
> > | static inline int test_and_clear_bit(unsigned long *ptr, unsigned int bit)
> > | {
> > | 	return __atomic_fetch_and(ptr, ~(1 << bit), __ATOMIC_RELAXED) & (1 << bit);
> > | }
> >
> > gcc will emit btr. Sadly the lock prefix will be there, too. On the
> > plus side you would have logic for every architecture.
>
> I know, but the whole point is to avoid the LOCK prefix because it's not
> necessary in this context and slows things down. The only requirement is
> CPU-local atomicity vs. an interrupt/exception/NMI or whatever the CPU
> uses to mess things up. You need LOCK if you have cross-CPU concurrency,
> which is not the case here. The LOCK is very measurable when you use
> this pattern with a high frequency and that's what the people who long
> for this do :)

Sure. You can keep it on x86 and use the generic one in the else case
rather than aborting with an error.

Looking at arch___test_and_clear_bit() in the kernel, there is x86 with
its custom implementation. s390 points to generic___test_and_clear_bit(),
which is a surprise. alpha's and sh's aren't atomic, so this does not
look right. hexagon and m68k might be okay and are candidates.

> Thanks,
>
> tglx

Sebastian
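[One possible shape for that suggestion, as a sketch only: keep the
LOCK-free x86 fast path from the selftest and let every other architecture
fall back to the __atomic builtin instead of failing the build. The
function name mirrors the selftest helper; the fallback pays for its
portability with the atomic RMW discussed above.]

#include <stdbool.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)

/* CPU-local atomicity is enough here, so no LOCK prefix: plain btrl */
static inline bool local_test_and_clear_bit(unsigned int bit,
					    volatile unsigned int *addr)
{
	bool res;

	asm volatile("btrl %[__bit], %[__addr]\n"
		     : [__addr] "+m" (*addr), "=@ccc" (res)
		     : [__bit] "Ir" (bit)
		     : "memory");
	return res;
}

#else

/* Generic fallback: a full atomic RMW, stronger (and slower) than needed */
static inline bool local_test_and_clear_bit(unsigned int bit,
					    volatile unsigned int *addr)
{
	return __atomic_fetch_and(addr, ~(1U << bit),
				  __ATOMIC_RELAXED) & (1U << bit);
}

#endif

int main(void)
{
	volatile unsigned int flags = 1U << 2;

	printf("bit 2 was set: %d\n", local_test_and_clear_bit(2, &flags));
	printf("bit 2 was set: %d\n", local_test_and_clear_bit(2, &flags));
	return 0;
}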
On Mon, Aug 18 2025 at 15:16, bigeasy@linutronix.de wrote:
> On 2025-08-13 18:56:16 [+0200], Thomas Gleixner wrote:
>> On Wed, Aug 13 2025 at 18:19, bigeasy@linutronix.de wrote:
>> > I spent some time on the review. I tried to test it but for some reason
>> > userland always segfaults. This is not subject to your changes because
>> > param_test (from tools/testing/selftests/rseq) also segfaults. Also on a
>> > Debian v6.12. So this must be something else and maybe glibc related.
>>
>> Hrm. I did not run the rseq tests. I only used the test I wrote, but
>> that works and the underlying glibc uses rseq too, but I might have
>> screwed up there. As I said it's a POC. I'm about to send out the
>> polished version, which survives the selftests nicely :)
>
> It was not your code. Everything exploded here. Am I right to assume
> that you tested in a recent/current Debian Trixie environment? My guess
> is that glibc or gcc got out of sync.

https://lore.kernel.org/lkml/aKODByTQMYFs3WVN@google.com :)

>> > gcc has __atomic_fetch_and() and __atomic_fetch_or() provided as
>> > built-ins. There is atomic_fetch_and_explicit() and
>> > atomic_fetch_or_explicit() provided by <stdatomic.h>. Mostly the same
>> > magic.
>> >
>> > If you use this like
>> >
>> > | static inline int test_and_clear_bit(unsigned long *ptr, unsigned int bit)
>> > | {
>> > | 	return __atomic_fetch_and(ptr, ~(1 << bit), __ATOMIC_RELAXED) & (1 << bit);
>> > | }
>> >
>> > gcc will emit btr. Sadly the lock prefix will be there, too. On the
>> > plus side you would have logic for every architecture.
>>
>> I know, but the whole point is to avoid the LOCK prefix because it's not
>> necessary in this context and slows things down. The only requirement is
>> CPU-local atomicity vs. an interrupt/exception/NMI or whatever the CPU
>> uses to mess things up. You need LOCK if you have cross-CPU concurrency,
>> which is not the case here. The LOCK is very measurable when you use
>> this pattern with a high frequency and that's what the people who long
>> for this do :)
>
> Sure. You can keep it on x86 and use the generic one in the else case
> rather than aborting with an error.
> Looking at arch___test_and_clear_bit() in the kernel, there is x86 with
> its custom implementation. s390 points to generic___test_and_clear_bit(),
> which is a surprise. alpha's and sh's aren't atomic, so this does not
> look right. hexagon and m68k might be okay and are candidates.

Right, I'll look into that after I've sorted out the underlying rseq
mess. See the context of the link above. Once that is solved, it will
make the integration of this timeslice muck way simpler (famous last
words).

Thanks,

tglx