From nobody Mon Dec  1 22:02:35 2025
Message-ID: <20251128230241.028742236@linutronix.de>
Date: Mon, 1 Dec 2025 08:06:12 +0100 (CET)
From: Thomas Gleixner
To: LKML
Cc: Mathieu Desnoyers, "Paul E. McKenney", Boqun Feng, Jonathan Corbet,
    Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
    Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
    linux-arch@vger.kernel.org, Randy Dunlap, Peter Zijlstra, Ron Geva,
    Waiman Long
Subject: [patch V5 06/11] rseq: Implement syscall entry work for time slice extensions
References: <20251128225931.959481199@linutronix.de>

McKenney" , Boqun Feng , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org, Randy Dunlap , Peter Zijlstra , Ron Geva , Waiman Long Subject: [patch V5 06/11] rseq: Implement syscall entry work for time slice extensions References: <20251128225931.959481199@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 1 Dec 2025 08:06:12 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice extension. This allows to handle the rseq_slice_yield() syscall, which is used by user space to relinquish the CPU after finishing the critical section for which it requested an extension. In case the kernel state is still GRANTED, the kernel resets both kernel and user space state with a set of sanity checks. If the kernel state is already cleared, then this raced against the timer or some other interrupt and just clears the work bit. Doing it in syscall entry work allows to catch misbehaving user space, which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the critical section. Contrary to the initial strict requirement to use rseq_slice_yield() arbitrary syscalls are not considered a violation of the ABI contract anymore to allow onion architecture applications, which cannot control the code inside a critical section, to utilize this as well. If the code detects inconsistent user space that result in a SIGSEGV for the application. If the grant was still active and the task was not preempted yet, the work code reschedules immediately before continuing through the syscall. Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Mathieu Desnoyers Cc: "Paul E. 
McKenney" Cc: Boqun Feng --- V5: Allow arbitrary syscalls V3: Use get/put_user() --- include/linux/entry-common.h | 2=20 include/linux/rseq.h | 2=20 include/linux/thread_info.h | 16 ++++--- kernel/entry/syscall-common.c | 11 ++++- kernel/rseq.c | 91 +++++++++++++++++++++++++++++++++++++= +++++ 5 files changed, 112 insertions(+), 10 deletions(-) --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -36,8 +36,8 @@ SYSCALL_WORK_SYSCALL_EMU | \ SYSCALL_WORK_SYSCALL_AUDIT | \ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ + SYSCALL_WORK_SYSCALL_RSEQ_SLICE | \ ARCH_SYSCALL_WORK_ENTER) - #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ SYSCALL_WORK_SYSCALL_TRACE | \ SYSCALL_WORK_SYSCALL_AUDIT | \ --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -164,8 +164,10 @@ static inline void rseq_syscall(struct p #endif /* !CONFIG_DEBUG_RSEQ */ =20 #ifdef CONFIG_RSEQ_SLICE_EXTENSION +void rseq_syscall_enter_work(long syscall); int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3); #else /* CONFIG_RSEQ_SLICE_EXTENSION */ +static inline void rseq_syscall_enter_work(long syscall) { } static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned = long arg3) { return -ENOTSUPP; --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -46,15 +46,17 @@ enum syscall_work_bit { SYSCALL_WORK_BIT_SYSCALL_AUDIT, SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH, SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP, + SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE, }; =20 -#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP) -#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE= POINT) -#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE) -#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU) -#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT) -#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_US= ER_DISPATCH) -#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_T= RAP) +#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP) +#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRAC= EPOINT) +#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE) +#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU) +#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT) +#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_US= ER_DISPATCH) +#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_= TRAP) +#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ= _SLICE) #endif =20 #include --- a/kernel/entry/syscall-common.c +++ b/kernel/entry/syscall-common.c @@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s } } =20 -long syscall_trace_enter(struct pt_regs *regs, long syscall, - unsigned long work) +long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long= work) { long ret =3D 0; =20 @@ -32,6 +31,14 @@ long syscall_trace_enter(struct pt_regs return -1L; } =20 + /* + * User space got a time slice extension granted and relinquishes + * the CPU. The work stops the slice timer to avoid an extra round + * through hrtimer_interrupt(). 

--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s
 	}
 }
 
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
-			 unsigned long work)
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
 {
 	long ret = 0;
 
@@ -32,6 +31,14 @@ long syscall_trace_enter(struct pt_regs
 		return -1L;
 	}
 
+	/*
+	 * User space got a time slice extension granted and relinquishes
+	 * the CPU. The work stops the slice timer to avoid an extra round
+	 * through hrtimer_interrupt().
+	 */
+	if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE)
+		rseq_syscall_enter_work(syscall);
+
 	/* Handle ptrace */
 	if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
 		ret = ptrace_report_syscall_entry(regs);

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -502,6 +502,97 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
 DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
 
+static inline void rseq_slice_set_need_resched(struct task_struct *curr)
+{
+	/*
+	 * The interrupt guard is required to prevent inconsistent state in
+	 * this case:
+	 *
+	 * set_tsk_need_resched()
+	 *	--> Interrupt
+	 *	    wakeup()
+	 *		set_tsk_need_resched()
+	 *		set_preempt_need_resched()
+	 *	    schedule_on_return()
+	 *		clear_tsk_need_resched()
+	 *		clear_preempt_need_resched()
+	 * set_preempt_need_resched()	<- Inconsistent state
+	 *
+	 * This is safe vs. a remote set of TIF_NEED_RESCHED because that
+	 * only sets the already set bit and does not create inconsistent
	 * state.
+	 */
+	scoped_guard(irq)
+		set_need_resched_current();
+}
+
+static void rseq_slice_validate_ctrl(u32 expected)
+{
+	u32 __user *sctrl = &current->rseq.usrptr->slice_ctrl.all;
+	u32 uval;
+
+	if (get_user(uval, sctrl) || uval != expected)
+		force_sig(SIGSEGV);
+}
+
+/*
+ * Invoked from syscall entry if a time slice extension was granted and the
+ * kernel did not clear it before user space left the critical section.
+ *
+ * While the recommended way to relinquish the CPU side effect free is
+ * rseq_slice_yield(2), any syscall within a granted slice terminates the
+ * grant and immediately reschedules if required. This supports onion layer
+ * applications, where the code requesting the grant cannot control the
+ * code within the critical section.
+ */
+void rseq_syscall_enter_work(long syscall)
+{
+	struct task_struct *curr = current;
+	struct rseq_slice_ctrl ctrl = { .granted = curr->rseq.slice.state.granted };
+
+	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+
+	if (static_branch_unlikely(&rseq_debug_enabled))
+		rseq_slice_validate_ctrl(ctrl.all);
+
+	/*
+	 * The kernel might have raced, revoked the grant and updated
+	 * userspace, but kept the SLICE work set.
+	 */
+	if (!ctrl.granted)
+		return;
+
+	/*
+	 * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
+	 * kernels. Leaving the scope will reschedule on preemption models
+	 * FULL, LAZY and RT if necessary.
+	 */
+	scoped_guard(preempt) {
+		/*
+		 * Now that preemption is disabled, quickly check whether
+		 * the task was already rescheduled before arriving here.
+		 */
+		if (!curr->rseq.event.sched_switch) {
+			rseq_slice_set_need_resched(curr);
+
+			if (syscall == __NR_rseq_slice_yield) {
+				rseq_stat_inc(rseq_stats.s_yielded);
+				/* Update the yielded state for syscall return */
+				curr->rseq.slice.yielded = 1;
+			} else {
+				rseq_stat_inc(rseq_stats.s_aborted);
+			}
+		}
+	}
+	/* Reschedule on NONE/VOLUNTARY preemption models */
+	cond_resched();
+
+	/* Clear the grant in kernel state and user space */
+	curr->rseq.slice.state.granted = false;
+	if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all))
+		force_sig(SIGSEGV);
+}
+
 int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
 {
 	switch (arg2) {
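
For context, a rough sketch of the granting side, which lives in a
separate patch of this series. Only the syscall work arming, i.e. the
set_task_syscall_work() counterpart to the clear_task_syscall_work()
above, is implied by this patch; the rest of the function body is an
assumption derived from the state that rseq_syscall_enter_work()
consumes:

	/* Hypothetical grant path; the real one is elsewhere in the series */
	static void rseq_slice_grant_sketch(struct task_struct *t)
	{
		/* Mark the grant in kernel state ... */
		t->rseq.slice.state.granted = true;
		/*
		 * ... and arm the syscall entry work so that the next
		 * syscall of this task enters rseq_syscall_enter_work()
		 * and revokes the grant.
		 */
		set_task_syscall_work(t, SYSCALL_RSEQ_SLICE);
	}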