From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084306.022571576@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
 "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
Subject: [patch V6 01/31] rseq: Avoid pointless evaluation in __rseq_notify_resume()
References: <20251027084220.785525188@linutronix.de>
Date: Mon, 27 Oct 2025 09:44:16 +0100 (CET)

From: Thomas Gleixner

The RSEQ critical section mechanism only clears the event mask when a
critical section is registered; otherwise it is stale and collects bits.

That means that once a critical section is installed, the first invocation
of that code when TIF_NOTIFY_RESUME is set will abort the critical
section, even when the TIF bit was not raised by the rseq
preempt/migrate/signal helpers.

This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is utilized by quite a lot of infrastructure.
That means every invocation of __rseq_notify_resume() goes unconditionally
through the heavy lifting of user space access and consistency checks even
if there is no reason to do so.

Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.

Avoid this by reading and clearing the event mask before doing the user
space critical section access with interrupts or preemption disabled,
which ensures that the read and clear operation is CPU-local atomic versus
scheduling and the membarrier IPI. This is correct because, after
re-enabling interrupts/preemption, any relevant event will set the bit
again and raise TIF_NOTIFY_RESUME, which makes the user space exit code
take another round of TIF bit clearing.

If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.

Add an exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.

While at it, reword the convoluted comment explaining why the pt_regs
pointer can be NULL under certain circumstances.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/irq-entry-common.h |  7 ++--
 include/linux/rseq.h             | 10 +++++
 kernel/rseq.c                    | 66 ++++++++++++++++++++++-------------
 3 files changed, 58 insertions(+), 25 deletions(-)
---
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -2,11 +2,12 @@
 #ifndef __LINUX_IRQENTRYCOMMON_H
 #define __LINUX_IRQENTRYCOMMON_H
 
+#include
+#include
+#include
 #include
 #include
-#include
 #include
-#include
 #include
 
 #include
@@ -226,6 +227,8 @@ static __always_inline void exit_to_user
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
+	rseq_exit_to_user_mode();
+
 	/* Ensure that kernel state is sane for a return to userspace */
 	kmap_assert_nomap();
 	lockdep_assert_irqs_disabled();
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -66,6 +66,14 @@ static inline void rseq_migrate(struct t
 	rseq_set_notify_resume(t);
 }
 
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
+		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
+			current->rseq_event_mask = 0;
+	}
+}
+
 /*
  * If parent process has a registered restartable sequences area, the
  * child inherits. Unregister rseq for a clone with CLONE_VM set.
  */
@@ -118,7 +126,7 @@ static inline void rseq_fork(struct task
 static inline void rseq_execve(struct task_struct *t)
 {
 }
-
+static inline void rseq_exit_to_user_mode(void) { }
 #endif
 
 #ifdef CONFIG_DEBUG_RSEQ
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *
 	return true;
 }
 
-static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
+static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
 {
-	u32 flags, event_mask;
+	u32 flags;
 	int ret;
 
 	if (rseq_warn_flags("rseq_cs", cs_flags))
@@ -339,17 +339,7 @@ static int rseq_need_restart(struct task
 
 	if (rseq_warn_flags("rseq", flags))
 		return -EINVAL;
-
-	/*
-	 * Load and clear event mask atomically with respect to
-	 * scheduler preemption and membarrier IPIs.
-	 */
-	scoped_guard(RSEQ_EVENT_GUARD) {
-		event_mask = t->rseq_event_mask;
-		t->rseq_event_mask = 0;
-	}
-
-	return !!event_mask;
+	return 0;
 }
 
 static int clear_rseq_cs(struct rseq __user *rseq)
@@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip,
 	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
 }
 
-static int rseq_ip_fixup(struct pt_regs *regs)
+static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
 {
 	unsigned long ip = instruction_pointer(regs);
 	struct task_struct *t = current;
@@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs
 	 */
 	if (!in_rseq_cs(ip, &rseq_cs))
 		return clear_rseq_cs(t->rseq);
-	ret = rseq_need_restart(t, rseq_cs.flags);
-	if (ret <= 0)
+	ret = rseq_check_flags(t, rseq_cs.flags);
+	if (ret < 0)
 		return ret;
+	if (!abort)
+		return 0;
 	ret = clear_rseq_cs(t->rseq);
 	if (ret)
 		return ret;
@@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct
 		return;
 
 	/*
-	 * regs is NULL if and only if the caller is in a syscall path. Skip
-	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
-	 * kill a misbehaving userspace on debug kernels.
+	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
+	 * pointer, so fixup cannot be done. If the syscall which led to
+	 * this invocation was invoked inside a critical section, then it
+	 * will either end up in this code again or a possible violation of
+	 * a syscall inside a critical region can only be detected by the
+	 * debug code in rseq_syscall() in a debug enabled kernel.
	 */
 	if (regs) {
-		ret = rseq_ip_fixup(regs);
-		if (unlikely(ret < 0))
-			goto error;
+		/*
+		 * Read and clear the event mask first. If the task was not
+		 * preempted or migrated or a signal is on the way, there
+		 * is no point in doing any of the heavy lifting here on
+		 * production kernels. In that case TIF_NOTIFY_RESUME was
+		 * raised by some other functionality.
+		 *
+		 * This is correct because the read/clear operation is
+		 * guarded against scheduler preemption, which makes it CPU
+		 * local atomic. If the task is preempted right after
+		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
+		 * again and this function is invoked another time _before_
+		 * the task is able to return to user mode.
+		 *
+		 * On a debug kernel, invoke the fixup code unconditionally
+		 * with the result handed in to allow the detection of
+		 * inconsistencies.
+		 */
+		u32 event_mask;
+
+		scoped_guard(RSEQ_EVENT_GUARD) {
+			event_mask = t->rseq_event_mask;
+			t->rseq_event_mask = 0;
+		}
+
+		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
+			ret = rseq_ip_fixup(regs, !!event_mask);
+			if (unlikely(ret < 0))
+				goto error;
+		}
 	}
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
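To make the ordering argument above concrete, the following annotated
timeline sketches why an event hitting the window between the clear and
the user space access is never lost. It is distilled from the changelog,
not code from the patch:

	/*
	 * task (exit to user mode)            scheduler / membarrier IPI
	 * ------------------------            --------------------------
	 * scoped_guard(RSEQ_EVENT_GUARD) {
	 *         event_mask = t->rseq_event_mask;
	 *         t->rseq_event_mask = 0;
	 * }
	 * <window opens>               <----  sets an rseq event bit again,
	 *                                     raises TIF_NOTIFY_RESUME
	 * rseq_ip_fixup() / user space access
	 * exit_to_user_mode_loop():
	 *         TIF_NOTIFY_RESUME is still set, so
	 *         __rseq_handle_notify_resume() runs once more before the
	 *         actual return to user space and observes the new event
	 */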
From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084306.085971048@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
 "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
Subject: [patch V6 02/31] rseq: Condense the inline stubs
References: <20251027084220.785525188@linutronix.de>
Date: Mon, 27 Oct 2025 09:44:18 +0100 (CET)

From: Thomas Gleixner

Scrolling over tons of pointless { } lines to find the actual code is
annoying at best.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/rseq.h | 47 ++++++++++++-----------------------------------
 1 file changed, 12 insertions(+), 35 deletions(-)
---
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -101,44 +101,21 @@ static inline void rseq_execve(struct ta
 	t->rseq_event_mask = 0;
 }
 
-#else
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-}
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
-					     struct pt_regs *regs)
-{
-}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
-				       struct pt_regs *regs)
-{
-}
-static inline void rseq_preempt(struct task_struct *t)
-{
-}
-static inline void rseq_migrate(struct task_struct *t)
-{
-}
-static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
-{
-}
-static inline void rseq_execve(struct task_struct *t)
-{
-}
+#else /* CONFIG_RSEQ */
+static inline void rseq_set_notify_resume(struct task_struct *t) { }
+static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_preempt(struct task_struct *t) { }
+static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
+static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
-#endif
+#endif /* !CONFIG_RSEQ */
 
 #ifdef CONFIG_DEBUG_RSEQ
-
 void rseq_syscall(struct pt_regs *regs);
-
-#else
-
-static inline void rseq_syscall(struct pt_regs *regs)
-{
-}
-
-#endif
+#else /* CONFIG_DEBUG_RSEQ */
+static inline void rseq_syscall(struct pt_regs *regs) { }
+#endif /* !CONFIG_DEBUG_RSEQ */
 
 #endif /* _LINUX_RSEQ_H */
From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084306.149519580@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
 "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
Subject: [patch V6 03/31] rseq: Move algorithm comment to top
References: <20251027084220.785525188@linutronix.de>
Date: Mon, 27 Oct 2025 09:44:20 +0100 (CET)

Move the comment which documents the RSEQ algorithm to the top of the
file, so it does not create horrible diffs later when the actual
implementation is fed into the mincer.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 kernel/rseq.c | 119 ++++++++++++++++++++++++++---------------------------
 1 file changed, 59 insertions(+), 60 deletions(-)
---
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -8,6 +8,65 @@
  *  Mathieu Desnoyers
  */
 
+/*
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * It allows user-space to perform update operations on per-cpu data
+ * without requiring heavy-weight atomic operations.
+ *
+ * Detailed algorithm of rseq user-space assembly sequences:
+ *
+ *                     init(rseq_cs)
+ *                     cpu = TLS->rseq::cpu_id_start
+ *   [1]               TLS->rseq::rseq_cs = rseq_cs
+ *   [start_ip]        ----------------------------
+ *   [2]               if (cpu != TLS->rseq::cpu_id)
+ *                             goto abort_ip;
+ *   [3]
+ *   [post_commit_ip]  ----------------------------
+ *
+ * The address of jump target abort_ip must be outside the critical
+ * region, i.e.:
+ *
+ *     [abort_ip] < [start_ip]  || [abort_ip] >= [post_commit_ip]
+ *
+ * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
+ * userspace that can handle being interrupted between any of those
+ * instructions, and then resumed to the abort_ip.
+ *
+ * 1.  Userspace stores the address of the struct rseq_cs assembly
+ *     block descriptor into the rseq_cs field of the registered
+ *     struct rseq TLS area. This update is performed through a single
+ *     store within the inline assembly instruction sequence.
+ *     [start_ip]
+ *
+ * 2.  Userspace tests to check whether the current cpu_id field match
+ *     the cpu number loaded before start_ip, branching to abort_ip
+ *     in case of a mismatch.
+ *
+ *     If the sequence is preempted or interrupted by a signal
+ *     at or after start_ip and before post_commit_ip, then the kernel
+ *     clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
+ *     ip to abort_ip before returning to user-space, so the preempted
+ *     execution resumes at abort_ip.
+ *
+ * 3.  Userspace critical section final instruction before
+ *     post_commit_ip is the commit. The critical section is
+ *     self-terminating.
+ *     [post_commit_ip]
+ *
+ * 4.
+ *
+ * On failure at [2], or if interrupted by preempt or signal delivery
+ * between [1] and [3]:
+ *
+ *     [abort_ip]
+ * F1.
+ */
+
 #include
 #include
 #include
@@ -98,66 +157,6 @@ static int rseq_validate_ro_fields(struc
 	unsafe_put_user(value, &t->rseq->field, error_label)
 #endif
 
-/*
- *
- * Restartable sequences are a lightweight interface that allows
- * user-level code to be executed atomically relative to scheduler
- * preemption and signal delivery. Typically used for implementing
- * per-cpu operations.
- *
- * It allows user-space to perform update operations on per-cpu data
- * without requiring heavy-weight atomic operations.
- *
- * Detailed algorithm of rseq user-space assembly sequences:
- *
- *                     init(rseq_cs)
- *                     cpu = TLS->rseq::cpu_id_start
- *   [1]               TLS->rseq::rseq_cs = rseq_cs
- *   [start_ip]        ----------------------------
- *   [2]               if (cpu != TLS->rseq::cpu_id)
- *                             goto abort_ip;
- *   [3]
- *   [post_commit_ip]  ----------------------------
- *
- * The address of jump target abort_ip must be outside the critical
- * region, i.e.:
- *
- *     [abort_ip] < [start_ip]  || [abort_ip] >= [post_commit_ip]
- *
- * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
- * userspace that can handle being interrupted between any of those
- * instructions, and then resumed to the abort_ip.
- *
- * 1.  Userspace stores the address of the struct rseq_cs assembly
- *     block descriptor into the rseq_cs field of the registered
- *     struct rseq TLS area. This update is performed through a single
- *     store within the inline assembly instruction sequence.
- *     [start_ip]
- *
- * 2.  Userspace tests to check whether the current cpu_id field match
- *     the cpu number loaded before start_ip, branching to abort_ip
- *     in case of a mismatch.
- *
- *     If the sequence is preempted or interrupted by a signal
- *     at or after start_ip and before post_commit_ip, then the kernel
- *     clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
- *     ip to abort_ip before returning to user-space, so the preempted
- *     execution resumes at abort_ip.
- *
- * 3.  Userspace critical section final instruction before
- *     post_commit_ip is the commit. The critical section is
- *     self-terminating.
- *     [post_commit_ip]
- *
- * 4.
- *
- * On failure at [2], or if interrupted by preempt or signal delivery
- * between [1] and [3]:
- *
- *     [abort_ip]
- * F1.
- */
-
 static int rseq_update_cpu_node_id(struct task_struct *t)
 {
 	struct rseq __user *rseq = t->rseq;
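The struct rseq TLS area which the algorithm comment refers to can be
inspected from user space. A minimal sketch, assuming a glibc 2.35 or
later which registers the rseq area itself and exports
__rseq_offset/__rseq_size, and a compiler providing
__builtin_thread_pointer() (GCC >= 11 / Clang >= 13 on x86-64 and arm64);
this is an illustration, not part of the series:

	/* rseq_peek.c - dump the kernel-maintained rseq TLS fields */
	#include <stdio.h>
	#include <sys/rseq.h>	/* struct rseq, __rseq_offset, __rseq_size */

	int main(void)
	{
		if (!__rseq_size) {
			puts("libc did not register an rseq area");
			return 1;
		}

		/* The registered area lives at thread pointer + __rseq_offset */
		struct rseq *rs = (struct rseq *)
			((char *)__builtin_thread_pointer() + __rseq_offset);

		/* cpu_id_start is loaded before start_ip in step [1];
		 * cpu_id is what step [2] compares against.
		 */
		printf("cpu_id_start=%u cpu_id=%u\n",
		       rs->cpu_id_start, rs->cpu_id);
		return 0;
	}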
From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084306.211520245@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
 "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
Subject: [patch V6 04/31] rseq: Remove the ksig argument from rseq_handle_notify_resume()
References: <20251027084220.785525188@linutronix.de>
Date: Mon, 27 Oct 2025 09:44:22 +0100 (CET)

There is no point in this being visible in the resume_to_user_mode()
handling.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/resume_user_mode.h |  2 +-
 include/linux/rseq.h             | 13 +++++++------
 2 files changed, 8 insertions(+), 7 deletions(-)
---
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 
-	rseq_handle_notify_resume(NULL, regs);
+	rseq_handle_notify_resume(regs);
 }
 
 #endif /* LINUX_RESUME_USER_MODE_H */
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -37,19 +37,20 @@ static inline void rseq_set_notify_resum
 
 void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
 
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
-					     struct pt_regs *regs)
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
 	if (current->rseq)
-		__rseq_handle_notify_resume(ksig, regs);
+		__rseq_handle_notify_resume(NULL, regs);
 }
 
 static inline void rseq_signal_deliver(struct ksignal *ksig,
 				       struct pt_regs *regs)
 {
-	scoped_guard(RSEQ_EVENT_GUARD)
-		__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
-	rseq_handle_notify_resume(ksig, regs);
+	if (current->rseq) {
+		scoped_guard(RSEQ_EVENT_GUARD)
+			__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
+		__rseq_handle_notify_resume(ksig, regs);
+	}
 }
 
 /* rseq_preempt() requires preemption to be disabled. */
From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084306.274661227@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
 "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
Subject: [patch V6 05/31] rseq: Simplify registration
References: <20251027084220.785525188@linutronix.de>
Date: Mon, 27 Oct 2025 09:44:24 +0100 (CET)

From: Thomas Gleixner

There is no point in reading the critical section member of the newly
registered user space RSEQ struct first in order to clear it. Just clear
it and be done with it.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 kernel/rseq.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)
---
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -492,11 +492,9 @@ void rseq_syscall(struct pt_regs *regs)
 /*
  * sys_rseq - setup restartable sequences for caller thread.
  */
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
-		int, flags, u32, sig)
+SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
 {
 	int ret;
-	u64 rseq_cs;
 
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
@@ -557,11 +555,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	 * avoid a potential segfault on return to user-space. The proper thing
 	 * to do would have been to fail the registration but this would break
 	 * older libcs that reuse the rseq area for new threads without
-	 * clearing the fields.
+	 * clearing the fields. Don't bother reading it, just reset it.
 	 */
-	if (rseq_get_rseq_cs_ptr_val(rseq, &rseq_cs))
-		return -EFAULT;
-	if (rseq_cs && clear_rseq_cs(rseq))
+	if (put_user(0UL, &rseq->rseq_cs))
 		return -EFAULT;
 
 #ifdef CONFIG_DEBUG_RSEQ
From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084306.336978188@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
 "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
Subject: [patch V6 06/31] rseq: Simplify the event notification
References: <20251027084220.785525188@linutronix.de>
Date: Mon, 27 Oct 2025 09:44:26 +0100 (CET)

From: Thomas Gleixner

Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless and just extra
work in terms of setting them individually.

Aside from that, the only relevant point where an event has to be raised
is context switch. Neither the CPU nor the MM CID can change without going
through a context switch.

Collapse them all into a single boolean, which simplifies the code a lot,
and remove the pointless invocations which have been sprinkled all over
the place for no value.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
V2: Reduce it to the sched switch event.
---
 fs/exec.c                 |  2 -
 include/linux/rseq.h      | 66 +++++++++-------------------------------
 include/linux/sched.h     | 10 +++---
 include/uapi/linux/rseq.h | 21 ++++----------
 kernel/rseq.c             | 28 +++++++++++--------
 kernel/sched/core.c       |  5 ---
 kernel/sched/membarrier.c |  8 ++---
 7 files changed, 48 insertions(+), 92 deletions(-)
---
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp
 		force_fatal_sig(SIGSEGV);
 
 	sched_mm_cid_after_execve(current);
-	rseq_set_notify_resume(current);
+	rseq_sched_switch_event(current);
 	current->in_execve = 0;
 
 	return retval;
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -3,38 +3,8 @@
 #define _LINUX_RSEQ_H
 
 #ifdef CONFIG_RSEQ
-
-#include
 #include
 
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD	irq
-#else
-# define RSEQ_EVENT_GUARD	preempt
-#endif
-
-/*
- * Map the event mask on the user-space ABI enum rseq_cs_flags
- * for direct mask checks.
- */
-enum rseq_event_mask_bits {
-	RSEQ_EVENT_PREEMPT_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
-	RSEQ_EVENT_SIGNAL_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
-	RSEQ_EVENT_MIGRATE_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
-};
-
-enum rseq_event_mask {
-	RSEQ_EVENT_PREEMPT	= (1U << RSEQ_EVENT_PREEMPT_BIT),
-	RSEQ_EVENT_SIGNAL	= (1U << RSEQ_EVENT_SIGNAL_BIT),
-	RSEQ_EVENT_MIGRATE	= (1U << RSEQ_EVENT_MIGRATE_BIT),
-};
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-	if (t->rseq)
-		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
-}
-
 void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -43,35 +13,27 @@ static inline void rseq_handle_notify_re
 	__rseq_handle_notify_resume(NULL, regs);
 }
 
-static inline void rseq_signal_deliver(struct ksignal *ksig,
-				       struct pt_regs *regs)
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
 	if (current->rseq) {
-		scoped_guard(RSEQ_EVENT_GUARD)
-			__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
+		current->rseq_event_pending = true;
 		__rseq_handle_notify_resume(ksig, regs);
 	}
 }
 
-/* rseq_preempt() requires preemption to be disabled. */
-static inline void rseq_preempt(struct task_struct *t)
+static inline void rseq_sched_switch_event(struct task_struct *t)
 {
-	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
-	rseq_set_notify_resume(t);
-}
-
-/* rseq_migrate() requires preemption to be disabled. */
-static inline void rseq_migrate(struct task_struct *t)
-{
-	__set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
-	rseq_set_notify_resume(t);
+	if (t->rseq) {
+		t->rseq_event_pending = true;
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	}
 }
 
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
-		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
-			current->rseq_event_mask = 0;
+		if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
+			current->rseq_event_pending = false;
 	}
 }
 
@@ -85,12 +47,12 @@ static inline void rseq_fork(struct task
 		t->rseq = NULL;
 		t->rseq_len = 0;
 		t->rseq_sig = 0;
-		t->rseq_event_mask = 0;
+		t->rseq_event_pending = false;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
-		t->rseq_event_mask = current->rseq_event_mask;
+		t->rseq_event_pending = current->rseq_event_pending;
 	}
 }
 
@@ -99,15 +61,13 @@ static inline void rseq_execve(struct ta
 	t->rseq = NULL;
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
-	t->rseq_event_mask = 0;
+	t->rseq_event_pending = false;
 }
 
 #else /* CONFIG_RSEQ */
-static inline void rseq_set_notify_resume(struct task_struct *t) { }
 static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
-static inline void rseq_preempt(struct task_struct *t) { }
-static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_sched_switch_event(struct task_struct *t) { }
 static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
 static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1407,14 +1407,14 @@ struct task_struct {
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_RSEQ
-	struct rseq __user *rseq;
-	u32 rseq_len;
-	u32 rseq_sig;
+	struct rseq __user	*rseq;
+	u32			rseq_len;
+	u32			rseq_sig;
 	/*
-	 * RmW on rseq_event_mask must be performed atomically
+	 * RmW on rseq_event_pending must be performed atomically
 	 * with respect to preemption.
 	 */
-	unsigned long rseq_event_mask;
+	bool			rseq_event_pending;
 # ifdef CONFIG_DEBUG_RSEQ
 	/*
 	 * This is a place holder to save a copy of the rseq fields for
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -114,20 +114,13 @@ struct rseq {
 	/*
 	 * Restartable sequences flags field.
 	 *
-	 * This field should only be updated by the thread which
-	 * registered this data structure. Read by the kernel.
-	 * Mainly used for single-stepping through rseq critical sections
-	 * with debuggers.
-	 *
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
-	 *     Inhibit instruction sequence block restart on preemption
-	 *     for this thread.
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
-	 *     Inhibit instruction sequence block restart on signal
-	 *     delivery for this thread.
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
-	 *     Inhibit instruction sequence block restart on migration for
-	 *     this thread.
+	 * This field was initially intended to allow event masking for
+	 * single-stepping through rseq critical sections with debuggers.
+	 * The kernel does not support this anymore and the relevant bits
+	 * are checked for being always false:
+	 *   - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+	 *   - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+	 *   - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
 	 */
 	__u32 flags;
 
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -78,6 +78,12 @@
 #define CREATE_TRACE_POINTS
 #include
 
+#ifdef CONFIG_MEMBARRIER
+# define RSEQ_EVENT_GUARD	irq
+#else
+# define RSEQ_EVENT_GUARD	preempt
+#endif
+
 /* The original rseq structure size (including padding) is 32 bytes. */
 #define ORIG_RSEQ_SIZE		32
 
@@ -430,11 +436,11 @@ void __rseq_handle_notify_resume(struct
 	 */
 	if (regs) {
 		/*
-		 * Read and clear the event mask first. If the task was not
-		 * preempted or migrated or a signal is on the way, there
-		 * is no point in doing any of the heavy lifting here on
-		 * production kernels. In that case TIF_NOTIFY_RESUME was
-		 * raised by some other functionality.
+		 * Read and clear the event pending bit first. If the task
+		 * was not preempted or migrated or a signal is on the way,
+		 * there is no point in doing any of the heavy lifting here
+		 * on production kernels. In that case TIF_NOTIFY_RESUME
+		 * was raised by some other functionality.
 		 *
 		 * This is correct because the read/clear operation is
 		 * guarded against scheduler preemption, which makes it CPU
@@ -447,15 +453,15 @@ void __rseq_handle_notify_resume(struct
 		 * with the result handed in to allow the detection of
 		 * inconsistencies.
 		 */
-		u32 event_mask;
+		bool event;
 
 		scoped_guard(RSEQ_EVENT_GUARD) {
-			event_mask = t->rseq_event_mask;
-			t->rseq_event_mask = 0;
+			event = t->rseq_event_pending;
+			t->rseq_event_pending = false;
 		}
 
-		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
-			ret = rseq_ip_fixup(regs, !!event_mask);
+		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+			ret = rseq_ip_fixup(regs, event);
 			if (unlikely(ret < 0))
 				goto error;
 		}
@@ -584,7 +590,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	 * registered, ensure the cpu_id_start and cpu_id fields
 	 * are updated before returning to user-space.
	 */
-	rseq_set_notify_resume(current);
+	rseq_sched_switch_event(current);
 
 	return 0;
 }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3329,7 +3329,6 @@ void set_task_cpu(struct task_struct *p,
 	if (p->sched_class->migrate_task_rq)
 		p->sched_class->migrate_task_rq(p, new_cpu);
 	p->se.nr_migrations++;
-	rseq_migrate(p);
 	sched_mm_cid_migrate_from(p);
 	perf_event_task_migrate(p);
 }
@@ -4763,7 +4762,6 @@ int sched_cgroup_fork(struct task_struct
 		p->sched_task_group = tg;
 	}
 #endif
-	rseq_migrate(p);
 	/*
 	 * We're setting the CPU for the first time, we don't migrate,
 	 * so use __set_task_cpu().
@@ -4827,7 +4825,6 @@ void wake_up_new_task(struct task_struct
 	 * as we're not fully set-up yet.
 	 */
 	p->recent_used_cpu = task_cpu(p);
-	rseq_migrate(p);
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
 	rq = __task_rq_lock(p, &rf);
 	update_rq_clock(rq);
@@ -5121,7 +5118,7 @@ prepare_task_switch(struct rq *rq, struc
 	kcov_prepare_switch(prev);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
-	rseq_preempt(prev);
+	rseq_sched_switch_event(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -199,7 +199,7 @@ static void ipi_rseq(void *info)
 	 * is negligible.
 	 */
 	smp_mb();
-	rseq_preempt(current);
+	rseq_sched_switch_event(current);
 }
 
 static void ipi_sync_rq_state(void *info)
@@ -407,9 +407,9 @@ static int membarrier_private_expedited(
 	 * membarrier, we will end up with some thread in the mm
 	 * running without a core sync.
 	 *
-	 * For RSEQ, don't rseq_preempt() the caller. User code
-	 * is not supposed to issue syscalls at all from inside an
-	 * rseq critical section.
+	 * For RSEQ, don't invoke rseq_sched_switch_event() on the
+	 * caller. User code is not supposed to issue syscalls at
+	 * all from inside an rseq critical section.
	 */
 	if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
 		preempt_disable();
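As a cross-check for the collapse into a single boolean, this is where
each of the old distinct events ends up; the table is assembled from the
hunks above:

	/*
	 * old producer              call site                   after this patch
	 * ------------              ---------                   ----------------
	 * rseq_preempt()            prepare_task_switch()       rseq_sched_switch_event(prev)
	 * rseq_preempt()            membarrier ipi_rseq()       rseq_sched_switch_event(current)
	 * rseq_migrate()            set_task_cpu(),             dropped entirely: a migration is
	 *                           sched_cgroup_fork(),        only observable by the task after
	 *                           wake_up_new_task()          a context switch, which raises
	 *                                                       the event anyway
	 * RSEQ_EVENT_SIGNAL_BIT     rseq_signal_deliver()       sets rseq_event_pending directly
	 * rseq_set_notify_resume()  bprm_execve(), sys_rseq()   rseq_sched_switch_event(current)
	 */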
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 07/31] rseq, virt: Retrigger RSEQ after vcpu_run() References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:44:28 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Hypervisors invoke resume_user_mode_work() before entering the guest, which clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user space context available to them, so the rseq notify handler skips inspecting the critical section, but updates the CPU/MM CID values unconditionally so that the eventual pending rseq event is not lost on the way to user space. This is a pointless exercise as the task might be rescheduled before actually returning to user space and it creates unnecessary work in the vcpu_run() loops. It's way more efficient to ignore that invocation based on @regs =3D=3D NULL and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the vcpu_run() loop before returning from the ioctl(). This ensures that a pending RSEQ update is not lost and the IDs are updated before returning to user space. Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into a NOOP. Signed-off-by: Thomas Gleixner Acked-by: Sean Christopherson Reviewed-by: Mathieu Desnoyers --- V5: Add a comment that this is temporary - Sean V3: Add the missing rseq.h include for HV - 0-day --- drivers/hv/mshv_root_main.c | 3 + include/linux/rseq.h | 17 +++++++++ kernel/rseq.c | 76 +++++++++++++++++++++++----------------= ----- virt/kvm/kvm_main.c | 7 ++++ 4 files changed, 67 insertions(+), 36 deletions(-) --- a/drivers/hv/mshv_root_main.c +++ b/drivers/hv/mshv_root_main.c @@ -29,6 +29,7 @@ #include #include #include +#include =20 #include "mshv_eventfd.h" #include "mshv.h" @@ -560,6 +561,8 @@ static long mshv_run_vp_with_root_schedu } } while (!vp->run.flags.intercept_suspend); =20 + rseq_virt_userspace_exit(); + return ret; } =20 --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to } =20 /* + * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode, + * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in + * that case just to do it eventually again before returning to user space, + * the entry resume_user_mode_work() invocation is ignored as the register + * argument is NULL. + * + * After returning from guest mode, they have to invoke this function to + * re-raise TIF_NOTIFY_RESUME if necessary. + */ +static inline void rseq_virt_userspace_exit(void) +{ + if (current->rseq_event_pending) + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); +} + +/* * If parent process has a registered restartable sequences area, the * child inherits. Unregister rseq for a clone with CLONE_VM set. 
*/ @@ -68,6 +84,7 @@ static inline void rseq_execve(struct ta static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct = pt_regs *regs) { } static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_reg= s *regs) { } static inline void rseq_sched_switch_event(struct task_struct *t) { } +static inline void rseq_virt_userspace_exit(void) { } static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { } static inline void rseq_execve(struct task_struct *t) { } static inline void rseq_exit_to_user_mode(void) { } --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct { struct task_struct *t =3D current; int ret, sig; + bool event; + + /* + * If invoked from hypervisors before entering the guest via + * resume_user_mode_work(), then @regs is a NULL pointer. + * + * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises + * it before returning from the ioctl() to user space when + * rseq_event.sched_switch is set. + * + * So it's safe to ignore here instead of pointlessly updating it + * in the vcpu_run() loop. + */ + if (!regs) + return; =20 if (unlikely(t->flags & PF_EXITING)) return; =20 /* - * If invoked from hypervisors or IO-URING, then @regs is a NULL - * pointer, so fixup cannot be done. If the syscall which led to - * this invocation was invoked inside a critical section, then it - * will either end up in this code again or a possible violation of - * a syscall inside a critical region can only be detected by the - * debug code in rseq_syscall() in a debug enabled kernel. + * Read and clear the event pending bit first. If the task + * was not preempted or migrated or a signal is on the way, + * there is no point in doing any of the heavy lifting here + * on production kernels. In that case TIF_NOTIFY_RESUME + * was raised by some other functionality. + * + * This is correct because the read/clear operation is + * guarded against scheduler preemption, which makes it CPU + * local atomic. If the task is preempted right after + * re-enabling preemption then TIF_NOTIFY_RESUME is set + * again and this function is invoked another time _before_ + * the task is able to return to user mode. + * + * On a debug kernel, invoke the fixup code unconditionally + * with the result handed in to allow the detection of + * inconsistencies. */ - if (regs) { - /* - * Read and clear the event pending bit first. If the task - * was not preempted or migrated or a signal is on the way, - * there is no point in doing any of the heavy lifting here - * on production kernels. In that case TIF_NOTIFY_RESUME - * was raised by some other functionality. - * - * This is correct because the read/clear operation is - * guarded against scheduler preemption, which makes it CPU - * local atomic. If the task is preempted right after - * re-enabling preemption then TIF_NOTIFY_RESUME is set - * again and this function is invoked another time _before_ - * the task is able to return to user mode. - * - * On a debug kernel, invoke the fixup code unconditionally - * with the result handed in to allow the detection of - * inconsistencies. 
- */ - bool event; - - scoped_guard(RSEQ_EVENT_GUARD) { - event =3D t->rseq_event_pending; - t->rseq_event_pending =3D false; - } + scoped_guard(RSEQ_EVENT_GUARD) { + event =3D t->rseq_event_pending; + t->rseq_event_pending =3D false; + } =20 - if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) { - ret =3D rseq_ip_fixup(regs, event); - if (unlikely(ret < 0)) - goto error; - } + if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) { + ret =3D rseq_ip_fixup(regs, event); + if (unlikely(ret < 0)) + goto error; } + if (unlikely(rseq_update_cpu_node_id(t))) goto error; return; --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -49,6 +49,7 @@ #include #include #include +#include =20 #include #include @@ -4476,6 +4477,12 @@ static long kvm_vcpu_ioctl(struct file * r =3D kvm_arch_vcpu_ioctl_run(vcpu); vcpu->wants_to_run =3D false; =20 + /* + * FIXME: Remove this hack once all KVM architectures + * support the generic TIF bits, i.e. a dedicated TIF_RSEQ. + */ + rseq_virt_userspace_exit(); + trace_kvm_userspace_exit(vcpu->run->exit_reason, r); break; }
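For context, the re-raise scheme from the two hunks above can be modeled entirely in user space. The sketch below is illustration only, not kernel code; every name is a stand-in that merely mirrors the kernel identifiers:

#include <stdbool.h>
#include <stdio.h>

static bool tif_notify_resume;	/* models TIF_NOTIFY_RESUME */
static bool rseq_event_pending;	/* models task::rseq_event_pending */

static void vcpu_run_model(void)
{
	/* resume_user_mode_work() clears the TIF bit before guest entry */
	tif_notify_resume = false;
	/* A preemption while the guest runs records an rseq event */
	rseq_event_pending = true;
}

static void rseq_virt_userspace_exit_model(void)
{
	/* Re-raise so the exit to user mode path processes the event */
	if (rseq_event_pending)
		tif_notify_resume = true;
}

int main(void)
{
	vcpu_run_model();
	rseq_virt_userspace_exit_model();
	printf("TIF_NOTIFY_RESUME re-raised: %d\n", tif_notify_resume);
	return 0;
}

The point is the ordering: the pending event is re-evaluated after the run loop has terminated, so an event recorded while the TIF bit was clear cannot be lost on the way back to user space.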
From nobody Thu Dec 18 04:27:55 2025 Message-ID: <20251027084306.462964916@linutronix.de> From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 08/31] rseq: Avoid CPU/MM CID updates when no event pending References: <20251027084220.785525188@linutronix.de> Date: Mon, 27 Oct 2025 09:44:31 +0100 (CET) There is no need to update these values unconditionally if there is no event pending. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- kernel/rseq.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -464,11 +464,12 @@ void __rseq_handle_notify_resume(struct t->rseq_event_pending =3D false; } =20 - if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) { - ret =3D rseq_ip_fixup(regs, event); - if (unlikely(ret < 0)) - goto error; - } + if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event) + return; + + ret =3D rseq_ip_fixup(regs, event); + if (unlikely(ret < 0)) + goto error; =20 if (unlikely(rseq_update_cpu_node_id(t))) goto error;
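The transformed control flow relies on IS_ENABLED() folding to a compile time constant, so production kernels only pay for the event check. A user space analogue, for illustration only (the macro name is made up):

#include <stdbool.h>
#include <stdio.h>

#define DEBUG_RSEQ_MODEL 0	/* models CONFIG_DEBUG_RSEQ=n */

static bool notify_resume_model(bool event)
{
	/* With DEBUG_RSEQ_MODEL == 0 the compiler folds the first
	 * operand and this reduces to "if (!event) return false;" */
	if (!DEBUG_RSEQ_MODEL && !event)
		return false;
	/* heavy user space access and consistency checks follow here */
	return true;
}

int main(void)
{
	printf("no event -> work done: %d\n", notify_resume_model(false));
	printf("event    -> work done: %d\n", notify_resume_model(true));
	return 0;
}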
From nobody Thu Dec 18 04:27:55 2025 Message-ID: <20251027084306.527086690@linutronix.de> From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 09/31] rseq: Introduce struct rseq_data References: <20251027084220.785525188@linutronix.de> Date: Mon, 27 Oct 2025 09:44:33 +0100 (CET) In preparation for a major rewrite of this code, provide a data structure for rseq management. Put all the rseq related data into it (except for the debug part), which allows simplifying fork/execve by using memset() and memcpy() instead of adding new fields to initialize over and over. Create a storage struct for event management as well and put the sched_switch event and an indicator for RSEQ on a task into it as a start. That uses a union, which allows masking and clearing the whole lot efficiently. The indicators are explicitly not a bit field, as bit fields generate abysmal code. The boolean members are defined as u8 because that actually guarantees that a boolean fits; there seem to be strange architecture ABIs which need more than 8 bits for a boolean. The has_rseq member is redundant vs. task::rseq, but it turns out that boolean operations and quick checks on the union generate better code than fiddling with separate entities and data types. This struct will be extended over time to carry more information.
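To illustrate why the union beats separate booleans, here is a stand-alone sketch of the layout described above. It is a user space illustration only, with the kernel's u8/u16/u32 mapped to the stdint types:

#include <stdint.h>
#include <stdio.h>

struct rseq_event_model {
	union {
		uint32_t all;
		struct {
			union {
				uint16_t events;
				struct {
					uint8_t sched_switch;
				};
			};
			uint8_t has_rseq;
		};
	};
};

int main(void)
{
	struct rseq_event_model ev = { .all = 0 };

	ev.has_rseq = 1;
	ev.sched_switch = 1;
	/* A single 16-bit load tests all event bits at once */
	printf("events pending: %d\n", ev.events != 0);
	/* A single 32-bit store clears events and indicators together */
	ev.all = 0;
	printf("after clear: events=%u has_rseq=%u\n", ev.events, ev.has_rseq);
	return 0;
}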
Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- V4: Move all rseq related data into a dedicated umbrella struct --- include/linux/rseq.h | 48 +++++++++++++++------------------- include/linux/rseq_types.h | 51 ++++++++++++++++++++++++++++++++++++ include/linux/sched.h | 14 ++-------- kernel/ptrace.c | 6 ++-- kernel/rseq.c | 63 ++++++++++++++++++++++------------------= ----- 5 files changed, 110 insertions(+), 72 deletions(-) --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -9,22 +9,22 @@ void __rseq_handle_notify_resume(struct =20 static inline void rseq_handle_notify_resume(struct pt_regs *regs) { - if (current->rseq) + if (current->rseq.event.has_rseq) __rseq_handle_notify_resume(NULL, regs); } =20 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_reg= s *regs) { - if (current->rseq) { - current->rseq_event_pending =3D true; + if (current->rseq.event.has_rseq) { + current->rseq.event.sched_switch =3D true; __rseq_handle_notify_resume(ksig, regs); } } =20 static inline void rseq_sched_switch_event(struct task_struct *t) { - if (t->rseq) { - t->rseq_event_pending =3D true; + if (t->rseq.event.has_rseq) { + t->rseq.event.sched_switch =3D true; set_tsk_thread_flag(t, TIF_NOTIFY_RESUME); } } @@ -32,8 +32,9 @@ static inline void rseq_sched_switch_eve static __always_inline void rseq_exit_to_user_mode(void) { if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) { - if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending)) - current->rseq_event_pending =3D false; + if (WARN_ON_ONCE(current->rseq.event.has_rseq && + current->rseq.event.events)) + current->rseq.event.events =3D 0; } } =20 @@ -49,35 +50,30 @@ static __always_inline void rseq_exit_to */ static inline void rseq_virt_userspace_exit(void) { - if (current->rseq_event_pending) + if (current->rseq.event.sched_switch) set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); } =20 +static inline void rseq_reset(struct task_struct *t) +{ + memset(&t->rseq, 0, sizeof(t->rseq)); +} + +static inline void rseq_execve(struct task_struct *t) +{ + rseq_reset(t); +} + /* * If parent process has a registered restartable sequences area, the * child inherits. Unregister rseq for a clone with CLONE_VM set. 
*/ static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { - if (clone_flags & CLONE_VM) { - t->rseq =3D NULL; - t->rseq_len =3D 0; - t->rseq_sig =3D 0; - t->rseq_event_pending =3D false; - } else { + if (clone_flags & CLONE_VM) + rseq_reset(t); + else t->rseq =3D current->rseq; - t->rseq_len =3D current->rseq_len; - t->rseq_sig =3D current->rseq_sig; - t->rseq_event_pending =3D current->rseq_event_pending; - } -} - -static inline void rseq_execve(struct task_struct *t) -{ - t->rseq =3D NULL; - t->rseq_len =3D 0; - t->rseq_sig =3D 0; - t->rseq_event_pending =3D false; } =20 #else /* CONFIG_RSEQ */ --- /dev/null +++ b/include/linux/rseq_types.h @@ -0,0 +1,51 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_RSEQ_TYPES_H +#define _LINUX_RSEQ_TYPES_H + +#include + +#ifdef CONFIG_RSEQ +struct rseq; + +/** + * struct rseq_event - Storage for rseq related event management + * @all: Compound to initialize and clear the data efficiently + * @events: Compound to access events with a single load/store + * @sched_switch: True if the task was scheduled out + * @has_rseq: True if the task has a rseq pointer installed + */ +struct rseq_event { + union { + u32 all; + struct { + union { + u16 events; + struct { + u8 sched_switch; + }; + }; + + u8 has_rseq; + }; + }; +}; + +/** + * struct rseq_data - Storage for all rseq related data + * @usrptr: Pointer to the registered user space RSEQ memory + * @len: Length of the RSEQ region + * @sig: Signature of critial section abort IPs + * @event: Storage for event management + */ +struct rseq_data { + struct rseq __user *usrptr; + u32 len; + u32 sig; + struct rseq_event event; +}; + +#else /* CONFIG_RSEQ */ +struct rseq_data { }; +#endif /* !CONFIG_RSEQ */ + +#endif --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -41,6 +41,7 @@ #include #include #include +#include #include #include #include @@ -1406,16 +1407,8 @@ struct task_struct { unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ =20 -#ifdef CONFIG_RSEQ - struct rseq __user *rseq; - u32 rseq_len; - u32 rseq_sig; - /* - * RmW on rseq_event_pending must be performed atomically - * with respect to preemption. - */ - bool rseq_event_pending; -# ifdef CONFIG_DEBUG_RSEQ + struct rseq_data rseq; +#ifdef CONFIG_DEBUG_RSEQ /* * This is a place holder to save a copy of the rseq fields for * validation of read-only fields. The struct rseq has a @@ -1423,7 +1416,6 @@ struct task_struct { * directly. Reserve a size large enough for the known fields. */ char rseq_fields[sizeof(struct rseq)]; -# endif #endif =20 #ifdef CONFIG_SCHED_MM_CID --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -793,9 +793,9 @@ static long ptrace_get_rseq_configuratio unsigned long size, void __user *data) { struct ptrace_rseq_configuration conf =3D { - .rseq_abi_pointer =3D (u64)(uintptr_t)task->rseq, - .rseq_abi_size =3D task->rseq_len, - .signature =3D task->rseq_sig, + .rseq_abi_pointer =3D (u64)(uintptr_t)task->rseq.usrptr, + .rseq_abi_size =3D task->rseq.len, + .signature =3D task->rseq.sig, .flags =3D 0, }; =20 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -103,13 +103,13 @@ static int rseq_validate_ro_fields(struc DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); u32 cpu_id_start, cpu_id, node_id, mm_cid; - struct rseq __user *rseq =3D t->rseq; + struct rseq __user *rseq =3D t->rseq.usrptr; =20 /* * Validate fields which are required to be read-only by * user-space. 
*/ - if (!user_read_access_begin(rseq, t->rseq_len)) + if (!user_read_access_begin(rseq, t->rseq.len)) goto efault; unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end); unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end); @@ -147,10 +147,10 @@ static int rseq_validate_ro_fields(struc * Update an rseq field and its in-kernel copy in lock-step to keep a cohe= rent * state. */ -#define rseq_unsafe_put_user(t, value, field, error_label) \ - do { \ - unsafe_put_user(value, &t->rseq->field, error_label); \ - rseq_kernel_fields(t)->field =3D value; \ +#define rseq_unsafe_put_user(t, value, field, error_label) \ + do { \ + unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \ + rseq_kernel_fields(t)->field =3D value; \ } while (0) =20 #else @@ -160,12 +160,12 @@ static int rseq_validate_ro_fields(struc } =20 #define rseq_unsafe_put_user(t, value, field, error_label) \ - unsafe_put_user(value, &t->rseq->field, error_label) + unsafe_put_user(value, &t->rseq.usrptr->field, error_label) #endif =20 static int rseq_update_cpu_node_id(struct task_struct *t) { - struct rseq __user *rseq =3D t->rseq; + struct rseq __user *rseq =3D t->rseq.usrptr; u32 cpu_id =3D raw_smp_processor_id(); u32 node_id =3D cpu_to_node(cpu_id); u32 mm_cid =3D task_mm_cid(t); @@ -176,7 +176,7 @@ static int rseq_update_cpu_node_id(struc if (rseq_validate_ro_fields(t)) goto efault; WARN_ON_ONCE((int) mm_cid < 0); - if (!user_write_access_begin(rseq, t->rseq_len)) + if (!user_write_access_begin(rseq, t->rseq.len)) goto efault; =20 rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end); @@ -201,7 +201,7 @@ static int rseq_update_cpu_node_id(struc =20 static int rseq_reset_rseq_cpu_node_id(struct task_struct *t) { - struct rseq __user *rseq =3D t->rseq; + struct rseq __user *rseq =3D t->rseq.usrptr; u32 cpu_id_start =3D 0, cpu_id =3D RSEQ_CPU_ID_UNINITIALIZED, node_id =3D= 0, mm_cid =3D 0; =20 @@ -211,7 +211,7 @@ static int rseq_reset_rseq_cpu_node_id(s if (rseq_validate_ro_fields(t)) goto efault; =20 - if (!user_write_access_begin(rseq, t->rseq_len)) + if (!user_write_access_begin(rseq, t->rseq.len)) goto efault; =20 /* @@ -272,7 +272,7 @@ static int rseq_get_rseq_cs(struct task_ u32 sig; int ret; =20 - ret =3D rseq_get_rseq_cs_ptr_val(t->rseq, &ptr); + ret =3D rseq_get_rseq_cs_ptr_val(t->rseq.usrptr, &ptr); if (ret) return ret; =20 @@ -305,10 +305,10 @@ static int rseq_get_rseq_cs(struct task_ if (ret) return ret; =20 - if (current->rseq_sig !=3D sig) { + if (current->rseq.sig !=3D sig) { printk_ratelimited(KERN_WARNING "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%= x (pid=3D%d, addr=3D%p).\n", - sig, current->rseq_sig, current->pid, usig); + sig, current->rseq.sig, current->pid, usig); return -EINVAL; } return 0; @@ -338,7 +338,7 @@ static int rseq_check_flags(struct task_ return -EINVAL; =20 /* Get thread flags. */ - ret =3D get_user(flags, &t->rseq->flags); + ret =3D get_user(flags, &t->rseq.usrptr->flags); if (ret) return ret; =20 @@ -392,13 +392,13 @@ static int rseq_ip_fixup(struct pt_regs * Clear the rseq_cs pointer and return. */ if (!in_rseq_cs(ip, &rseq_cs)) - return clear_rseq_cs(t->rseq); + return clear_rseq_cs(t->rseq.usrptr); ret =3D rseq_check_flags(t, rseq_cs.flags); if (ret < 0) return ret; if (!abort) return 0; - ret =3D clear_rseq_cs(t->rseq); + ret =3D clear_rseq_cs(t->rseq.usrptr); if (ret) return ret; trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset, @@ -460,8 +460,8 @@ void __rseq_handle_notify_resume(struct * inconsistencies. 
*/ scoped_guard(RSEQ_EVENT_GUARD) { - event =3D t->rseq_event_pending; - t->rseq_event_pending =3D false; + event =3D t->rseq.event.sched_switch; + t->rseq.event.sched_switch =3D false; } =20 if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event) @@ -492,7 +492,7 @@ void rseq_syscall(struct pt_regs *regs) struct task_struct *t =3D current; struct rseq_cs rseq_cs; =20 - if (!t->rseq) + if (!t->rseq.usrptr) return; if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) force_sig(SIGSEGV); @@ -511,33 +511,31 @@ SYSCALL_DEFINE4(rseq, struct rseq __user if (flags & ~RSEQ_FLAG_UNREGISTER) return -EINVAL; /* Unregister rseq for current thread. */ - if (current->rseq !=3D rseq || !current->rseq) + if (current->rseq.usrptr !=3D rseq || !current->rseq.usrptr) return -EINVAL; - if (rseq_len !=3D current->rseq_len) + if (rseq_len !=3D current->rseq.len) return -EINVAL; - if (current->rseq_sig !=3D sig) + if (current->rseq.sig !=3D sig) return -EPERM; ret =3D rseq_reset_rseq_cpu_node_id(current); if (ret) return ret; - current->rseq =3D NULL; - current->rseq_sig =3D 0; - current->rseq_len =3D 0; + rseq_reset(current); return 0; } =20 if (unlikely(flags)) return -EINVAL; =20 - if (current->rseq) { + if (current->rseq.usrptr) { /* * If rseq is already registered, check whether * the provided address differs from the prior * one. */ - if (current->rseq !=3D rseq || rseq_len !=3D current->rseq_len) + if (current->rseq.usrptr !=3D rseq || rseq_len !=3D current->rseq.len) return -EINVAL; - if (current->rseq_sig !=3D sig) + if (current->rseq.sig !=3D sig) return -EPERM; /* Already registered. */ return -EBUSY; @@ -586,15 +584,16 @@ SYSCALL_DEFINE4(rseq, struct rseq __user * Activate the registration by setting the rseq area address, length * and signature in the task struct. */ - current->rseq =3D rseq; - current->rseq_len =3D rseq_len; - current->rseq_sig =3D sig; + current->rseq.usrptr =3D rseq; + current->rseq.len =3D rseq_len; + current->rseq.sig =3D sig; =20 /* * If rseq was previously inactive, and has just been * registered, ensure the cpu_id_start and cpu_id fields * are updated before returning to user-space. 
*/ + current->rseq.event.has_rseq =3D true; rseq_sched_switch_event(current); =20 return 0; From nobody Thu Dec 18 04:27:55 2025 Message-ID: <20251027084306.590338411@linutronix.de> From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E.
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 10/31] entry: Cleanup header References: <20251027084220.785525188@linutronix.de> Date: Mon, 27 Oct 2025 09:44:36 +0100 (CET) From: Thomas Gleixner Clean up the include ordering, kernel-doc comments and other trivialities before making further changes. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/entry-common.h | 8 ++++---- include/linux/irq-entry-common.h | 2 ++ 2 files changed, 6 insertions(+), 4 deletions(-) --- --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -3,11 +3,11 @@ #define __LINUX_ENTRYCOMMON_H =20 #include +#include #include +#include #include #include -#include -#include =20 #include #include @@ -37,6 +37,7 @@ SYSCALL_WORK_SYSCALL_AUDIT | \ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ ARCH_SYSCALL_WORK_ENTER) + #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ SYSCALL_WORK_SYSCALL_TRACE | \ SYSCALL_WORK_SYSCALL_AUDIT | \ @@ -61,8 +62,7 @@ */ void syscall_enter_from_user_mode_prepare(struct pt_regs *regs); =20 -long syscall_trace_enter(struct pt_regs *regs, long syscall, - unsigned long work); +long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long= work); =20 /** * syscall_enter_from_user_mode_work - Check and handle work before invoki= ng --- a/include/linux/irq-entry-common.h +++ b/include/linux/irq-entry-common.h @@ -68,6 +68,7 @@ static __always_inline bool arch_in_rcu_ =20 /** * enter_from_user_mode - Establish state when coming from user mode + * @regs: Pointer to currents pt_regs * * Syscall/interrupt entry disables interrupts, but user mode is traced as * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. @@ -357,6 +358,7 @@ irqentry_state_t noinstr irqentry_enter( * Conditional reschedule with additional sanity checks.
*/ void raw_irqentry_exit_cond_resched(void); + #ifdef CONFIG_PREEMPT_DYNAMIC #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL) #define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_= resched From nobody Thu Dec 18 04:27:55 2025 Message-ID: <20251027084306.652839989@linutronix.de> From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E.
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 11/31] entry: Remove syscall_enter_from_user_mode_prepare() References: <20251027084220.785525188@linutronix.de> Date: Mon, 27 Oct 2025 09:44:38 +0100 (CET) Open code the only user in the x86 syscall code and reduce the zoo of functions. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- arch/x86/entry/syscall_32.c | 3 ++- include/linux/entry-common.h | 26 +++++--------------------- kernel/entry/syscall-common.c | 8 -------- 3 files changed, 7 insertions(+), 30 deletions(-) --- a/arch/x86/entry/syscall_32.c +++ b/arch/x86/entry/syscall_32.c @@ -274,9 +274,10 @@ static noinstr bool __do_fast_syscall_32 * fetch EBP before invoking any of the syscall entry work * functions. */ - syscall_enter_from_user_mode_prepare(regs); + enter_from_user_mode(regs); =20 instrumentation_begin(); + local_irq_enable(); /* Fetch EBP from where the vDSO stashed it. */ if (IS_ENABLED(CONFIG_X86_64)) { /* --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -45,23 +45,6 @@ SYSCALL_WORK_SYSCALL_EXIT_TRAP | \ ARCH_SYSCALL_WORK_EXIT) =20 -/** - * syscall_enter_from_user_mode_prepare - Establish state and enable inter= rupts - * @regs: Pointer to currents pt_regs - * - * Invoked from architecture specific syscall entry code with interrupts - * disabled. The calling code has to be non-instrumentable. When the - * function returns all state is correct, interrupts are enabled and the - * subsequent functions can be instrumented. - * - * This handles lockdep, RCU (context tracking) and tracing state, i.e. - * the functionality provided by enter_from_user_mode(). - * - * This is invoked when there is extra architecture specific functionality - * to be done between establishing state and handling user mode entry work. - */ -void syscall_enter_from_user_mode_prepare(struct pt_regs *regs); - long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long= work); =20 /** @@ -71,8 +54,8 @@ long syscall_trace_enter(struct pt_regs * @syscall: The syscall number * * Invoked from architecture specific syscall entry code with interrupts - * enabled after invoking syscall_enter_from_user_mode_prepare() and extra - * architecture specific work. + * enabled after invoking enter_from_user_mode(), enabling interrupts and + * extra architecture specific work. * * Returns: The original or a modified syscall number * @@ -108,8 +91,9 @@ static __always_inline long syscall_ente * function returns all state is correct, interrupts are enabled and the * subsequent functions can be instrumented. * - * This is combination of syscall_enter_from_user_mode_prepare() and - * syscall_enter_from_user_mode_work(). + * This is the combination of enter_from_user_mode() and + * syscall_enter_from_user_mode_work() to be used when there is no + * architecture specific work to be done between the two. * * Returns: The original or a modified syscall number. See * syscall_enter_from_user_mode_work() for further explanation. --- a/kernel/entry/syscall-common.c +++ b/kernel/entry/syscall-common.c @@ -63,14 +63,6 @@ long syscall_trace_enter(struct pt_regs return ret ?
: syscall; } =20 -noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) -{ - enter_from_user_mode(regs); - instrumentation_begin(); - local_irq_enable(); - instrumentation_end(); -} - /* * If SYSCALL_EMU is set, then the only reason to report is when * SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall From nobody Thu Dec 18 04:27:55 2025 Message-ID: <20251027084306.715309918@linutronix.de> From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E.
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 12/31] entry: Inline irqentry_enter/exit_from/to_user_mode() References: <20251027084220.785525188@linutronix.de> Date: Mon, 27 Oct 2025 09:44:40 +0100 (CET) There is no point in having this as an out-of-line function which just calls enter_from_user_mode(). The function call overhead is larger than the function body itself. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/irq-entry-common.h | 13 +++++++++++-- kernel/entry/common.c | 13 ------------- 2 files changed, 11 insertions(+), 15 deletions(-) --- a/include/linux/irq-entry-common.h +++ b/include/linux/irq-entry-common.h @@ -278,7 +278,10 @@ static __always_inline void exit_to_user * * The function establishes state (lockdep, RCU (context tracking), tracin= g) */ -void irqentry_enter_from_user_mode(struct pt_regs *regs); +static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *= regs) +{ + enter_from_user_mode(regs); +} =20 /** * irqentry_exit_to_user_mode - Interrupt exit work @@ -293,7 +296,13 @@ void irqentry_enter_from_user_mode(struc * Interrupt exit is not invoking #1 which is the syscall specific one time * work. */ -void irqentry_exit_to_user_mode(struct pt_regs *regs); +static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *reg= s) +{ + instrumentation_begin(); + exit_to_user_mode_prepare(regs); + instrumentation_end(); + exit_to_user_mode(); +} =20 #ifndef irqentry_state /** --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -62,19 +62,6 @@ void __weak arch_do_signal_or_restart(st return ti_work; } =20 -noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) -{ - enter_from_user_mode(regs); -} - -noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) -{ - instrumentation_begin(); - exit_to_user_mode_prepare(regs); - instrumentation_end(); - exit_to_user_mode(); -} - noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) { irqentry_state_t ret =3D { From nobody Thu Dec 18 04:27:55 2025 Message-ID: <20251027084306.778457951@linutronix.de>
From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 13/31] sched: Move MM CID related functions to sched.h References: <20251027084220.785525188@linutronix.de> Date: Mon, 27 Oct 2025 09:44:42 +0100 (CET) There is nothing mm specific in these functions, and including mm.h can cause header recursion hell. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/mm.h | 25 ------------------------- include/linux/sched.h | 26 ++++++++++++++++++++++++++ 2 files changed, 26 insertions(+), 25 deletions(-) --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2401,31 +2401,6 @@ struct zap_details { /* Set in unmap_vmas() to indicate a final unmap call. Only used by huget= lb */ #define ZAP_FLAG_UNMAP ((__force zap_flags_t) BIT(1)) =20 -#ifdef CONFIG_SCHED_MM_CID -void sched_mm_cid_before_execve(struct task_struct *t); -void sched_mm_cid_after_execve(struct task_struct *t); -void sched_mm_cid_fork(struct task_struct *t); -void sched_mm_cid_exit_signals(struct task_struct *t); -static inline int task_mm_cid(struct task_struct *t) -{ - return t->mm_cid; -} -#else -static inline void sched_mm_cid_before_execve(struct task_struct *t) { } -static inline void sched_mm_cid_after_execve(struct task_struct *t) { } -static inline void sched_mm_cid_fork(struct task_struct *t) { } -static inline void sched_mm_cid_exit_signals(struct task_struct *t) { } -static inline int task_mm_cid(struct task_struct *t) -{ - /* - * Use the processor id as a fall-back when the mm cid feature is - * disabled.
This provides functional per-cpu data structure accesses - in user-space, althrough it won't provide the memory usage benefits. - */ - return raw_smp_processor_id(); -} -#endif - #ifdef CONFIG_MMU extern bool can_do_mlock(void); #else --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2310,6 +2310,32 @@ static __always_inline void alloc_tag_re #define alloc_tag_restore(_tag, _old) do {} while (0) #endif =20 +/* Avoids recursive inclusion hell */ +#ifdef CONFIG_SCHED_MM_CID +void sched_mm_cid_before_execve(struct task_struct *t); +void sched_mm_cid_after_execve(struct task_struct *t); +void sched_mm_cid_fork(struct task_struct *t); +void sched_mm_cid_exit_signals(struct task_struct *t); +static inline int task_mm_cid(struct task_struct *t) +{ + return t->mm_cid; +} +#else +static inline void sched_mm_cid_before_execve(struct task_struct *t) { } +static inline void sched_mm_cid_after_execve(struct task_struct *t) { } +static inline void sched_mm_cid_fork(struct task_struct *t) { } +static inline void sched_mm_cid_exit_signals(struct task_struct *t) { } +static inline int task_mm_cid(struct task_struct *t) +{ + /* + * Use the processor id as a fall-back when the mm cid feature is + * disabled. This provides functional per-cpu data structure accesses + * in user-space, althrough it won't provide the memory usage benefits. + */ + return task_cpu(t); +} +#endif + #ifndef MODULE #ifndef COMPILE_OFFSETS From nobody Thu Dec 18 04:27:55 2025 Message-ID: <20251027084306.841964081@linutronix.de>
From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 14/31] rseq: Cache CPU ID and MM CID values References: <20251027084220.785525188@linutronix.de> Date: Mon, 27 Oct 2025 09:44:45 +0100 (CET) In preparation for rewriting the RSEQ exit to user space handling, provide storage to cache the CPU ID and MM CID values which were last written to user space. That prepares for a quick check, which avoids the update when nothing has changed. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/rseq.h | 7 +++++-- include/linux/rseq_types.h | 21 +++++++++++++++++++++ include/trace/events/rseq.h | 4 ++-- kernel/rseq.c | 4 ++++ 4 files changed, 32 insertions(+), 4 deletions(-) --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -57,6 +57,7 @@ static inline void rseq_virt_userspace_e static inline void rseq_reset(struct task_struct *t) { memset(&t->rseq, 0, sizeof(t->rseq)); + t->rseq.ids.cpu_cid =3D ~0ULL; } =20 static inline void rseq_execve(struct ta @@ -70,10 +71,12 @@ static inline void rseq_execve(struct ta */ static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { - if (clone_flags & CLONE_VM) + if (clone_flags & CLONE_VM) { rseq_reset(t); - else + } else { t->rseq =3D current->rseq; + t->rseq.ids.cpu_cid =3D ~0ULL; + } } =20 #else /* CONFIG_RSEQ */ --- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -31,17 +31,38 @@ struct rseq_event { }; =20 /** + * struct rseq_ids - Cache for ids, which need to be updated + * @cpu_cid: Compound of @cpu_id and @mm_cid to make the + * compiler emit a single compare on 64-bit + * @cpu_id: The CPU ID which was written last to user space + * @mm_cid: The MM CID which was written last to user space + * + * @cpu_id and @mm_cid are updated when the data is written to user space.
+ */ +struct rseq_ids { + union { + u64 cpu_cid; + struct { + u32 cpu_id; + u32 mm_cid; + }; + }; +}; + +/** * struct rseq_data - Storage for all rseq related data * @usrptr: Pointer to the registered user space RSEQ memory * @len: Length of the RSEQ region * @sig: Signature of critial section abort IPs * @event: Storage for event management + * @ids: Storage for cached CPU ID and MM CID */ struct rseq_data { struct rseq __user *usrptr; u32 len; u32 sig; struct rseq_event event; + struct rseq_ids ids; }; =20 #else /* CONFIG_RSEQ */ --- a/include/trace/events/rseq.h +++ b/include/trace/events/rseq.h @@ -21,9 +21,9 @@ TRACE_EVENT(rseq_update, ), =20 TP_fast_assign( - __entry->cpu_id =3D raw_smp_processor_id(); + __entry->cpu_id =3D t->rseq.ids.cpu_id; __entry->node_id =3D cpu_to_node(__entry->cpu_id); - __entry->mm_cid =3D task_mm_cid(t); + __entry->mm_cid =3D t->rseq.ids.mm_cid; ), =20 TP_printk("cpu_id=3D%d node_id=3D%d mm_cid=3D%d", __entry->cpu_id, --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -184,6 +184,10 @@ static int rseq_update_cpu_node_id(struc rseq_unsafe_put_user(t, node_id, node_id, efault_end); rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end); =20 + /* Cache the user space values */ + t->rseq.ids.cpu_id =3D cpu_id; + t->rseq.ids.mm_cid =3D mm_cid; + /* * Additional feature fields added after ORIG_RSEQ_SIZE * need to be conditionally updated only if
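As an aside, the effect of the @cpu_cid compound introduced above can be demonstrated stand-alone (user space illustration only, stdint types instead of the kernel ones): one 64-bit compare replaces two 32-bit compares.

#include <stdint.h>
#include <stdio.h>

struct rseq_ids_model {
	union {
		uint64_t cpu_cid;
		struct {
			uint32_t cpu_id;
			uint32_t mm_cid;
		};
	};
};

int main(void)
{
	struct rseq_ids_model cached = { .cpu_id = 3, .mm_cid = 1 };
	struct rseq_ids_model now = { .cpu_id = 3, .mm_cid = 1 };

	/* One compare covers both IDs. ~0ULL can never match real IDs,
	 * which is why rseq_reset() above poisons the cache with it. */
	if (cached.cpu_cid == now.cpu_cid)
		printf("IDs unchanged, user space update can be skipped\n");
	return 0;
}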
From nobody Thu Dec 18 04:27:55 2025 Message-ID: <20251027084306.905067101@linutronix.de> From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 15/31] rseq: Record interrupt from user space References: <20251027084220.785525188@linutronix.de> Date: Mon, 27 Oct 2025 09:44:48 +0100 (CET) For RSEQ the only relevant reason to inspect and possibly fix up (abort) user space critical sections is that user space was interrupted and the task was scheduled out. If the user to kernel entry was from a syscall, no fixup is required. If user space invokes a syscall from inside a critical section, it can keep the pieces as documented. This is only supported on architectures which utilize the generic entry code. If your architecture does not use it, bad luck. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/irq-entry-common.h | 3 ++- include/linux/rseq.h | 16 +++++++++++----- include/linux/rseq_entry.h | 18 ++++++++++++++++++ include/linux/rseq_types.h | 2 ++ 4 files changed, 33 insertions(+), 6 deletions(-) --- a/include/linux/irq-entry-common.h +++ b/include/linux/irq-entry-common.h @@ -4,7 +4,7 @@ =20 #include #include -#include +#include #include #include #include @@ -281,6 +281,7 @@ static __always_inline void exit_to_user static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *= regs) { enter_from_user_mode(regs); + rseq_note_user_irq_entry(); } =20 /** --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -31,11 +31,17 @@ static inline void rseq_sched_switch_eve =20 static __always_inline void rseq_exit_to_user_mode(void) { - if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) { - if (WARN_ON_ONCE(current->rseq.event.has_rseq && - current->rseq.event.events)) - current->rseq.event.events =3D 0; - } + struct rseq_event *ev =3D &current->rseq.event; + + if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) + WARN_ON_ONCE(ev->sched_switch); + + /* + * Ensure that event (especially user_irq) is cleared when the + * interrupt did not result in a schedule and therefore the + * rseq processing did not clear it.
+ */ + ev->events =3D 0; } --- /dev/null +++ b/include/linux/rseq_entry.h @@ -0,0 +1,18 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_RSEQ_ENTRY_H +#define _LINUX_RSEQ_ENTRY_H + +#ifdef CONFIG_RSEQ +#include + +static __always_inline void rseq_note_user_irq_entry(void) +{ + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) + current->rseq.event.user_irq =3D true; +} + +#else /* CONFIG_RSEQ */ +static inline void rseq_note_user_irq_entry(void) { } +#endif /* !CONFIG_RSEQ */ + +#endif /* _LINUX_RSEQ_ENTRY_H */ --- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -12,6 +12,7 @@ struct rseq; * @all: Compound to initialize and clear the data efficiently * @events: Compound to access events with a single load/store * @sched_switch: True if the task was scheduled out + * @user_irq: True on interrupt entry from user mode * @has_rseq: True if the task has a rseq pointer installed */ struct rseq_event { @@ -22,6 +23,7 @@ struct rseq_event { u16 events; struct { u8 sched_switch; + u8 user_irq; }; };
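Together with the sched_switch bit, the new user_irq bit enables the policy stated in the changelog: only a task which entered the kernel via an interrupt from user space and was then scheduled out can need a critical section fixup. A user space model, for illustration only; the kernel side consumes the flag in later patches of this series:

#include <stdbool.h>
#include <stdio.h>

struct event_model {
	bool sched_switch;	/* task was scheduled out */
	bool user_irq;		/* entered the kernel via user space interrupt */
};

static void syscall_entry(struct event_model *ev)
{
	(void)ev;		/* syscall entry does not mark an interrupt */
}

static void irq_entry_from_user(struct event_model *ev)
{
	ev->user_irq = true;
}

static bool needs_cs_fixup(const struct event_model *ev)
{
	return ev->user_irq && ev->sched_switch;
}

int main(void)
{
	struct event_model ev = { false, false };

	syscall_entry(&ev);
	ev.sched_switch = true;
	printf("syscall + switch, fixup: %d\n", needs_cs_fixup(&ev));

	irq_entry_from_user(&ev);
	printf("user irq + switch, fixup: %d\n", needs_cs_fixup(&ev));
	return 0;
}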
R5ndSyFOnUBUQhXkWh5ZZU/Hpb8vTtrTfNQ8UbX9oHnbg/KlVq98gecHnQ1KAuzhaZ4svm NjwxuhcVHup4yXN6PcWF1YXF2yqrBDcBeTCVAW2TLZ46EnE6RH9GsZrf9W4RCw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1761554691; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=VHbEMEUbJ9sABjNNjENKnjw6SUQbj/5g+w8+hNtQ8+E=; b=PKsprpb8Ka3VIvEcrAEuC7L3xVtk19Lig0hn6BMwJs9yIxKi8GLDjRyFGYVA+O+r5PNc9g +pipklUlVwqqWjDg== From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 16/31] rseq: Provide tracepoint wrappers for inline code References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:44:50 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline fast path, so that the header can be safely included by code which defines actual trace points. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- V4: Fix the fallback stub V3: Get rid of one indentation level - Mathieu --- include/linux/rseq_entry.h | 28 ++++++++++++++++++++++++++++ kernel/rseq.c | 19 ++++++++++++++++++- 2 files changed, 46 insertions(+), 1 deletion(-) --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -5,6 +5,34 @@ #ifdef CONFIG_RSEQ #include =20 +#include + +#ifdef CONFIG_TRACEPOINTS +DECLARE_TRACEPOINT(rseq_update); +DECLARE_TRACEPOINT(rseq_ip_fixup); +void __rseq_trace_update(struct task_struct *t); +void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip, + unsigned long offset, unsigned long abort_ip); + +static inline void rseq_trace_update(struct task_struct *t, struct rseq_id= s *ids) +{ + if (tracepoint_enabled(rseq_update) && ids) + __rseq_trace_update(t); +} + +static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long sta= rt_ip, + unsigned long offset, unsigned long abort_ip) +{ + if (tracepoint_enabled(rseq_ip_fixup)) + __rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip); +} + +#else /* CONFIG_TRACEPOINT */ +static inline void rseq_trace_update(struct task_struct *t, struct rseq_id= s *ids) { } +static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long sta= rt_ip, + unsigned long offset, unsigned long abort_ip) { } +#endif /* !CONFIG_TRACEPOINT */ + static __always_inline void rseq_note_user_irq_entry(void) { if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -70,7 +70,7 @@ #include #include #include -#include +#include #include #include #include @@ -91,6 +91,23 @@ RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \ RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE) =20 +#ifdef CONFIG_TRACEPOINTS +/* + * Out of line, so the actual update functions can be in a header to be + * inlined into the exit to user code. 
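+ * tracepoint_enabled() only tests the tracepoint's static key, so this
+ * header stays includable without pulling in the trace event definitions.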
+ */ +void __rseq_trace_update(struct task_struct *t) +{ + trace_rseq_update(t); +} + +void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip, + unsigned long offset, unsigned long abort_ip) +{ + trace_rseq_ip_fixup(ip, start_ip, offset, abort_ip); +} +#endif /* CONFIG_TRACEPOINTS */ + #ifdef CONFIG_DEBUG_RSEQ static struct rseq *rseq_kernel_fields(struct task_struct *t) { From nobody Thu Dec 18 04:27:55 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DFA13311589 for ; Mon, 27 Oct 2025 08:44:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554697; cv=none; b=fDWr78rsNBuUUZgW1QIMX0mO1xqY6idQTETxBc3ug011A+yeb6+fNmBg3rXc8vfuJ02HlnjV+lhlS2xQsoi3vMeS96unZhE9oSYs/eAoko7rfK38CmkGDcQqPRGULRZb1FmHgibD9Jg7S9ulEQiRoOyk7DPL0tjKDH6yeBNi1z0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554697; c=relaxed/simple; bh=B+Y+pOXitQoaWLJ1kcDAfzI0A9do8rK8pEI6PVmIT1g=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=lOoUz9+trujiV4hdJA5cHVQAYyxQaUzgRbkzz9P4/+7LUuNXrQ78MwtoCqsN55JWR93Rxs9Yh7kLCZKy2jcXmuhSNk1TewCR5ZCOeXUYwFK92BXFtTXj45kvfBAuIRiFRH3jecNm3P8OuKxa5utVTjcZdNMu03UHxlpgyVw0TuU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=jTEuZnJI; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=mj2BVX0d; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="jTEuZnJI"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="mj2BVX0d" Message-ID: <20251027084307.027916598@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1761554694; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=KX9qzElSLVZczNJHncTxzo6fDg8+4rvI+ShPqh1Vaw4=; b=jTEuZnJIGLujry9LcleT2VLsPVFX2rzpJYTx07hkkiLx5gBP7NeGkns4VCxqbPzsawI2kZ MPv5/8Otb/PIQKiG04EDkIuuynednuAl21VHd2JGyzMbD957DDco2cJMdpSGsYDOnGQ+pz uDNZPo20yYlW0b4hZlfz7RsaEgN44Id2VmyO/JUOUssOHHj57ebb6I62i+E2FxYLFuVmrv lOr8TrwG5VRZrCMhHK3oahLF27+EAGRRflFwftjDYTbWTfuz54W8q0gML0KSf+pq5RYv9c 132IIU0o09wPl5/EbuUQ5CCKVDrEfCqtpHorU4+inOz4H3uO5jvdBGIPs5XzxQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1761554694; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=KX9qzElSLVZczNJHncTxzo6fDg8+4rvI+ShPqh1Vaw4=; b=mj2BVX0d+R2jFcDuDd6L2f4nvREY02u4LAGtDYQKQwq4BeJ1XHEOgiPAQ2fAnEiCUToJxn EkqEg41VwGNPbrAQ== From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. 
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 17/31] rseq: Expose lightweight statistics in debugfs References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:44:52 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Analyzing the call frequency without actually using tracing is helpful for analysis of this infrastructure. The overhead is minimal as it just increments a per CPU counter associated to each operation. The debugfs readout provides a racy sum of all counters. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/rseq.h | 16 --------- include/linux/rseq_entry.h | 49 +++++++++++++++++++++++++++ init/Kconfig | 12 ++++++ kernel/rseq.c | 79 ++++++++++++++++++++++++++++++++++++++++= +---- 4 files changed, 133 insertions(+), 23 deletions(-) --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -29,21 +29,6 @@ static inline void rseq_sched_switch_eve } } =20 -static __always_inline void rseq_exit_to_user_mode(void) -{ - struct rseq_event *ev =3D ¤t->rseq.event; - - if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) - WARN_ON_ONCE(ev->sched_switch); - - /* - * Ensure that event (especially user_irq) is cleared when the - * interrupt did not result in a schedule and therefore the - * rseq processing did not clear it. - */ - ev->events =3D 0; -} - /* * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode, * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in @@ -92,7 +77,6 @@ static inline void rseq_sched_switch_eve static inline void rseq_virt_userspace_exit(void) { } static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { } static inline void rseq_execve(struct task_struct *t) { } -static inline void rseq_exit_to_user_mode(void) { } #endif /* !CONFIG_RSEQ */ =20 #ifdef CONFIG_DEBUG_RSEQ --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -2,6 +2,37 @@ #ifndef _LINUX_RSEQ_ENTRY_H #define _LINUX_RSEQ_ENTRY_H =20 +/* Must be outside the CONFIG_RSEQ guard to resolve the stubs */ +#ifdef CONFIG_RSEQ_STATS +#include + +struct rseq_stats { + unsigned long exit; + unsigned long signal; + unsigned long slowpath; + unsigned long ids; + unsigned long cs; + unsigned long clear; + unsigned long fixup; +}; + +DECLARE_PER_CPU(struct rseq_stats, rseq_stats); + +/* + * Slow path has interrupts and preemption enabled, but the fast path + * runs with interrupts disabled so there is no point in having the + * preemption checks implied in __this_cpu_inc() for every operation. + */ +#ifdef RSEQ_BUILD_SLOW_PATH +#define rseq_stat_inc(which) this_cpu_inc((which)) +#else +#define rseq_stat_inc(which) raw_cpu_inc((which)) +#endif + +#else /* CONFIG_RSEQ_STATS */ +#define rseq_stat_inc(x) do { } while (0) +#endif /* !CONFIG_RSEQ_STATS */ + #ifdef CONFIG_RSEQ #include =20 @@ -39,8 +70,26 @@ static __always_inline void rseq_note_us current->rseq.event.user_irq =3D true; } =20 +static __always_inline void rseq_exit_to_user_mode(void) +{ + struct rseq_event *ev =3D ¤t->rseq.event; + + rseq_stat_inc(rseq_stats.exit); + + if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) + WARN_ON_ONCE(ev->sched_switch); + + /* + * Ensure that event (especially user_irq) is cleared when the + * interrupt did not result in a schedule and therefore the + * rseq processing did not clear it. 
+ */ + ev->events =3D 0; +} + #else /* CONFIG_RSEQ */ static inline void rseq_note_user_irq_entry(void) { } +static inline void rseq_exit_to_user_mode(void) { } #endif /* !CONFIG_RSEQ */ =20 #endif /* _LINUX_RSEQ_ENTRY_H */ --- a/init/Kconfig +++ b/init/Kconfig @@ -1913,6 +1913,18 @@ config RSEQ =20 If unsure, say Y. =20 +config RSEQ_STATS + default n + bool "Enable lightweight statistics of restartable sequences" if EXPERT + depends on RSEQ && DEBUG_FS + help + Enable lightweight counters which expose information about the + frequency of RSEQ operations via debugfs. Mostly interesting for + kernel debugging or performance analysis. While lightweight, it + still adds code to the user/kernel mode transitions. + + If unsure, say N. + config DEBUG_RSEQ default n bool "Enable debugging of rseq() system call" if EXPERT --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -67,12 +67,16 @@ * F1. */ =20 +/* Required to select the proper per_cpu ops for rseq_stat_inc() */ +#define RSEQ_BUILD_SLOW_PATH + +#include +#include +#include #include -#include #include -#include +#include #include -#include #include =20 #define CREATE_TRACE_POINTS @@ -108,6 +112,56 @@ void __rseq_trace_ip_fixup(unsigned long } #endif /* CONFIG_TRACEPOINTS */ =20 +#ifdef CONFIG_RSEQ_STATS +DEFINE_PER_CPU(struct rseq_stats, rseq_stats); + +static int rseq_debug_show(struct seq_file *m, void *p) +{ + struct rseq_stats stats =3D { }; + unsigned int cpu; + + for_each_possible_cpu(cpu) { + stats.exit +=3D data_race(per_cpu(rseq_stats.exit, cpu)); + stats.signal +=3D data_race(per_cpu(rseq_stats.signal, cpu)); + stats.slowpath +=3D data_race(per_cpu(rseq_stats.slowpath, cpu)); + stats.ids +=3D data_race(per_cpu(rseq_stats.ids, cpu)); + stats.cs +=3D data_race(per_cpu(rseq_stats.cs, cpu)); + stats.clear +=3D data_race(per_cpu(rseq_stats.clear, cpu)); + stats.fixup +=3D data_race(per_cpu(rseq_stats.fixup, cpu)); + } + + seq_printf(m, "exit: %16lu\n", stats.exit); + seq_printf(m, "signal: %16lu\n", stats.signal); + seq_printf(m, "slowp: %16lu\n", stats.slowpath); + seq_printf(m, "ids: %16lu\n", stats.ids); + seq_printf(m, "cs: %16lu\n", stats.cs); + seq_printf(m, "clear: %16lu\n", stats.clear); + seq_printf(m, "fixup: %16lu\n", stats.fixup); + return 0; +} + +static int rseq_debug_open(struct inode *inode, struct file *file) +{ + return single_open(file, rseq_debug_show, inode->i_private); +} + +static const struct file_operations dfs_ops =3D { + .open =3D rseq_debug_open, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D single_release, +}; + +static int __init rseq_debugfs_init(void) +{ + struct dentry *root_dir =3D debugfs_create_dir("rseq", NULL); + + debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops); + return 0; +} +__initcall(rseq_debugfs_init); +#endif /* CONFIG_RSEQ_STATS */ + #ifdef CONFIG_DEBUG_RSEQ static struct rseq *rseq_kernel_fields(struct task_struct *t) { @@ -187,12 +241,13 @@ static int rseq_update_cpu_node_id(struc u32 node_id =3D cpu_to_node(cpu_id); u32 mm_cid =3D task_mm_cid(t); =20 - /* - * Validate read-only rseq fields.
- */ + rseq_stat_inc(rseq_stats.ids); + + /* Validate read-only rseq fields on debug kernels */ if (rseq_validate_ro_fields(t)) goto efault; WARN_ON_ONCE((int) mm_cid < 0); + if (!user_write_access_begin(rseq, t->rseq.len)) goto efault; =20 @@ -403,6 +458,8 @@ static int rseq_ip_fixup(struct pt_regs struct rseq_cs rseq_cs; int ret; =20 + rseq_stat_inc(rseq_stats.cs); + ret =3D rseq_get_rseq_cs(t, &rseq_cs); if (ret) return ret; @@ -412,8 +469,10 @@ static int rseq_ip_fixup(struct pt_regs * If not nested over a rseq critical section, restart is useless. * Clear the rseq_cs pointer and return. */ - if (!in_rseq_cs(ip, &rseq_cs)) + if (!in_rseq_cs(ip, &rseq_cs)) { + rseq_stat_inc(rseq_stats.clear); return clear_rseq_cs(t->rseq.usrptr); + } ret =3D rseq_check_flags(t, rseq_cs.flags); if (ret < 0) return ret; @@ -422,6 +481,7 @@ static int rseq_ip_fixup(struct pt_regs ret =3D clear_rseq_cs(t->rseq.usrptr); if (ret) return ret; + rseq_stat_inc(rseq_stats.fixup); trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset, rseq_cs.abort_ip); instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip); @@ -462,6 +522,11 @@ void __rseq_handle_notify_resume(struct if (unlikely(t->flags & PF_EXITING)) return; =20 + if (ksig) + rseq_stat_inc(rseq_stats.signal); + else + rseq_stat_inc(rseq_stats.slowpath); + /* * Read and clear the event pending bit first. If the task * was not preempted or migrated or a signal is on the way, From nobody Thu Dec 18 04:27:55 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9FA8D311973 for ; Mon, 27 Oct 2025 08:44:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554700; cv=none; b=fhUdKAQkgR/pgh32+A/BMuUy/KewDTDC8g/C6RWmUu0aJVMA/i8r18lh9r9YE6UoQzA9ar/eOSD5bQ4e4kff5g+/RXfF0mpPIH4eMwpQExZp7lC1gSc+wm617O1nHTQpWVQ4EcWkBYg8jsRkAOB+q5JNg+YWT5ic+Sxp8oUm1h0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554700; c=relaxed/simple; bh=mZE+IKfmra3it9D5GudDwCheOh2yfHZ6WxI+YxI5n4g=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=ox9VtMh/E3RnAbO5F6MqJ8eYkAH9/Svl35qPwm2TqaR1wgeU07fdmaY3n9uJ9pTKjrp1OWDYVMExjBHhMsDyspoTWpmYIyFCZ3tSFyH2Aeb+YhKz3KIdvJxUtAZaGl29dOVzyCaOAj8XBXryqZhbGTOz8bwcxKTR4QzPJDM0jXE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=Gm2f8GLb; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=tgnBgrBn; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="Gm2f8GLb"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="tgnBgrBn" Message-ID: <20251027084307.089270547@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1761554696; 
h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=W6ZRe1LPMaChVqbOzxGyTRfKzj1c25tUJ5+3MWmUJGI=; b=Gm2f8GLbC0Zbm22uh5CvUZEyRJwJQpi3Ooj+ynAwqBi5m72ktNk3NCQCz2jJcGUCaOaGVq +3Umo0jusKyTvuBf6on6RuNV5NfO5/8w0qhkwWPr2GZX+Xx4XC3xwqe8dI+OxAu9Z0hc1k U+g1Uy9SmLzM7GphQDDgMs0/y59Y4cOqYt4Zc5rrwQdayy5cQ+xbBcvFhAbLqaCqCDpXEY tvazohZiMBEOOasTHg/kMFIWG5NpwMbOCZ2hDsbG+5F4QSdH3t5/so18HgdGgweuTCQInz YIKnibpy8z9igq8svmaNFKQu+w5E3uQ6FIlO+epIaAIZ2oSMXZNmcpm8RcEdDA== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1761554696; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=W6ZRe1LPMaChVqbOzxGyTRfKzj1c25tUJ5+3MWmUJGI=; b=tgnBgrBnlXretwK+ZOofK796lkn4TUA6CL3tWcI2xxRshW6jrh7T3ATeN5+uysegZwW2rB W2C4/exDaIjANnAw== From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 18/31] rseq: Provide static branch for runtime debugging References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:44:55 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Config based debug is rarely turned on and is not available easily when things go wrong. Provide a static branch to allow permanent integration of debug mechanisms along with the usual toggles in Kconfig, command line and debugfs. Requested-by: Peter Zijlstra Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- V3: Fix __setup() return value - Michael --- Documentation/admin-guide/kernel-parameters.txt | 4 + include/linux/rseq_entry.h | 3=20 init/Kconfig | 14 ++++ kernel/rseq.c | 73 +++++++++++++++++++= +++-- 4 files changed, 90 insertions(+), 4 deletions(-) --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -6443,6 +6443,10 @@ Memory area to be used by remote processor image, managed by CMA. =20 + rseq_debug=3D [KNL] Enable or disable restartable sequence + debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE. + Format: + rt_group_sched=3D [KNL] Enable or disable SCHED_RR/FIFO group scheduling when CONFIG_RT_GROUP_SCHED=3Dy. Defaults to !CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED. --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -34,6 +34,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_ #endif /* !CONFIG_RSEQ_STATS */ =20 #ifdef CONFIG_RSEQ +#include #include =20 #include @@ -64,6 +65,8 @@ static inline void rseq_trace_ip_fixup(u unsigned long offset, unsigned long abort_ip) { } #endif /* !CONFIG_TRACEPOINT */ =20 +DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enab= led); + static __always_inline void rseq_note_user_irq_entry(void) { if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) --- a/init/Kconfig +++ b/init/Kconfig @@ -1895,10 +1895,24 @@ config RSEQ_STATS =20 If unsure, say N. =20 +config RSEQ_DEBUG_DEFAULT_ENABLE + default n + bool "Enable restartable sequences debug mode by default" if EXPERT + depends on RSEQ + help + This enables the static branch for debug mode of restartable + sequences. 
+ + This can also be controlled on the kernel command line via the + command line parameter "rseq_debug=3D0/1" and through debugfs. + + If unsure, say N. + config DEBUG_RSEQ default n bool "Enable debugging of rseq() system call" if EXPERT depends on RSEQ && DEBUG_KERNEL + select RSEQ_DEBUG_DEFAULT_ENABLE help Enable extra debugging checks for the rseq system call. =20 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -95,6 +95,27 @@ RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \ RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE) =20 +DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabl= ed); + +static inline void rseq_control_debug(bool on) +{ + if (on) + static_branch_enable(&rseq_debug_enabled); + else + static_branch_disable(&rseq_debug_enabled); +} + +static int __init rseq_setup_debug(char *str) +{ + bool on; + + if (kstrtobool(str, &on)) + return -EINVAL; + rseq_control_debug(on); + return 1; +} +__setup("rseq_debug=3D", rseq_setup_debug); + #ifdef CONFIG_TRACEPOINTS /* * Out of line, so the actual update functions can be in a header to be @@ -112,10 +133,11 @@ void __rseq_trace_ip_fixup(unsigned long } #endif /* CONFIG_TRACEPOINTS */ =20 +#ifdef CONFIG_DEBUG_FS #ifdef CONFIG_RSEQ_STATS DEFINE_PER_CPU(struct rseq_stats, rseq_stats); =20 -static int rseq_debug_show(struct seq_file *m, void *p) +static int rseq_stats_show(struct seq_file *m, void *p) { struct rseq_stats stats =3D { }; unsigned int cpu; @@ -140,14 +162,56 @@ static int rseq_debug_show(struct seq_fi return 0; } =20 +static int rseq_stats_open(struct inode *inode, struct file *file) +{ + return single_open(file, rseq_stats_show, inode->i_private); +} + +static const struct file_operations stat_ops =3D { + .open =3D rseq_stats_open, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D single_release, +}; + +static void __init rseq_stats_init(struct dentry *root_dir) +{ + debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops); +} +#else +static inline void rseq_stats_init(struct dentry *root_dir) { } +#endif /* CONFIG_RSEQ_STATS */ + +static int rseq_debug_show(struct seq_file *m, void *p) +{ + bool on =3D static_branch_unlikely(&rseq_debug_enabled); + + seq_printf(m, "%d\n", on); + return 0; +} + +static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf, + size_t count, loff_t *ppos) +{ + bool on; + + if (kstrtobool_from_user(ubuf, count, &on)) + return -EINVAL; + + rseq_control_debug(on); + return count; +} + static int rseq_debug_open(struct inode *inode, struct file *file) { return single_open(file, rseq_debug_show, inode->i_private); } =20 -static const struct file_operations dfs_ops =3D { +static const struct file_operations debug_ops =3D { .open =3D rseq_debug_open, .read =3D seq_read, + .write =3D rseq_debug_write, .llseek =3D seq_lseek, .release =3D single_release, }; @@ -156,11 +220,12 @@ static int __init rseq_debugfs_init(void { struct dentry *root_dir =3D debugfs_create_dir("rseq", NULL); =20 - debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops); + debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops); + rseq_stats_init(root_dir); return 0; } __initcall(rseq_debugfs_init); -#endif /* CONFIG_RSEQ_STATS */ +#endif /* CONFIG_DEBUG_FS */ =20 #ifdef CONFIG_DEBUG_RSEQ static struct rseq *rseq_kernel_fields(struct task_struct *t) From nobody Thu Dec 18 04:27:55 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by
smtp.subspace.kernel.org (Postfix) with ESMTPS id 61E7F3126BD for ; Mon, 27 Oct 2025 08:45:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554705; cv=none; b=txUYJHega0czahMAXLqOAM6dUvPnogvX4q9CMxK5x2aJ+FJTzxki8R2NarAOAbKkneWeOzlX6EZDB1F/gv3EOLTu5UDtgsLwIc5kdMDq0YgRIsAxqXDfu0Q2oRA4pNfxMeQ2CT9PUvuK2Q8ZdT5gMDQV9lrpazZh0vgjKgeL/zk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554705; c=relaxed/simple; bh=wnpyS7MR6j5AX9Njs9cGtEsemOrL9f0wJ2cL5/IM25U=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=BFKLUAY4N7HRfS2ZNTcxvJR0Tmf54Wq1tB7SYib8HkJHROm8MK84e1oobb7+QjNc3yz+NR4+JofB4jGkZAselRkBANwJk0RIA53//njH4fQ/d2nLZI7jKGaWsV0MvCgh8B2fiHML4QQp5Y5lmTAFewpn17tdX44ZxgcfUJ14WQM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=4tal/WVl; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=Eo421YoU; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="4tal/WVl"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="Eo421YoU" Message-ID: <20251027084307.151465632@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1761554699; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=qfsvAMLyGa4KdByhzd8WrGTjVPetwm5tPiglFumOTYs=; b=4tal/WVltD0oi40VQ7BGqS3Emh2uemeTOplKIbfiEEUwXq8ogJDpeDwdg9eWStxV27QD7O 8CEgt1NKJMekzsYnV42aMkyNe0YrmBQ6aegcSR4uBIjjWOCTAHs4oafXLPGD3O+/0fUTGB 6YGHFKw6Sp1WC4f6Se16Oc2m99PfjJw6CGDTGRK/UOpfrBPwaQZNmb4Nujym3kcXfjDp5g NQiYS4pVmRShgou7feD91M+ann+rJdt6pbD1S2B8R7FDvfMQbQuhi/kq3+tsFl799wQ2RD RjQ/E+SLYoco1XcO0erWKDAnZsfq6LRk0sVeDk/j8GdVOe1Ypdzxlke8JQytJw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1761554699; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=qfsvAMLyGa4KdByhzd8WrGTjVPetwm5tPiglFumOTYs=; b=Eo421YoUcpyk6qqCBq2pUwY64BIJ6deHKnAUquFh2Ev4ovdAw8Dx9Mk7NNpfQfki2R0qDj u7MMr1gNTS5rs7BA== From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 19/31] rseq: Provide and use rseq_update_user_cs() References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:44:57 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Provide a straight forward implementation to check for and eventually clear/fixup critical sections in user space. The non-debug version does only the minimal sanity checks and aims for efficiency. 
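For illustration, a minimal sketch of the user space side (not part of this patch). It shows where the signature checked during the fixup comes from: it is the value the application hands to sys_rseq() at registration time, and the 32bit word preceding every abort handler has to match it. RSEQ_SIG is application chosen; 0x53053053 is merely the constant the kernel selftests use:

#define _GNU_SOURCE
#include <linux/rseq.h>		/* UAPI struct rseq */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#define RSEQ_SIG	0x53053053

static __thread struct rseq rs __attribute__((aligned(32)));

int main(void)
{
	/* glibc might have registered an area already; this fails then */
	if (syscall(__NR_rseq, &rs, sizeof(rs), 0, RSEQ_SIG))
		return 1;

	printf("registered, running on CPU %u\n", rs.cpu_id);
	return 0;
}

Once registered, user space stores critical section descriptor pointers in rs.rseq_cs; that member and the descriptor it points to are what the code below reads and validates.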
There are two attack vectors which are checked for: 1) An abort IP which is in the kernel address space. That would cause at least x86 to return to kernel space via IRET. 2) A rogue critical section descriptor with an abort IP pointing to some arbitrary address, which is not preceded by the RSEQ signature. If the section descriptors are invalid then the resulting misbehaviour of the user space application is not the kernel's problem. The kernel provides a run-time switchable debug slow path, which implements the full zoo of checks including termination of the task when one of the gazillion conditions is not met. Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will be replaced and removed in a subsequent step. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- V5: Update comments and fix typos - Mathieu V3: Brought back the signature check along with a comment - Mathieu --- include/linux/rseq_entry.h | 206 +++++++++++++++++++++++++++++++++++++ include/linux/rseq_types.h | 11 +- kernel/rseq.c | 244 +++++++++++++---------------------------= ----- 3 files changed, 290 insertions(+), 171 deletions(-) --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -36,6 +36,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_ #ifdef CONFIG_RSEQ #include #include +#include =20 #include =20 @@ -67,12 +68,217 @@ static inline void rseq_trace_ip_fixup(u =20 DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enab= led); =20 +#ifdef RSEQ_BUILD_SLOW_PATH +#define rseq_inline +#else +#define rseq_inline __always_inline +#endif + +bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs= , unsigned long csaddr); + static __always_inline void rseq_note_user_irq_entry(void) { if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) current->rseq.event.user_irq =3D true; } =20 +/* + * Check whether there is a valid critical section and whether the + * instruction pointer in @regs is inside the critical section. + * + * - If the critical section is invalid, terminate the task. + * + * - If valid and the instruction pointer is inside, set it to the abort = IP + * + * - If valid and the instruction pointer is outside, clear the critical + * section address. + * + * Returns true if the section was valid and either fixup or clear was + * done, false otherwise. + * + * In the failure case task::rseq_event::fatal is set when an invalid + * section was found. It is false when the failure was an unresolved page + * fault. + * + * If inlined into the exit to user path with interrupts disabled, the + * caller has to protect against page faults with pagefault_disable(). + * + * In preemptible task context this would be counterproductive as the page + * faults could not be fully resolved. As a consequence unresolved page + * faults in task context are fatal too. + */ + +#ifdef RSEQ_BUILD_SLOW_PATH +/* + * The debug version is put out of line, but kept here so the code stays + * together.
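+ * kernel/rseq.c defines RSEQ_BUILD_SLOW_PATH before including this
+ * header and thereby compiles this function out of line.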
+ * + * @csaddr has already been checked by the caller to be in user space + */ +bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, + unsigned long csaddr) +{ + struct rseq_cs __user *ucs =3D (struct rseq_cs __user *)(unsigned long)cs= addr; + u64 start_ip, abort_ip, offset, cs_end, head, tasksize =3D TASK_SIZE; + unsigned long ip =3D instruction_pointer(regs); + u64 __user *uc_head =3D (u64 __user *) ucs; + u32 usig, __user *uc_sig; + + scoped_user_rw_access(ucs, efault) { + /* + * Evaluate the user pile and exit if one of the conditions + * is not fulfilled. + */ + unsafe_get_user(start_ip, &ucs->start_ip, efault); + if (unlikely(start_ip >=3D tasksize)) + goto die; + /* If outside, just clear the critical section. */ + if (ip < start_ip) + goto clear; + + unsafe_get_user(offset, &ucs->post_commit_offset, efault); + cs_end =3D start_ip + offset; + /* Check for overflow and wraparound */ + if (unlikely(cs_end >=3D tasksize || cs_end < start_ip)) + goto die; + + /* If not inside, clear it. */ + if (ip >=3D cs_end) + goto clear; + + unsafe_get_user(abort_ip, &ucs->abort_ip, efault); + /* Ensure it's "valid" */ + if (unlikely(abort_ip >=3D tasksize || abort_ip < sizeof(*uc_sig))) + goto die; + /* Validate that the abort IP is not in the critical section */ + if (unlikely(abort_ip - start_ip < offset)) + goto die; + + /* + * Check version and flags for 0. No point in emitting + * deprecated warnings before dying. That could be done in + * the slow path eventually, but *shrug*. + */ + unsafe_get_user(head, uc_head, efault); + if (unlikely(head)) + goto die; + + /* abort_ip - 4 is >=3D 0. See abort_ip check above */ + uc_sig =3D (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig)); + unsafe_get_user(usig, uc_sig, efault); + if (unlikely(usig !=3D t->rseq.sig)) + goto die; + + /* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=3Dy */ + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + /* If not in interrupt from user context, let it die */ + if (unlikely(!t->rseq.event.user_irq)) + goto die; + } + unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault); + instruction_pointer_set(regs, (unsigned long)abort_ip); + rseq_stat_inc(rseq_stats.fixup); + break; + clear: + unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault); + rseq_stat_inc(rseq_stats.clear); + abort_ip =3D 0ULL; + } + + if (unlikely(abort_ip)) + rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip); + return true; +die: + t->rseq.event.fatal =3D true; +efault: + return false; +} + +#endif /* RSEQ_BUILD_SLOW_PATH */ + +/* + * This only ensures that abort_ip is in the user address space and + * validates that it is preceded by the signature. + * + * No other sanity checks are done here, that's what the debug code is for. + */ +static rseq_inline bool +rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned = long csaddr) +{ + struct rseq_cs __user *ucs =3D (struct rseq_cs __user *)(unsigned long)cs= addr; + unsigned long ip =3D instruction_pointer(regs); + u64 start_ip, abort_ip, offset; + u32 usig, __user *uc_sig; + + rseq_stat_inc(rseq_stats.cs); + + if (unlikely(csaddr >=3D TASK_SIZE)) { + t->rseq.event.fatal =3D true; + return false; + } + + if (static_branch_unlikely(&rseq_debug_enabled)) + return rseq_debug_update_user_cs(t, regs, csaddr); + + scoped_user_rw_access(ucs, efault) { + unsafe_get_user(start_ip, &ucs->start_ip, efault); + unsafe_get_user(offset, &ucs->post_commit_offset, efault); + unsafe_get_user(abort_ip, &ucs->abort_ip, efault); + + /* + * No sanity checks. 
If user space screwed it up, it can + * keep the pieces. That's what debug code is for. + * + * If outside, just clear the critical section. + */ + if (ip - start_ip >=3D offset) + goto clear; + + /* + * Two requirements for @abort_ip: + * - Must be in user space as x86 IRET would happily return to + * the kernel. + * - The four bytes preceding the instruction at @abort_ip must + * contain the signature. + * + * The latter protects against the following attack vector: + * + * An attacker with limited abilities to write creates a critical + * section descriptor, sets the abort IP to a library function or + * some other ROP gadget and stores the address of the descriptor + * in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP + * protection. + */ + if (abort_ip >=3D TASK_SIZE || abort_ip < sizeof(*uc_sig)) + goto die; + + /* The address is guaranteed to be >=3D 0 and < TASK_SIZE */ + uc_sig =3D (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig)); + unsafe_get_user(usig, uc_sig, efault); + if (unlikely(usig !=3D t->rseq.sig)) + goto die; + + /* Invalidate the critical section */ + unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault); + /* Update the instruction pointer */ + instruction_pointer_set(regs, (unsigned long)abort_ip); + rseq_stat_inc(rseq_stats.fixup); + break; + clear: + unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault); + rseq_stat_inc(rseq_stats.clear); + abort_ip =3D 0ULL; + } + + if (unlikely(abort_ip)) + rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip); + return true; +die: + t->rseq.event.fatal =3D true; +efault: + return false; +} + static __always_inline void rseq_exit_to_user_mode(void) { struct rseq_event *ev =3D &current->rseq.event; --- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -14,10 +14,12 @@ struct rseq; * @sched_switch: True if the task was scheduled out * @user_irq: True on interrupt entry from user mode * @has_rseq: True if the task has a rseq pointer installed + * @error: Compound error code for the slow path to analyze + * @fatal: User space data corrupted or invalid */ struct rseq_event { union { - u32 all; + u64 all; struct { union { u16 events; @@ -28,6 +30,13 @@ struct rseq_event { }; =20 u8 has_rseq; + u8 __pad; + union { + u16 error; + struct { + u8 fatal; + }; + }; }; }; }; --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -382,175 +382,18 @@ static int rseq_reset_rseq_cpu_node_id(s return -EFAULT; } =20 -/* - * Get the user-space pointer value stored in the 'rseq_cs' field. - */ -static int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs) -{ - if (!rseq_cs) - return -EFAULT; - -#ifdef CONFIG_64BIT - if (get_user(*rseq_cs, &rseq->rseq_cs)) - return -EFAULT; -#else - if (copy_from_user(rseq_cs, &rseq->rseq_cs, sizeof(*rseq_cs))) - return -EFAULT; -#endif - - return 0; -} - -/* - * If the rseq_cs field of 'struct rseq' contains a valid pointer to - * user-space, copy 'struct rseq_cs' from user-space and validate its fiel= ds. - */ -static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) -{ - struct rseq_cs __user *urseq_cs; - u64 ptr; - u32 __user *usig; - u32 sig; - int ret; - - ret =3D rseq_get_rseq_cs_ptr_val(t->rseq.usrptr, &ptr); - if (ret) - return ret; - - /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */ - if (!ptr) { - memset(rseq_cs, 0, sizeof(*rseq_cs)); - return 0; - } - /* Check that the pointer value fits in the user-space process space.
*/ - if (ptr >=3D TASK_SIZE) - return -EINVAL; - urseq_cs =3D (struct rseq_cs __user *)(unsigned long)ptr; - if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) - return -EFAULT; - - if (rseq_cs->start_ip >=3D TASK_SIZE || - rseq_cs->start_ip + rseq_cs->post_commit_offset >=3D TASK_SIZE || - rseq_cs->abort_ip >=3D TASK_SIZE || - rseq_cs->version > 0) - return -EINVAL; - /* Check for overflow. */ - if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip) - return -EINVAL; - /* Ensure that abort_ip is not in the critical section. */ - if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset) - return -EINVAL; - - usig =3D (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32)); - ret =3D get_user(sig, usig); - if (ret) - return ret; - - if (current->rseq.sig !=3D sig) { - printk_ratelimited(KERN_WARNING - "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%= x (pid=3D%d, addr=3D%p).\n", - sig, current->rseq.sig, current->pid, usig); - return -EINVAL; - } - return 0; -} - -static bool rseq_warn_flags(const char *str, u32 flags) +static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs) { - u32 test_flags; + struct rseq __user *urseq =3D t->rseq.usrptr; + u64 csaddr; =20 - if (!flags) - return false; - test_flags =3D flags & RSEQ_CS_NO_RESTART_FLAGS; - if (test_flags) - pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, st= r); - test_flags =3D flags & ~RSEQ_CS_NO_RESTART_FLAGS; - if (test_flags) - pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str); - return true; -} - -static int rseq_check_flags(struct task_struct *t, u32 cs_flags) -{ - u32 flags; - int ret; - - if (rseq_warn_flags("rseq_cs", cs_flags)) - return -EINVAL; - - /* Get thread flags. */ - ret =3D get_user(flags, &t->rseq.usrptr->flags); - if (ret) - return ret; - - if (rseq_warn_flags("rseq", flags)) - return -EINVAL; - return 0; -} - -static int clear_rseq_cs(struct rseq __user *rseq) -{ - /* - * The rseq_cs field is set to NULL on preemption or signal - * delivery on top of rseq assembly block, as well as on top - * of code outside of the rseq assembly block. This performs - * a lazy clear of the rseq_cs field. - * - * Set rseq_cs to NULL. - */ -#ifdef CONFIG_64BIT - return put_user(0UL, &rseq->rseq_cs); -#else - if (clear_user(&rseq->rseq_cs, sizeof(rseq->rseq_cs))) - return -EFAULT; - return 0; -#endif -} - -/* - * Unsigned comparison will be true when ip >=3D start_ip, and when - * ip < start_ip + post_commit_offset. - */ -static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs) -{ - return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset; -} - -static int rseq_ip_fixup(struct pt_regs *regs, bool abort) -{ - unsigned long ip =3D instruction_pointer(regs); - struct task_struct *t =3D current; - struct rseq_cs rseq_cs; - int ret; - - rseq_stat_inc(rseq_stats.cs); - - ret =3D rseq_get_rseq_cs(t, &rseq_cs); - if (ret) - return ret; - - /* - * Handle potentially not being within a critical section. - * If not nested over a rseq critical section, restart is useless. - * Clear the rseq_cs pointer and return. 
- */ - if (!in_rseq_cs(ip, &rseq_cs)) { - rseq_stat_inc(rseq_stats.clear); - return clear_rseq_cs(t->rseq.usrptr); - } - ret =3D rseq_check_flags(t, rseq_cs.flags); - if (ret < 0) - return ret; - if (!abort) - return 0; - ret =3D clear_rseq_cs(t->rseq.usrptr); - if (ret) - return ret; - rseq_stat_inc(rseq_stats.fixup); - trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset, - rseq_cs.abort_ip); - instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip); - return 0; + scoped_user_read_access(urseq, efault) + unsafe_get_user(csaddr, &urseq->rseq_cs, efault); + if (likely(!csaddr)) + return true; + return rseq_update_user_cs(t, regs, csaddr); +efault: + return false; } =20 /* @@ -567,8 +410,8 @@ static int rseq_ip_fixup(struct pt_regs void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *reg= s) { struct task_struct *t =3D current; - int ret, sig; bool event; + int sig; =20 /* * If invoked from hypervisors before entering the guest via @@ -618,8 +461,7 @@ void __rseq_handle_notify_resume(struct if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event) return; =20 - ret =3D rseq_ip_fixup(regs, event); - if (unlikely(ret < 0)) + if (!rseq_handle_cs(t, regs)) goto error; =20 if (unlikely(rseq_update_cpu_node_id(t))) @@ -632,6 +474,68 @@ void __rseq_handle_notify_resume(struct } =20 #ifdef CONFIG_DEBUG_RSEQ +/* + * Unsigned comparison will be true when ip >=3D start_ip, and when + * ip < start_ip + post_commit_offset. + */ +static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs) +{ + return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset; +} + +/* + * If the rseq_cs field of 'struct rseq' contains a valid pointer to + * user-space, copy 'struct rseq_cs' from user-space and validate its fiel= ds. + */ +static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) +{ + struct rseq __user *urseq =3D t->rseq.usrptr; + struct rseq_cs __user *urseq_cs; + u32 __user *usig; + u64 ptr; + u32 sig; + int ret; + + if (get_user(ptr, &urseq->rseq_cs)) + return -EFAULT; + + /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */ + if (!ptr) { + memset(rseq_cs, 0, sizeof(*rseq_cs)); + return 0; + } + /* Check that the pointer value fits in the user-space process space. */ + if (ptr >=3D TASK_SIZE) + return -EINVAL; + urseq_cs =3D (struct rseq_cs __user *)(unsigned long)ptr; + if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) + return -EFAULT; + + if (rseq_cs->start_ip >=3D TASK_SIZE || + rseq_cs->start_ip + rseq_cs->post_commit_offset >=3D TASK_SIZE || + rseq_cs->abort_ip >=3D TASK_SIZE || + rseq_cs->version > 0) + return -EINVAL; + /* Check for overflow. */ + if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip) + return -EINVAL; + /* Ensure that abort_ip is not in the critical section. */ + if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset) + return -EINVAL; + + usig =3D (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32)); + ret =3D get_user(sig, usig); + if (ret) + return ret; + + if (current->rseq.sig !=3D sig) { + printk_ratelimited(KERN_WARNING + "Possible attack attempt.
Unexpected rseq signature 0x%x, expecting 0x%= x (pid=3D%d, addr=3D%p).\n", + sig, current->rseq.sig, current->pid, usig); + return -EINVAL; + } + return 0; +} =20 /* * Terminate the process if a syscall is issued within a restartable From nobody Thu Dec 18 04:27:55 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 61DF42ECE97 for ; Mon, 27 Oct 2025 08:45:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554704; cv=none; b=AbHHIUIqLh7EkV7YiBSk8XzkmWLf4C+gd77nliEY9nkmqbGifnS5kA7WEeu/0r/dsx9qRmrSf2NU3CN1z7H2u1yVWTULxrk7MGqWoccpbGb4BYmlFXF7ydBo0iQmQmaZqI0oqOMnobXETYr2T9ePwltDj/S1gFvFkcJVELiYp4A= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554704; c=relaxed/simple; bh=o8+anuL765YDWkk9Rva3WdYqhOZW6a1Ijwwe0cH/gYI=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=LuZvYaODI5A7lYDYdRstlKj27eji5BCYOQ9IJIVGRwG9TmVS6iOXh0JIDKwclW+GAFMACe5Q00zlCAsA0Ps9Kp3CuAODcpOz3getlG3H8QolnIY14uJy8v21NwaHUlVNP2TULitGtyeGu2qUTUUpoGdszKK823G5ncrcGIV4EBg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=37PXP75Y; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=nmzYVHVT; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="37PXP75Y"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="nmzYVHVT" Message-ID: <20251027084307.212510692@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1761554701; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=kLGu5WotexOEVXsBMru+GZK4x5QTJRGoqn6jv1HQqss=; b=37PXP75YEqC9b71Q9wGTzoinPRuU1oCR6/MEdB9ySSh3dMU9XSdKO5bbVNW6SbBeTPtf1F ltjnAdqwADEpVcPWaOvlcqB+XIHMfpjdNyXEGzqsunzEnUb56Je54YrBv2sq8Xz5lX5AJ1 cX294hQE161wyqI5iqj8FRO2NUkVRKU/A+3knmrUU9Rp4GjxhUltaoX3ecOWIXs7nK0yP8 xOHAG38OGr4IOd/lgDzwFBMiqe/yCbIu5acgAdTyYrjuXtzFg5cw+PZEZfUYdrcTteqI5D s0oQFanw0FwhZVGBS/0+Io350Bhs9uI+62xJj19tGi7kOkasYmgIqLjkvPsZ0g== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1761554701; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=kLGu5WotexOEVXsBMru+GZK4x5QTJRGoqn6jv1HQqss=; b=nmzYVHVTgq9zTKd4aDMnOzKgqTNIXLX0z7bp4j168JTIbZzdJ60rexVw7CH383J5qhrahM VUI0LutfjdmMgODg== From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. 
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 20/31] rseq: Replace the original debug implementation References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:45:00 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Just utilize the new infrastructure and put the original one to rest. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- kernel/rseq.c | 81 ++++++++---------------------------------------------= ----- 1 file changed, 12 insertions(+), 69 deletions(-) --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -475,84 +475,27 @@ void __rseq_handle_notify_resume(struct =20 #ifdef CONFIG_DEBUG_RSEQ /* - * Unsigned comparison will be true when ip >=3D start_ip, and when - * ip < start_ip + post_commit_offset. - */ -static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs) -{ - return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset; -} - -/* - * If the rseq_cs field of 'struct rseq' contains a valid pointer to - * user-space, copy 'struct rseq_cs' from user-space and validate its fiel= ds. - */ -static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) -{ - struct rseq __user *urseq =3D t->rseq.usrptr; - struct rseq_cs __user *urseq_cs; - u32 __user *usig; - u64 ptr; - u32 sig; - int ret; - - if (get_user(ptr, &rseq->rseq_cs)) - return -EFAULT; - - /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */ - if (!ptr) { - memset(rseq_cs, 0, sizeof(*rseq_cs)); - return 0; - } - /* Check that the pointer value fits in the user-space process space. */ - if (ptr >=3D TASK_SIZE) - return -EINVAL; - urseq_cs =3D (struct rseq_cs __user *)(unsigned long)ptr; - if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) - return -EFAULT; - - if (rseq_cs->start_ip >=3D TASK_SIZE || - rseq_cs->start_ip + rseq_cs->post_commit_offset >=3D TASK_SIZE || - rseq_cs->abort_ip >=3D TASK_SIZE || - rseq_cs->version > 0) - return -EINVAL; - /* Check for overflow. */ - if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip) - return -EINVAL; - /* Ensure that abort_ip is not in the critical section. */ - if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset) - return -EINVAL; - - usig =3D (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32)); - ret =3D get_user(sig, usig); - if (ret) - return ret; - - if (current->rseq.sig !=3D sig) { - printk_ratelimited(KERN_WARNING - "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%= x (pid=3D%d, addr=3D%p).\n", - sig, current->rseq.sig, current->pid, usig); - return -EINVAL; - } - return 0; -} - -/* * Terminate the process if a syscall is issued within a restartable * sequence. 
*/ void rseq_syscall(struct pt_regs *regs) { - unsigned long ip =3D instruction_pointer(regs); struct task_struct *t =3D current; - struct rseq_cs rseq_cs; + u64 csaddr; =20 - if (!t->rseq.usrptr) + if (!t->rseq.event.has_rseq) + return; + if (get_user(csaddr, &t->rseq.usrptr->rseq_cs)) + goto fail; + if (likely(!csaddr)) return; - if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) - force_sig(SIGSEGV); + if (unlikely(csaddr >=3D TASK_SIZE)) + goto fail; + if (rseq_debug_update_user_cs(t, regs, csaddr)) + return; +fail: + force_sig(SIGSEGV); } - #endif =20 /* From nobody Thu Dec 18 04:27:55 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4D40F2E6CDA for ; Mon, 27 Oct 2025 08:45:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554707; cv=none; b=AE0vMF5V5vZqc8Nx8jdukYBMBWSzSez1j5GGtDBdfPsZZC/j5UczJwPs4Y04EQN2Cnuzmx3iM3f8rOyPP2l0WiPw68UYu9GBFtpKVULZr0pUHCTqrtrlZhkHKwL+WLyu59mtJi5NKkh4VxzAmHsktL+xh3SLOsTzw53IQISgjnI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554707; c=relaxed/simple; bh=wxY4ImD5W6x3C9tGA0rcVshMwz4u/5PbLiTCRuNHt8k=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=IZEco99CiRdbZvQ9XKNdOt20X/WnShbfbD7xOQs1SUCbPsqPVFkUVQL7Mdpe33/UPR3zcCtIu5Z/3eAInq1E1j0GrOdUledVtcGUKYC3fGc0S/GemBPOozdQNiMVFfAeE263OyoFHVYruFedsezpxgz+6Akt/1EqDDUSS4WE4Xg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=xjQO/N/k; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=tpJcTG+I; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="xjQO/N/k"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="tpJcTG+I" Message-ID: <20251027084307.272660745@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1761554704; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=YhF9cftFkrvHXR70/3Ri3JrRmZx0vcwHRd7bammTRVs=; b=xjQO/N/kvwikHCtoiAKLezYK//vEdAVhm9fVII8JXIIhzYBWQCbpr3zoESWcvI8mo6bbCd CTrMcV1egUd+FfMHPvw/x9lEkimn7uVtLXswi350c3ZaeUJY3Bd8Yk67pNGD7a/33CWopj SSTSibefdO4xSfYF93cYHnbz0ks1WtSPnAyfxVcVfeYmotOVICNHYt5EmTxN8SQ+YXFt9C EBEjXgnSF4/uCxY98MjHH5hozNsxa2twzqxYe4t8kn+64WmBXveCfkipzskqVTnGnpNnwa dDJuAcYfQMUMKqXChvovGiI8Jot3V7LdDmtE+s4EwBM9yQe4NAvR8MLhOd8zIw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1761554704; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=YhF9cftFkrvHXR70/3Ri3JrRmZx0vcwHRd7bammTRVs=;
b=tpJcTG+IU+ZiJu8llUwQtBh07cIEY2jrda3VNZcIX6Of+VVrg3lf/TWZhREs17CePuXZa6 1weFkjTCMkNM10AQ== From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 21/31] rseq: Make exit debugging static branch based References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:45:02 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Disconnect it from the config switch and use the static debug branch. This is a temporary measure for validating the rework. At the end this check needs to be hidden behind lockdep as it has nothing to do with the other debug infrastructure, which mainly aids user space debugging by enabling a zoo of checks which terminate misbehaving tasks instead of letting them keep the hard to diagnose pieces. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/rseq_entry.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -285,7 +285,7 @@ static __always_inline void rseq_exit_to =20 rseq_stat_inc(rseq_stats.exit); =20 - if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) + if (static_branch_unlikely(&rseq_debug_enabled)) WARN_ON_ONCE(ev->sched_switch); =20 /* From nobody Thu Dec 18 04:27:55 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 77BB03128A9 for ; Mon, 27 Oct 2025 08:45:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554711; cv=none; b=H26weZO78WKC1IzLppxYdfj8fPmJVNVqvwGBt2afWx8+ja23lWMsHTCAJSYRGHTKWfafg22aP3Xg3uj6G/BKKohrE+zrKLChgzplkLzvz8AXQ756aggXixWzXs+IeyAtYggVUJOa9V9m70on1rWM44m21ulu8W+ifnOel7vQe24= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761554711; c=relaxed/simple; bh=bPbzRdKi9rgjvSjUDEAF6P/eZSUApyAMpjWvEgf2IEE=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=Kc79dVKqW4F1/EtdhnX5wB4cT7XUYUMZ5P/Z+3freF3VOvx055lqgDlrp1oymFwM1SVdcFETwthXOc65NLVxZxesN58PbMU1gP0PdiUhCPViVqvTUhFie4MawrjecNfJlPQQDpl0Ye61b3kGX2qifsZFl8p3oUx++Rc/EgdU93I= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=pEvomM0S; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=Rky9C62P; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="pEvomM0S"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="Rky9C62P" Message-ID: <20251027084307.333440475@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1761554706; 
h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=oi2tDgCN5RQeyHe3kWjPPl5jVL1zsw6ax8RUhiC9WMA=; b=pEvomM0SuIMQ6js/40L9WTfBU79auvCOi8md4uWEQHb50Id9eV4S7VGNtvnLBm4hrZUlCb xF2vcc0tWiST4vM+Se9rKuZsHa/Xq5OMDNX5FeM/daVPIamgyQXdjL7k/QPFx981epD4MN r2g4vJ8KGTmt2CJs0oBrk3svNHBcQyB1PqqObbJNFA9SO3K3wVctDRsOtKDSUcuhYJ+P2y BAs9seX+0Iv3y3ciVNK89T0nilnFzzi777FbGiCtdCKLoq6x+BJcjSa9S5s8eektRT38mZ fytOEZtvHdH8e3dEnz+bUvK/TFerKpe6Pcr+w22NU7B3cJc7wx3orZOxSSVtsw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1761554706; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=oi2tDgCN5RQeyHe3kWjPPl5jVL1zsw6ax8RUhiC9WMA=; b=Rky9C62Pg5//lMpgPey+qiCRrsdLjRYf6tS6x+pWAgaf/4622vWDSfJXlSlVM3EDibxfC2 exFQ6sgoN8MTLdDA== From: Thomas Gleixner To: LKML Cc: Michael Jeanson , Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 22/31] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:45:05 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Make the syscall exit debug mechanism available via the static branch on architectures which utilize the generic entry code. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/entry-common.h | 2 +- include/linux/rseq_entry.h | 9 +++++++++ kernel/rseq.c | 10 ++++++++-- 3 files changed, 18 insertions(+), 3 deletions(-) --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -146,7 +146,7 @@ static __always_inline void syscall_exit local_irq_enable(); } =20 - rseq_syscall(regs); + rseq_debug_syscall_return(regs); =20 /* * Do one-time syscall specific work. If these work items are --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -296,9 +296,18 @@ static __always_inline void rseq_exit_to ev->events =3D 0; } =20 +void __rseq_debug_syscall_return(struct pt_regs *regs); + +static inline void rseq_debug_syscall_return(struct pt_regs *regs) +{ + if (static_branch_unlikely(&rseq_debug_enabled)) + __rseq_debug_syscall_return(regs); +} + #else /* CONFIG_RSEQ */ static inline void rseq_note_user_irq_entry(void) { } static inline void rseq_exit_to_user_mode(void) { } +static inline void rseq_debug_syscall_return(struct pt_regs *regs) { } #endif /* !CONFIG_RSEQ */ =20 #endif /* _LINUX_RSEQ_ENTRY_H */ --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -473,12 +473,11 @@ void __rseq_handle_notify_resume(struct force_sigsegv(sig); } =20 -#ifdef CONFIG_DEBUG_RSEQ /* * Terminate the process if a syscall is issued within a restartable * sequence. */ -void rseq_syscall(struct pt_regs *regs) +void __rseq_debug_syscall_return(struct pt_regs *regs) { struct task_struct *t =3D current; u64 csaddr; @@ -496,6 +495,13 @@ void rseq_syscall(struct pt_regs *regs) fail: force_sig(SIGSEGV); } + +#ifdef CONFIG_DEBUG_RSEQ +/* Kept around to keep GENERIC_ENTRY=3Dn architectures supported. 
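+ * Those architectures invoke rseq_syscall() from their own syscall
+ * exit paths.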
+void rseq_syscall(struct pt_regs *regs)
+{
+	__rseq_debug_syscall_return(regs);
+}
 #endif
 
 /*

From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084307.393972266@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 23/31] rseq: Provide and use rseq_set_ids() References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:45:08 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Provide a new and straight forward implementation to set the IDs (CPU ID, Node ID and MM CID), which can be later inlined into the fast path. It does all operations in one scoped_user_rw_access() section and retrieves also the critical section member (rseq::cs_rseq) from user space to avoid another user..begin/end() pair. This is in preparation for optimizing the fast path to avoid extra work when not required. On rseq registration set the CPU ID fields to RSEQ_CPU_ID_UNINITIALIZED and node and MM CID to zero. That's the same as the kernel internal reset values. That makes the debug validation in the exit code work correctly on the first exit to user space. Use it to replace the whole related zoo in rseq.c Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- V5: Initialize IDs on registration, keep them on fork and lift the first exit restriction in the debug code - Mathieu V3: Fixed the node ID comparison in the debug path - Mathieu --- fs/binfmt_elf.c | 2=20 include/linux/rseq.h | 16 ++- include/linux/rseq_entry.h | 89 ++++++++++++++++ include/linux/sched.h | 10 - kernel/rseq.c | 237 +++++++++-------------------------------= ----- 5 files changed, 152 insertions(+), 202 deletions(-) --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -46,7 +46,7 @@ #include #include #include -#include +#include #include #include =20 --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -5,6 +5,8 @@ #ifdef CONFIG_RSEQ #include =20 +#include + void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs= ); =20 static inline void rseq_handle_notify_resume(struct pt_regs *regs) @@ -48,7 +50,7 @@ static inline void rseq_virt_userspace_e static inline void rseq_reset(struct task_struct *t) { memset(&t->rseq, 0, sizeof(t->rseq)); - t->rseq.ids.cpu_cid =3D ~0ULL; + t->rseq.ids.cpu_id =3D RSEQ_CPU_ID_UNINITIALIZED; } =20 static inline void rseq_execve(struct task_struct *t) @@ -59,15 +61,19 @@ static inline void rseq_execve(struct ta /* * If parent process has a registered restartable sequences area, the * child inherits. Unregister rseq for a clone with CLONE_VM set. + * + * On fork, keep the IDs (CPU, MMCID) of the parent, which avoids a fault + * on the COW page on exit to user space, when the child stays on the same + * CPU as the parent. That's obviously not guaranteed, but in overcommit + * scenarios it is more likely and optimizes for the fork/exec case without + * taking the fault. 
*/ static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { - if (clone_flags & CLONE_VM) { + if (clone_flags & CLONE_VM) rseq_reset(t); - } else { + else t->rseq =3D current->rseq; - t->rseq.ids.cpu_cid =3D ~0ULL; - } } =20 #else /* CONFIG_RSEQ */ --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -75,6 +75,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB #endif =20 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs= , unsigned long csaddr); +bool rseq_debug_validate_ids(struct task_struct *t); =20 static __always_inline void rseq_note_user_irq_entry(void) { @@ -194,6 +195,43 @@ bool rseq_debug_update_user_cs(struct ta return false; } =20 +/* + * On debug kernels validate that user space did not mess with it if the + * debug branch is enabled. + */ +bool rseq_debug_validate_ids(struct task_struct *t) +{ + struct rseq __user *rseq =3D t->rseq.usrptr; + u32 cpu_id, uval, node_id; + + /* + * On the first exit after registering the rseq region CPU ID is + * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0! + */ + node_id =3D t->rseq.ids.cpu_id !=3D RSEQ_CPU_ID_UNINITIALIZED ? + cpu_to_node(t->rseq.ids.cpu_id) : 0; + + scoped_user_read_access(rseq, efault) { + unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault); + if (cpu_id !=3D t->rseq.ids.cpu_id) + goto die; + unsafe_get_user(uval, &rseq->cpu_id, efault); + if (uval !=3D cpu_id) + goto die; + unsafe_get_user(uval, &rseq->node_id, efault); + if (uval !=3D node_id) + goto die; + unsafe_get_user(uval, &rseq->mm_cid, efault); + if (uval !=3D t->rseq.ids.mm_cid) + goto die; + } + return true; +die: + t->rseq.event.fatal =3D true; +efault: + return false; +} + #endif /* RSEQ_BUILD_SLOW_PATH */ =20 /* @@ -278,6 +316,57 @@ rseq_update_user_cs(struct task_struct * efault: return false; } + +/* + * Updates CPU ID, Node ID and MM CID and reads the critical section + * address, when @csaddr !=3D NULL. This allows to put the ID update and t= he + * read under the same uaccess region to spare a separate begin/end. + * + * As this is either invoked from a C wrapper with @csaddr =3D NULL or from + * the fast path code with a valid pointer, a clever compiler should be + * able to optimize the read out. Spares a duplicate implementation. + * + * Returns true, if the operation was successful, false otherwise. + * + * In the failure case task::rseq_event::fatal is set when invalid data + * was found on debug kernels. It's clear when the failure was an unresolv= ed page + * fault. + * + * If inlined into the exit to user path with interrupts disabled, the + * caller has to protect against page faults with pagefault_disable(). + * + * In preemptible task context this would be counterproductive as the page + * faults could not be fully resolved. As a consequence unresolved page + * faults in task context are fatal too. 
+ */ +static rseq_inline +bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids, + u32 node_id, u64 *csaddr) +{ + struct rseq __user *rseq =3D t->rseq.usrptr; + + if (static_branch_unlikely(&rseq_debug_enabled)) { + if (!rseq_debug_validate_ids(t)) + return false; + } + + scoped_user_rw_access(rseq, efault) { + unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault); + unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault); + unsafe_put_user(node_id, &rseq->node_id, efault); + unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault); + if (csaddr) + unsafe_get_user(*csaddr, &rseq->rseq_cs, efault); + } + + /* Cache the new values */ + t->rseq.ids.cpu_cid =3D ids->cpu_cid; + rseq_stat_inc(rseq_stats.ids); + rseq_trace_update(t, ids); + return true; +efault: + return false; +} =20 static __always_inline void rseq_exit_to_user_mode(void) { --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -42,7 +42,6 @@ #include #include #include -#include #include #include #include @@ -1408,15 +1407,6 @@ struct task_struct { #endif /* CONFIG_NUMA_BALANCING */ =20 struct rseq_data rseq; -#ifdef CONFIG_DEBUG_RSEQ - /* - * This is a place holder to save a copy of the rseq fields for - * validation of read-only fields. The struct rseq has a - * variable-length array at the end, so it cannot be used - * directly. Reserve a size large enough for the known fields. - */ - char rseq_fields[sizeof(struct rseq)]; -#endif =20 #ifdef CONFIG_SCHED_MM_CID int mm_cid; /* Current cid in mm */ --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -88,13 +88,6 @@ # define RSEQ_EVENT_GUARD preempt #endif =20 -/* The original rseq structure size (including padding) is 32 bytes. */ -#define ORIG_RSEQ_SIZE 32 - -#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \ - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \ - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE) - DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabl= ed); =20 static inline void rseq_control_debug(bool on) @@ -227,159 +220,9 @@ static int __init rseq_debugfs_init(void __initcall(rseq_debugfs_init); #endif /* CONFIG_DEBUG_FS */ =20 -#ifdef CONFIG_DEBUG_RSEQ -static struct rseq *rseq_kernel_fields(struct task_struct *t) -{ - return (struct rseq *) t->rseq_fields; -} - -static int rseq_validate_ro_fields(struct task_struct *t) -{ - static DEFINE_RATELIMIT_STATE(_rs, - DEFAULT_RATELIMIT_INTERVAL, - DEFAULT_RATELIMIT_BURST); - u32 cpu_id_start, cpu_id, node_id, mm_cid; - struct rseq __user *rseq =3D t->rseq.usrptr; - - /* - * Validate fields which are required to be read-only by - * user-space. 
- */ - if (!user_read_access_begin(rseq, t->rseq.len)) - goto efault; - unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end); - unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end); - unsafe_get_user(node_id, &rseq->node_id, efault_end); - unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end); - user_read_access_end(); - - if ((cpu_id_start !=3D rseq_kernel_fields(t)->cpu_id_start || - cpu_id !=3D rseq_kernel_fields(t)->cpu_id || - node_id !=3D rseq_kernel_fields(t)->node_id || - mm_cid !=3D rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) { - - pr_warn("Detected rseq corruption for pid: %d, name: %s\n" - "\tcpu_id_start: %u ?=3D %u\n" - "\tcpu_id: %u ?=3D %u\n" - "\tnode_id: %u ?=3D %u\n" - "\tmm_cid: %u ?=3D %u\n", - t->pid, t->comm, - cpu_id_start, rseq_kernel_fields(t)->cpu_id_start, - cpu_id, rseq_kernel_fields(t)->cpu_id, - node_id, rseq_kernel_fields(t)->node_id, - mm_cid, rseq_kernel_fields(t)->mm_cid); - } - - /* For now, only print a console warning on mismatch. */ - return 0; - -efault_end: - user_read_access_end(); -efault: - return -EFAULT; -} - -/* - * Update an rseq field and its in-kernel copy in lock-step to keep a cohe= rent - * state. - */ -#define rseq_unsafe_put_user(t, value, field, error_label) \ - do { \ - unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \ - rseq_kernel_fields(t)->field =3D value; \ - } while (0) - -#else -static int rseq_validate_ro_fields(struct task_struct *t) +static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 = node_id) { - return 0; -} - -#define rseq_unsafe_put_user(t, value, field, error_label) \ - unsafe_put_user(value, &t->rseq.usrptr->field, error_label) -#endif - -static int rseq_update_cpu_node_id(struct task_struct *t) -{ - struct rseq __user *rseq =3D t->rseq.usrptr; - u32 cpu_id =3D raw_smp_processor_id(); - u32 node_id =3D cpu_to_node(cpu_id); - u32 mm_cid =3D task_mm_cid(t); - - rseq_stat_inc(rseq_stats.ids); - - /* Validate read-only rseq fields on debug kernels */ - if (rseq_validate_ro_fields(t)) - goto efault; - WARN_ON_ONCE((int) mm_cid < 0); - - if (!user_write_access_begin(rseq, t->rseq.len)) - goto efault; - - rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end); - rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end); - rseq_unsafe_put_user(t, node_id, node_id, efault_end); - rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end); - - /* Cache the user space values */ - t->rseq.ids.cpu_id =3D cpu_id; - t->rseq.ids.mm_cid =3D mm_cid; - - /* - * Additional feature fields added after ORIG_RSEQ_SIZE - * need to be conditionally updated only if - * t->rseq_len !=3D ORIG_RSEQ_SIZE. - */ - user_write_access_end(); - trace_rseq_update(t); - return 0; - -efault_end: - user_write_access_end(); -efault: - return -EFAULT; -} - -static int rseq_reset_rseq_cpu_node_id(struct task_struct *t) -{ - struct rseq __user *rseq =3D t->rseq.usrptr; - u32 cpu_id_start =3D 0, cpu_id =3D RSEQ_CPU_ID_UNINITIALIZED, node_id =3D= 0, - mm_cid =3D 0; - - /* - * Validate read-only rseq fields. - */ - if (rseq_validate_ro_fields(t)) - goto efault; - - if (!user_write_access_begin(rseq, t->rseq.len)) - goto efault; - - /* - * Reset all fields to their initial state. - * - * All fields have an initial state of 0 except cpu_id which is set to - * RSEQ_CPU_ID_UNINITIALIZED, so that any user coming in after - * unregistration can figure out that rseq needs to be registered - * again. 
- */ - rseq_unsafe_put_user(t, cpu_id_start, cpu_id_start, efault_end); - rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end); - rseq_unsafe_put_user(t, node_id, node_id, efault_end); - rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end); - - /* - * Additional feature fields added after ORIG_RSEQ_SIZE - * need to be conditionally reset only if - * t->rseq_len !=3D ORIG_RSEQ_SIZE. - */ - user_write_access_end(); - return 0; - -efault_end: - user_write_access_end(); -efault: - return -EFAULT; + return rseq_set_ids_get_csaddr(t, ids, node_id, NULL); } =20 static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs) @@ -410,6 +253,8 @@ static bool rseq_handle_cs(struct task_s void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *reg= s) { struct task_struct *t =3D current; + struct rseq_ids ids; + u32 node_id; bool event; int sig; =20 @@ -456,6 +301,8 @@ void __rseq_handle_notify_resume(struct scoped_guard(RSEQ_EVENT_GUARD) { event =3D t->rseq.event.sched_switch; t->rseq.event.sched_switch =3D false; + ids.cpu_id =3D task_cpu(t); + ids.mm_cid =3D task_mm_cid(t); } =20 if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event) @@ -464,7 +311,8 @@ void __rseq_handle_notify_resume(struct if (!rseq_handle_cs(t, regs)) goto error; =20 - if (unlikely(rseq_update_cpu_node_id(t))) + node_id =3D cpu_to_node(ids.cpu_id); + if (!rseq_set_ids(t, &ids, node_id)) goto error; return; =20 @@ -504,13 +352,33 @@ void rseq_syscall(struct pt_regs *regs) } #endif =20 +static bool rseq_reset_ids(void) +{ + struct rseq_ids ids =3D { + .cpu_id =3D RSEQ_CPU_ID_UNINITIALIZED, + .mm_cid =3D 0, + }; + + /* + * If this fails, terminate it because this leaves the kernel in + * stupid state as exit to user space will try to fixup the ids + * again. + */ + if (rseq_set_ids(current, &ids, 0)) + return true; + + force_sig(SIGSEGV); + return false; +} + +/* The original rseq structure size (including padding) is 32 bytes. */ +#define ORIG_RSEQ_SIZE 32 + /* * sys_rseq - setup restartable sequences for caller thread. */ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flag= s, u32, sig) { - int ret; - if (flags & RSEQ_FLAG_UNREGISTER) { if (flags & ~RSEQ_FLAG_UNREGISTER) return -EINVAL; @@ -521,9 +389,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user return -EINVAL; if (current->rseq.sig !=3D sig) return -EPERM; - ret =3D rseq_reset_rseq_cpu_node_id(current); - if (ret) - return ret; + if (!rseq_reset_ids()) + return -EFAULT; rseq_reset(current); return 0; } @@ -563,27 +430,23 @@ SYSCALL_DEFINE4(rseq, struct rseq __user if (!access_ok(rseq, rseq_len)) return -EFAULT; =20 - /* - * If the rseq_cs pointer is non-NULL on registration, clear it to - * avoid a potential segfault on return to user-space. The proper thing - * to do would have been to fail the registration but this would break - * older libcs that reuse the rseq area for new threads without - * clearing the fields. Don't bother reading it, just reset it. - */ - if (put_user(0UL, &rseq->rseq_cs)) - return -EFAULT; + scoped_user_write_access(rseq, efault) { + /* + * If the rseq_cs pointer is non-NULL on registration, clear it to + * avoid a potential segfault on return to user-space. The proper thing + * to do would have been to fail the registration but this would break + * older libcs that reuse the rseq area for new threads without + * clearing the fields. Don't bother reading it, just reset it. 
+	 */
+		unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+		/* Initialize IDs in user space */
+		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
+		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
+		unsafe_put_user(0U, &rseq->node_id, efault);
+		unsafe_put_user(0U, &rseq->mm_cid, efault);
+	}
 
-#ifdef CONFIG_DEBUG_RSEQ
-	/*
-	 * Initialize the in-kernel rseq fields copy for validation of
-	 * read-only fields.
-	 */
-	if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
-	    get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
-	    get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
-	    get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
-		return -EFAULT;
-#endif
 	/*
	 * Activate the registration by setting the rseq area address, length
	 * and signature in the task struct.
@@ -599,6 +462,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
	 */
 	current->rseq.event.has_rseq = true;
 	rseq_sched_switch_event(current);
 	return 0;
+
+efault:
+	return -EFAULT;
 }

From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084307.455429038@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
Subject: [patch V6 24/31] rseq: Separate the signal delivery path
References: <20251027084220.785525188@linutronix.de>
Date: Mon, 27 Oct 2025 09:45:10 +0100 (CET)

Completely separate the signal delivery path from the notify handler as
they have different semantics versus the event handling.

The signal delivery only needs to ensure that the interrupted user context
was not in a critical section, or that the section is aborted before it
switches to the signal frame context. The signal frame context does not
have the original instruction pointer anymore, so that can't be handled on
exit to user space. No point in updating the CPU/CID IDs either, as they
might change again before the task returns to user space for real.

The fast path optimization, which checks for the 'entry from user via
interrupt' condition, is only available for architectures which use the
generic entry code.
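An aside on the check used in the hunk below: the bitwise '&' is valid
because both event fields are 0/1-valued bytes. A minimal sketch of the
idea (illustrative only, assuming the struct rseq_event layout from
rseq_types.h; the helper name is made up, not part of the patch):

	/*
	 * With '&&' the compiler has to preserve short-circuit
	 * evaluation, which typically costs two conditional branches.
	 * With '&' on two 0/1-valued u8 fields both loads happen
	 * unconditionally and a single test-and-branch remains.
	 */
	static inline bool rseq_deliver_wanted(const struct rseq_event *ev)
	{
		return ev->has_rseq & ev->user_irq;
	}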
Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
V3: Move rseq_update_usr() to the next patch - Mathieu
---
 include/linux/rseq.h | 21 ++++++++++++++++-----
 kernel/rseq.c | 30 ++++++++++++++++++++++--------
 2 files changed, 38 insertions(+), 13 deletions(-)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,22 +7,33 @@
 
 #include
 
-void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
+void __rseq_handle_notify_resume(struct pt_regs *regs);
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
 	if (current->rseq.event.has_rseq)
-		__rseq_handle_notify_resume(NULL, regs);
+		__rseq_handle_notify_resume(regs);
 }
 
+void __rseq_signal_deliver(int sig, struct pt_regs *regs);
+
+/*
+ * Invoked from signal delivery to fixup based on the register context before
+ * switching to the signal delivery context.
+ */
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
-	if (current->rseq.event.has_rseq) {
-		current->rseq.event.sched_switch = true;
-		__rseq_handle_notify_resume(ksig, regs);
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+		/* '&' is intentional to spare one conditional branch */
+		if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
+			__rseq_signal_deliver(ksig->sig, regs);
+	} else {
+		if (current->rseq.event.has_rseq)
+			__rseq_signal_deliver(ksig->sig, regs);
 	}
 }
 
+/* Raised from context switch and execve to force evaluation on exit to user */
 static inline void rseq_sched_switch_event(struct task_struct *t)
 {
 	if (t->rseq.event.has_rseq) {
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -250,13 +250,12 @@ static bool rseq_handle_cs(struct task_s
 * respect to other threads scheduled on the same CPU, and with respect
 * to signal handlers.
 */
-void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
+void __rseq_handle_notify_resume(struct pt_regs *regs)
 {
 	struct task_struct *t = current;
 	struct rseq_ids ids;
 	u32 node_id;
 	bool event;
-	int sig;
 
 	/*
	 * If invoked from hypervisors before entering the guest via
@@ -275,10 +274,7 @@ void __rseq_handle_notify_resume(struct
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
-	if (ksig)
-		rseq_stat_inc(rseq_stats.signal);
-	else
-		rseq_stat_inc(rseq_stats.slowpath);
+	rseq_stat_inc(rseq_stats.slowpath);
 
 	/*
	 * Read and clear the event pending bit first. If the task
@@ -317,8 +313,26 @@ void __rseq_handle_notify_resume(struct
 		return;
 
 error:
-	sig = ksig ? ksig->sig : 0;
-	force_sigsegv(sig);
+	force_sig(SIGSEGV);
+}
+
+void __rseq_signal_deliver(int sig, struct pt_regs *regs)
+{
+	rseq_stat_inc(rseq_stats.signal);
+	/*
+	 * Don't update IDs, they are handled on exit to user if
+	 * necessary. The important thing is to abort a critical section of
+	 * the interrupted context as after this point the instruction
+	 * pointer in @regs points to the signal handler.
+	 */
+	if (unlikely(!rseq_handle_cs(current, regs))) {
+		/*
+		 * Clear the errors just in case this might survive
+		 * magically, but leave the rest intact.
+		 */
+		current->rseq.event.error = 0;
+		force_sigsegv(sig);
+	}
 }
 
 /*

From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084307.517640811@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 25/31] rseq: Rework the TIF_NOTIFY handler References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:45:12 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Replace the whole logic with a new implementation, which is shared with signal delivery and the upcoming exit fast path. Contrary to the original implementation, this ignores invocations from KVM/IO-uring, which invoke resume_user_mode_work() with the @regs argument set to NULL. The original implementation updated the CPU/Node/MM CID fields, but that was just a side effect, which was addressing the problem that this invocation cleared TIF_NOTIFY_RESUME, which in turn could cause an update on return to user space to be lost. This problem has been addressed differently, so that it's not longer required to do that update before entering the guest. That might be considered a user visible change, when the hosts thread TLS memory is mapped into the guest, but as this was never intentionally supported, this abuse of kernel internal implementation details is not considered an ABI break. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- V3: Moved rseq_update_usr() to this one - Mathieu Documented the KVM visible change - Sean --- include/linux/rseq_entry.h | 29 +++++++++++++++++ kernel/rseq.c | 76 +++++++++++++++++++---------------------= ----- 2 files changed, 62 insertions(+), 43 deletions(-) --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -368,6 +368,35 @@ bool rseq_set_ids_get_csaddr(struct task return false; } =20 +/* + * Update user space with new IDs and conditionally check whether the task + * is in a critical section. + */ +static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_r= egs *regs, + struct rseq_ids *ids, u32 node_id) +{ + u64 csaddr; + + if (!rseq_set_ids_get_csaddr(t, ids, node_id, &csaddr)) + return false; + + /* + * On architectures which utilize the generic entry code this + * allows to skip the critical section when the entry was not from + * a user space interrupt, unless debug mode is enabled. + */ + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + if (!static_branch_unlikely(&rseq_debug_enabled)) { + if (likely(!t->rseq.event.user_irq)) + return true; + } + } + if (likely(!csaddr)) + return true; + /* Sigh, this really needs to do work */ + return rseq_update_user_cs(t, regs, csaddr); +} + static __always_inline void rseq_exit_to_user_mode(void) { struct rseq_event *ev =3D ¤t->rseq.event; --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -82,12 +82,6 @@ #define CREATE_TRACE_POINTS #include =20 -#ifdef CONFIG_MEMBARRIER -# define RSEQ_EVENT_GUARD irq -#else -# define RSEQ_EVENT_GUARD preempt -#endif - DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabl= ed); =20 static inline void rseq_control_debug(bool on) @@ -239,38 +233,15 @@ static bool rseq_handle_cs(struct task_s return false; } =20 -/* - * This resume handler must always be executed between any of: - * - preemption, - * - signal delivery, - * and return to user-space. - * - * This is how we can ensure that the entire rseq critical section - * will issue the commit instruction only if executed atomically with - * respect to other threads scheduled on the same CPU, and with respect - * to signal handlers. 
- */ -void __rseq_handle_notify_resume(struct pt_regs *regs) +static void rseq_slowpath_update_usr(struct pt_regs *regs) { + /* Preserve rseq state and user_irq state for exit to user */ + const struct rseq_event evt_mask =3D { .has_rseq =3D true, .user_irq =3D = true, }; struct task_struct *t =3D current; struct rseq_ids ids; u32 node_id; bool event; =20 - /* - * If invoked from hypervisors before entering the guest via - * resume_user_mode_work(), then @regs is a NULL pointer. - * - * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises - * it before returning from the ioctl() to user space when - * rseq_event.sched_switch is set. - * - * So it's safe to ignore here instead of pointlessly updating it - * in the vcpu_run() loop. - */ - if (!regs) - return; - if (unlikely(t->flags & PF_EXITING)) return; =20 @@ -294,26 +265,45 @@ void __rseq_handle_notify_resume(struct * with the result handed in to allow the detection of * inconsistencies. */ - scoped_guard(RSEQ_EVENT_GUARD) { + scoped_guard(irq) { event =3D t->rseq.event.sched_switch; - t->rseq.event.sched_switch =3D false; + t->rseq.event.all &=3D evt_mask.all; ids.cpu_id =3D task_cpu(t); ids.mm_cid =3D task_mm_cid(t); } =20 - if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event) + if (!event) return; =20 - if (!rseq_handle_cs(t, regs)) - goto error; - node_id =3D cpu_to_node(ids.cpu_id); - if (!rseq_set_ids(t, &ids, node_id)) - goto error; - return; =20 -error: - force_sig(SIGSEGV); + if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) { + /* + * Clear the errors just in case this might survive magically, but + * leave the rest intact. + */ + t->rseq.event.error =3D 0; + force_sig(SIGSEGV); + } +} + +void __rseq_handle_notify_resume(struct pt_regs *regs) +{ + /* + * If invoked from hypervisors before entering the guest via + * resume_user_mode_work(), then @regs is a NULL pointer. + * + * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises + * it before returning from the ioctl() to user space when + * rseq_event.sched_switch is set. + * + * So it's safe to ignore here instead of pointlessly updating it + * in the vcpu_run() loop. 
+ */
+	if (!regs)
+		return;
+
+	rseq_slowpath_update_usr(regs);
 }
 
 void __rseq_signal_deliver(int sig, struct pt_regs *regs)

From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084307.578058898@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 26/31] rseq: Optimize event setting References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:45:14 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" After removing the various condition bits earlier it turns out that one extra information is needed to avoid setting event::sched_switch and TIF_NOTIFY_RESUME unconditionally on every context switch. The update of the RSEQ user space memory is only required, when either the task was interrupted in user space and schedules or the CPU or MM CID changes in schedule() independent of the entry mode Right now only the interrupt from user information is available. Add a event flag, which is set when the CPU or MM CID or both change. Evaluate this event in the scheduler to decide whether the sched_switch event and the TIF bit need to be set. It's an extra conditional in context_switch(), but the downside of unconditionally handling RSEQ after a context switch to user is way more significant. The utilized boolean logic minimizes this to a single conditional branch. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- fs/exec.c | 2 - include/linux/rseq.h | 81 ++++++++++++++++++++++++++++++++++++++++= +---- include/linux/rseq_types.h | 11 +++++- kernel/rseq.c | 2 - kernel/sched/core.c | 7 +++ kernel/sched/sched.h | 5 ++ 6 files changed, 95 insertions(+), 13 deletions(-) --- a/fs/exec.c +++ b/fs/exec.c @@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp force_fatal_sig(SIGSEGV); =20 sched_mm_cid_after_execve(current); - rseq_sched_switch_event(current); + rseq_force_update(); current->in_execve =3D 0; =20 return retval; --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -11,7 +11,8 @@ void __rseq_handle_notify_resume(struct =20 static inline void rseq_handle_notify_resume(struct pt_regs *regs) { - if (current->rseq.event.has_rseq) + /* '&' is intentional to spare one conditional branch */ + if (current->rseq.event.sched_switch & current->rseq.event.has_rseq) __rseq_handle_notify_resume(regs); } =20 @@ -33,12 +34,75 @@ static inline void rseq_signal_deliver(s } } =20 -/* Raised from context switch and exevce to force evaluation on exit to us= er */ -static inline void rseq_sched_switch_event(struct task_struct *t) +static inline void rseq_raise_notify_resume(struct task_struct *t) { - if (t->rseq.event.has_rseq) { - t->rseq.event.sched_switch =3D true; - set_tsk_thread_flag(t, TIF_NOTIFY_RESUME); + set_tsk_thread_flag(t, TIF_NOTIFY_RESUME); +} + +/* Invoked from context switch to force evaluation on exit to user */ +static __always_inline void rseq_sched_switch_event(struct task_struct *t) +{ + struct rseq_event *ev =3D &t->rseq.event; + + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + /* + * Avoid a boat load of conditionals by using simple logic + * to determine whether NOTIFY_RESUME needs to be raised. + * + * It's required when the CPU or MM CID has changed or + * the entry was from user space. + */ + bool raise =3D (ev->user_irq | ev->ids_changed) & ev->has_rseq; + + if (raise) { + ev->sched_switch =3D true; + rseq_raise_notify_resume(t); + } + } else { + if (ev->has_rseq) { + t->rseq.event.sched_switch =3D true; + rseq_raise_notify_resume(t); + } + } +} + +/* + * Invoked from __set_task_cpu() when a task migrates to enforce an IDs + * update. 
+ * + * This does not raise TIF_NOTIFY_RESUME as that happens in + * rseq_sched_switch_event(). + */ +static __always_inline void rseq_sched_set_task_cpu(struct task_struct *t,= unsigned int cpu) +{ + t->rseq.event.ids_changed =3D true; +} + +/* + * Invoked from switch_mm_cid() in context switch when the task gets a MM + * CID assigned. + * + * This does not raise TIF_NOTIFY_RESUME as that happens in + * rseq_sched_switch_event(). + */ +static __always_inline void rseq_sched_set_task_mm_cid(struct task_struct = *t, unsigned int cid) +{ + /* + * Requires a comparison as the switch_mm_cid() code does not + * provide a conditional for it readily. So avoid excessive updates + * when nothing changes. + */ + if (t->rseq.ids.mm_cid !=3D cid) + t->rseq.event.ids_changed =3D true; +} + +/* Enforce a full update after RSEQ registration and when execve() failed = */ +static inline void rseq_force_update(void) +{ + if (current->rseq.event.has_rseq) { + current->rseq.event.ids_changed =3D true; + current->rseq.event.sched_switch =3D true; + rseq_raise_notify_resume(current); } } =20 @@ -55,7 +119,7 @@ static inline void rseq_sched_switch_eve static inline void rseq_virt_userspace_exit(void) { if (current->rseq.event.sched_switch) - set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); + rseq_raise_notify_resume(current); } =20 static inline void rseq_reset(struct task_struct *t) @@ -91,6 +155,9 @@ static inline void rseq_fork(struct task static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct = pt_regs *regs) { } static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_reg= s *regs) { } static inline void rseq_sched_switch_event(struct task_struct *t) { } +static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned= int cpu) { } +static inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsig= ned int cid) { } +static inline void rseq_force_update(void) { } static inline void rseq_virt_userspace_exit(void) { } static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { } static inline void rseq_execve(struct task_struct *t) { } --- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -11,20 +11,27 @@ struct rseq; * struct rseq_event - Storage for rseq related event management * @all: Compound to initialize and clear the data efficiently * @events: Compound to access events with a single load/store - * @sched_switch: True if the task was scheduled out + * @sched_switch: True if the task was scheduled and needs update on + * exit to user + * @ids_changed: Indicator that IDs need to be updated * @user_irq: True on interrupt entry from user mode * @has_rseq: True if the task has a rseq pointer installed * @error: Compound error code for the slow path to analyze * @fatal: User space data corrupted or invalid + * + * @sched_switch and @ids_changed must be adjacent and the combo must be + * 16bit aligned to allow a single store, when both are set at the same + * time in the scheduler. */ struct rseq_event { union { u64 all; struct { union { - u16 events; + u32 events; struct { u8 sched_switch; + u8 ids_changed; u8 user_irq; }; }; --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -465,7 +465,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user * are updated before returning to user-space. 
	 */
 	current->rseq.event.has_rseq = true;
-	rseq_sched_switch_event(current);
+	rseq_force_update();
 	return 0;
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5118,7 +5118,6 @@ prepare_task_switch(struct rq *rq, struc
 	kcov_prepare_switch(prev);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
-	rseq_sched_switch_event(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
@@ -5316,6 +5315,12 @@ context_switch(struct rq *rq, struct tas
 	/* switch_mm_cid() requires the memory barriers above. */
 	switch_mm_cid(rq, prev, next);
 
+	/*
+	 * Tell rseq that the task was scheduled in. Must be after
+	 * switch_mm_cid() to get the TIF flag set.
+	 */
+	rseq_sched_switch_event(next);
+
 	prepare_lock_switch(rq, next, rf);
 
 	/* Here we just switch the register state and the stack. */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2208,6 +2208,7 @@ static inline void __set_task_cpu(struct
 	smp_wmb();
 	WRITE_ONCE(task_thread_info(p)->cpu, cpu);
 	p->wake_cpu = cpu;
+	rseq_sched_set_task_cpu(p, cpu);
 #endif /* CONFIG_SMP */
 }
 
@@ -3808,8 +3809,10 @@ static inline void switch_mm_cid(struct
 		mm_cid_put_lazy(prev);
 		prev->mm_cid = -1;
 	}
-	if (next->mm_cid_active)
+	if (next->mm_cid_active) {
 		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
+		rseq_sched_set_task_mm_cid(next, next->mm_cid);
+	}
 }
 
 #else /* !CONFIG_SCHED_MM_CID: */

From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084307.638929615@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
Subject: [patch V6 27/31] rseq: Implement fast path for exit to user
References: <20251027084220.785525188@linutronix.de>
Date: Mon, 27 Oct 2025 09:45:17 +0100 (CET)

Implement the actual logic for handling RSEQ updates in a fast path after
handling the TIF work and at the point where the task is actually
returning to user space.

This is the right point to do that because at this point the CPU and the
MM CID are stable and can no longer change due to yet another
reschedule. That can still happen when the task is handling it via
TIF_NOTIFY_RESUME in resume_user_mode_work(), which is invoked from the
exit to user mode work loop.

The function is invoked after the TIF work is handled and runs with
interrupts disabled, which means it cannot resolve page faults. It
therefore disables page faults and in case the access to the user space
memory faults, it:

  - notes the fail in the event struct
  - raises TIF_NOTIFY_RESUME
  - returns false to the caller

The caller has to go back to the TIF work, which runs with interrupts
enabled and therefore can resolve the page faults. This happens mostly on
fork() when the memory is marked COW. If the user memory inspection finds
invalid data, the function returns false as well and sets the fatal flag
in the event struct along with TIF_NOTIFY_RESUME. The slow path notify
handler has to evaluate that flag and terminate the task with SIGSEGV as
documented.

The initial decision to invoke any of this is based on one flag in the
event struct: @sched_switch. The decision in pseudo ASM:

	load	tsk::event::sched_switch
	jnz	inspect_user_space
	mov	$0, tsk::event::events
	...
	leave

So for the common case where the task was not scheduled out, this really
boils down to three instructions before going out if the compiler is not
completely stupid (and yes, some of them are).

If the condition is true, then it checks whether CPU ID or MM CID have
changed. If so, then the CPU/MM IDs have to be updated and are thereby
cached for the next round. The update unconditionally retrieves the user
space critical section address to spare another user*begin/end() pair. If
that's not zero and tsk::event::user_irq is set, then the critical section
is analyzed and acted upon.
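The same decision flow as a C sketch (illustrative only; the helper names
update_ids_and_read_cs(), read_cs_only() and analyze_critical_section()
are placeholders for the real functions in the diff below):

	/* Fast path decision on exit to user, as described above */
	if (!t->rseq.event.sched_switch) {
		t->rseq.event.events = 0;	/* common case: clear and leave */
		return;
	}
	if (t->rseq.event.ids_changed) {
		/* Four 32-bit stores plus one 64-bit load */
		update_ids_and_read_cs(t, &csaddr);
		/* Critical sections only matter for interrupt entry */
		if (csaddr && t->rseq.event.user_irq)
			analyze_critical_section(t, regs, csaddr);
	} else {
		/* No ID change: the event implies an entry by interrupt */
		read_cs_only(t, &csaddr);
		if (csaddr)
			analyze_critical_section(t, regs, csaddr);
	}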
If the address is zero or the entry came via syscall, the critical section
analysis is skipped. If the CPU ID/MM CID comparison finds no change, the
critical section has to be analyzed, because the event flag can then only
be true when the entry from user space happened by interrupt.

This is provided without the actual hookup to let reviewers focus on the
implementation details. The hookup happens in the next step.

Note: As with quite a few other optimizations this depends on the generic
entry infrastructure and is not enabled to be sucked into random
architecture implementations.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
V5: Reduce the decision to event::sched_switch - Mathieu
---
 include/linux/rseq_entry.h | 131 ++++++++++++++++++++++++++++++++++++---
 include/linux/rseq_types.h | 3 +
 kernel/rseq.c | 2
 3 files changed, 133 insertions(+), 3 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -10,6 +10,7 @@ struct rseq_stats {
 	unsigned long exit;
 	unsigned long signal;
 	unsigned long slowpath;
+	unsigned long fastpath;
 	unsigned long ids;
 	unsigned long cs;
 	unsigned long clear;
@@ -245,12 +246,13 @@ rseq_update_user_cs(struct task_struct *
 {
 	struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
 	unsigned long ip = instruction_pointer(regs);
+	unsigned long tasksize = TASK_SIZE;
 	u64 start_ip, abort_ip, offset;
 	u32 usig, __user *uc_sig;
 
 	rseq_stat_inc(rseq_stats.cs);
 
-	if (unlikely(csaddr >= TASK_SIZE)) {
+	if (unlikely(csaddr >= tasksize)) {
 		t->rseq.event.fatal = true;
 		return false;
 	}
@@ -287,7 +289,7 @@ rseq_update_user_cs(struct task_struct *
 	 * in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
	 * protection.
	 */
-	if (abort_ip >= TASK_SIZE || abort_ip < sizeof(*uc_sig))
+	if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
 		goto die;
 
 	/* The address is guaranteed to be >= 0 and < TASK_SIZE */
@@ -397,6 +399,126 @@ static rseq_inline bool rseq_update_usr(
 	return rseq_update_user_cs(t, regs, csaddr);
 }
 
+/*
+ * If you want to use this then convert your architecture to the generic
+ * entry code. I'm tired of building workarounds for people who can't be
+ * bothered to make the maintenance of generic infrastructure less
+ * burdensome. Just sucking everything into the architecture code and
+ * thereby making others chase the horrible hacks and keep them working is
+ * neither acceptable nor sustainable.
+ */
+#ifdef CONFIG_GENERIC_ENTRY
+
+/*
+ * This is inlined into the exit path because:
+ *
+ * 1) It's a one time comparison in the fast path when there is no event to
+ *    handle
+ *
+ * 2) The access to the user space rseq memory (TLS) is unlikely to fault
+ *    so the straight inline operation is:
+ *
+ *    - Four 32-bit stores only if CPU ID/ MM CID need to be updated
+ *    - One 64-bit load to retrieve the critical section address
+ *
+ * 3) In the unlikely case that the critical section address is != NULL:
+ *
+ *    - One 64-bit load to retrieve the start IP
+ *    - One 64-bit load to retrieve the offset for calculating the end
+ *    - One 64-bit load to retrieve the abort IP
+ *    - One 64-bit load to retrieve the signature
+ *    - One store to clear the critical section address
+ *
+ * The non-debug case implements only the minimal required checking. It
+ * provides protection against a rogue abort IP in kernel space, which
+ * would be exploitable at least on x86, and also against a rogue CS
+ * descriptor by checking the signature at the abort IP. Any fallout from
+ * invalid critical section descriptors is a user space problem. The debug
+ * case provides the full set of checks and terminates the task if a
+ * condition is not met.
+ *
+ * In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
+ * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
+ * slow path there will handle the fail.
+ */
+static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct task_struct *t)
+{
+	/*
+	 * Page faults need to be disabled as this is called with
+	 * interrupts disabled
+	 */
+	guard(pagefault)();
+	if (likely(!t->rseq.event.ids_changed)) {
+		struct rseq __user *rseq = t->rseq.usrptr;
+		/*
+		 * If IDs have not changed rseq_event::user_irq must be true
+		 * See rseq_sched_switch_event().
+		 */
+		u64 csaddr;
+
+		if (unlikely(!get_user_inline(csaddr, &rseq->rseq_cs)))
+			return false;
+
+		if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
+			if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
+				return false;
+		}
+		return true;
+	}
+
+	struct rseq_ids ids = {
+		.cpu_id = task_cpu(t),
+		.mm_cid = task_mm_cid(t),
+	};
+	u32 node_id = cpu_to_node(ids.cpu_id);
+
+	return rseq_update_usr(t, regs, &ids, node_id);
+}
+
+static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+	struct task_struct *t = current;
+
+	/*
+	 * If the task did not go through schedule or got the flag enforced
+	 * by the rseq syscall or execve, then nothing to do here.
+	 *
+	 * CPU ID and MM CID can only change when going through a context
+	 * switch.
+	 *
+	 * rseq_sched_switch_event() sets the rseq_event::sched_switch bit
+	 * only when rseq_event::has_rseq is true. That conditional is
+	 * required to avoid setting the TIF bit if RSEQ is not registered
+	 * for a task. rseq_event::sched_switch is cleared when RSEQ is
+	 * unregistered by a task so it's sufficient to check for the
+	 * sched_switch bit alone.
+	 *
+	 * A sane compiler requires three instructions for the nothing to do
+	 * case including clearing the events, but your mileage might vary.
+static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct task_struct *t)
+{
+	/*
+	 * Page faults need to be disabled as this is called with
+	 * interrupts disabled
+	 */
+	guard(pagefault)();
+	if (likely(!t->rseq.event.ids_changed)) {
+		struct rseq __user *rseq = t->rseq.usrptr;
+		/*
+		 * If the IDs have not changed, rseq_event::user_irq must be
+		 * true. See rseq_sched_switch_event().
+		 */
+		u64 csaddr;

+		if (unlikely(!get_user_inline(csaddr, &rseq->rseq_cs)))
+			return false;

+		if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
+			if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
+				return false;
+		}
+		return true;
+	}

+	struct rseq_ids ids = {
+		.cpu_id	= task_cpu(t),
+		.mm_cid	= task_mm_cid(t),
+	};
+	u32 node_id = cpu_to_node(ids.cpu_id);

+	return rseq_update_usr(t, regs, &ids, node_id);
+}

+static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+	struct task_struct *t = current;

+	/*
+	 * If the task did not go through schedule or got the flag enforced
+	 * by the rseq syscall or execve, then nothing to do here.
+	 *
+	 * CPU ID and MM CID can only change when going through a context
+	 * switch.
+	 *
+	 * rseq_sched_switch_event() sets the rseq_event::sched_switch bit
+	 * only when rseq_event::has_rseq is true. That conditional is
+	 * required to avoid setting the TIF bit if RSEQ is not registered
+	 * for a task. rseq_event::sched_switch is cleared when RSEQ is
+	 * unregistered by a task so it's sufficient to check for the
+	 * sched_switch bit alone.
+	 *
+	 * A sane compiler requires three instructions for the nothing to do
+	 * case including clearing the events, but your mileage might vary.
+	 */
+	if (unlikely((t->rseq.event.sched_switch))) {
+		rseq_stat_inc(rseq_stats.fastpath);

+		if (unlikely(!rseq_exit_user_update(regs, t)))
+			return true;
+	}
+	/* Clear state so next entry starts from a clean slate */
+	t->rseq.event.events = 0;
+	return false;
+}

+static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+	if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+		current->rseq.event.slowpath = true;
+		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+		return true;
+	}
+	return false;
+}

+#endif /* CONFIG_GENERIC_ENTRY */

 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	struct rseq_event *ev = &current->rseq.event;
@@ -421,9 +543,12 @@ static inline void rseq_debug_syscall_re
 	if (static_branch_unlikely(&rseq_debug_enabled))
 		__rseq_debug_syscall_return(regs);
 }
-
 #else /* CONFIG_RSEQ */
 static inline void rseq_note_user_irq_entry(void) { }
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+	return false;
+}
 static inline void rseq_exit_to_user_mode(void) { }
 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
 #endif /* !CONFIG_RSEQ */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -18,6 +18,8 @@ struct rseq;
  * @has_rseq:	True if the task has a rseq pointer installed
  * @error:	Compound error code for the slow path to analyze
  * @fatal:	User space data corrupted or invalid
+ * @slowpath:	Indicator that slow path processing via TIF_NOTIFY_RESUME
+ *		is required
  *
  * @sched_switch and @ids_changed must be adjacent and the combo must be
  * 16bit aligned to allow a single store, when both are set at the same
@@ -42,6 +44,7 @@ struct rseq_event {
 			u16	error;
 			struct {
 				u8	fatal;
+				u8	slowpath;
 			};
 		};
 	};
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -133,6 +133,7 @@ static int rseq_stats_show(struct seq_fi
 	stats.exit	+= data_race(per_cpu(rseq_stats.exit, cpu));
 	stats.signal	+= data_race(per_cpu(rseq_stats.signal, cpu));
 	stats.slowpath	+= data_race(per_cpu(rseq_stats.slowpath, cpu));
+	stats.fastpath	+= data_race(per_cpu(rseq_stats.fastpath, cpu));
 	stats.ids	+= data_race(per_cpu(rseq_stats.ids, cpu));
 	stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
 	stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
@@ -142,6 +143,7 @@ static int rseq_stats_show(struct seq_fi
 	seq_printf(m, "exit:   %16lu\n", stats.exit);
 	seq_printf(m, "signal: %16lu\n", stats.signal);
 	seq_printf(m, "slowp:  %16lu\n", stats.slowpath);
+	seq_printf(m, "fastp:  %16lu\n", stats.fastpath);
 	seq_printf(m, "ids:    %16lu\n", stats.ids);
 	seq_printf(m, "cs:     %16lu\n", stats.cs);
 	seq_printf(m, "clear:  %16lu\n", stats.clear);
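With the counter summed up like the other stats, the statistics file grows
a 'fastp' line. Assuming the stats interface added earlier in this series
(the exact file location is not shown here), the output then has this
shape, using for illustration the kernel build 'After' numbers quoted in
the next patch:

	exit:            80514451
	signal:               121
	slowp:                198
	fastp:             675941
	ids:                50541
	cs:                     0
	clear:                  0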
From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084307.701201365@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
 "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
Subject: [patch V6 28/31] rseq: Switch to fast path processing on exit to user
References: <20251027084220.785525188@linutronix.de>
Date: Mon, 27 Oct 2025 09:45:19 +0100 (CET)

Now that all bits and pieces are in place, hook the RSEQ handling fast
path function into exit_to_user_mode_prepare() after the TIF work bits
have been handled. In case of fast path failure, TIF_NOTIFY_RESUME has
been raised and the caller needs to take another turn through the TIF
handling slow path.

This only works for architectures which use the generic entry code.
Architectures which still have their own incomplete hacks are not
supported and won't be.
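Condensed, the resulting outer loop has the following shape. This is a
sketch of the kernel/entry/common.c hunk further down, not the literal
code:

	for (;;) {
		/* Handle the regular TIF work first */
		ti_work = __exit_to_user_mode_loop(regs, ti_work);

		/* Fast path done or nothing to do? Then exit to user. */
		if (likely(!rseq_exit_to_user_mode_restart(regs)))
			return ti_work;

		/* Failure: TIF_NOTIFY_RESUME was raised, take another turn */
		ti_work = read_thread_flags();
	}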
This results in the following improvements:

Kernel build:

                     Before              After           Reduction
 exit to user:     80692981           80514451
 signal checks:       32581                121           99%
 slowpath runs:     1201408  1.49%        198   0.00%    100%
 fastpath runs:                        675941   0.84%    N/A
 id updates:        1233989  1.53%      50541   0.06%    96%
 cs checks:         1125366  1.39%          0   0.00%    100%
 cs cleared:        1125366   100%          0            100%
 cs fixup:                0     0%          0

RSEQ selftests:

                     Before              After           Reduction
 exit to user:    386281778          387373750
 signal checks:    35661203                  0           100%
 slowpath runs:   140542396 36.38%         100   0.00%   100%
 fastpath runs:                        9509789   2.51%   N/A
 id updates:      176203599 45.62%     9087994   2.35%   95%
 cs checks:       175587856 45.46%     4728394   1.22%   98%
 cs cleared:      172359544 98.16%     1319307  27.90%   99%
 cs fixup:          3228312  1.84%     3409087  72.10%

The 'cs cleared' and 'cs fixup' percentages are not relative to the exit
to user invocations; they are relative to the actual 'cs check'
invocations.

While some of this could have been avoided in the original code, like the
obvious clearing of CS when it's already clear, the main problem of going
through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
notify handler is invoked more than once before going out to user space.
Doing this once, when everything has stabilized, is the only way to avoid
that.

The initial attempt to completely decouple it from the TIF work turned out
to be suboptimal for workloads which do a lot of quick and short system
calls. Even if the fast path decision is only 4 instructions (including a
conditional branch), this adds up quickly and becomes measurable when the
rate of actually having to handle rseq is in the low single digit
percentage range of user/kernel transitions.
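For reference, the workloads behind these numbers are ordinary rseq users.
A minimal stand-alone registration looks like the snippet below. This is
schematic: error handling is trimmed, RSEQ_SIG is the signature value the
selftests use, and a glibc which registers rseq on its own will make the
raw syscall fail with EBUSY:

	#define _GNU_SOURCE
	#include <linux/rseq.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdio.h>

	#define RSEQ_SIG	0x53053053

	/* The ABI requires 32 byte alignment of the TLS area */
	static __thread struct rseq rs __attribute__((aligned(32)));

	int main(void)
	{
		if (syscall(__NR_rseq, &rs, sizeof(rs), 0, RSEQ_SIG)) {
			perror("rseq");
			return 1;
		}
		/* The kernel populates the IDs on the way back to user space */
		printf("cpu_id: %u\n", rs.cpu_id);
		return 0;
	}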
Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
V4: Move the rseq handling into a separate loop to avoid gotos later on
---
 include/linux/irq-entry-common.h |    7 ++-----
 include/linux/resume_user_mode.h |    2 +-
 include/linux/rseq.h             |   18 ++++++++++++------
 init/Kconfig                     |    2 +-
 kernel/entry/common.c            |   26 +++++++++++++++++++-------
 kernel/rseq.c                    |    8 ++++++--
 6 files changed, 41 insertions(+), 22 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -197,11 +197,8 @@ static __always_inline void arch_exit_to
  */
 void arch_do_signal_or_restart(struct pt_regs *regs);

-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- */
-unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-				     unsigned long ti_work);
+/* Handle pending TIF work */
+unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);

 /**
  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();

-	rseq_handle_notify_resume(regs);
+	rseq_handle_slowpath(regs);
 }

 #endif /* LINUX_RESUME_USER_MODE_H */
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,13 +7,19 @@

 #include

-void __rseq_handle_notify_resume(struct pt_regs *regs);
+void __rseq_handle_slowpath(struct pt_regs *regs);

-static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+/* Invoked from resume_user_mode_work() */
+static inline void rseq_handle_slowpath(struct pt_regs *regs)
 {
-	/* '&' is intentional to spare one conditional branch */
-	if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
-		__rseq_handle_notify_resume(regs);
+	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
+		if (current->rseq.event.slowpath)
+			__rseq_handle_slowpath(regs);
+	} else {
+		/* '&' is intentional to spare one conditional branch */
+		if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
+			__rseq_handle_slowpath(regs);
+	}
 }

 void __rseq_signal_deliver(int sig, struct pt_regs *regs);
@@ -152,7 +158,7 @@ static inline void rseq_fork(struct task
 }

 #else /* CONFIG_RSEQ */
-static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
 static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1941,7 +1941,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
 config DEBUG_RSEQ
 	default n
 	bool "Enable debugging of rseq() system call" if EXPERT
-	depends on RSEQ && DEBUG_KERNEL
+	depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
 	select RSEQ_DEBUG_DEFAULT_ENABLE
 	help
 	  Enable extra debugging checks for the rseq system call.
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,13 +11,8 @@
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }

-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- * @regs: Pointer to pt_regs on entry stack
- * @ti_work: TIF work flags as read by the caller
- */
-__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-						     unsigned long ti_work)
+static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
+							       unsigned long ti_work)
 {
 	/*
 	 * Before returning to user space ensure that all pending work
@@ -62,6 +57,23 @@ void __weak arch_do_signal_or_restart(st
 	return ti_work;
 }

+/**
+ * exit_to_user_mode_loop - do any pending work before leaving to user space
+ * @regs: Pointer to pt_regs on entry stack
+ * @ti_work: TIF work flags as read by the caller
+ */
+__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
+						     unsigned long ti_work)
+{
+	for (;;) {
+		ti_work = __exit_to_user_mode_loop(regs, ti_work);
+
+		if (likely(!rseq_exit_to_user_mode_restart(regs)))
+			return ti_work;
+		ti_work = read_thread_flags();
+	}
+}
+
 noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 {
 	irqentry_state_t ret = {
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -237,7 +237,11 @@ static bool rseq_handle_cs(struct task_s

 static void rseq_slowpath_update_usr(struct pt_regs *regs)
 {
-	/* Preserve rseq state and user_irq state for exit to user */
+	/*
+	 * Preserve rseq state and user_irq state. The generic entry code
+	 * clears user_irq on the way out; the non-generic entry
+	 * architectures do not have user_irq.
+	 */
 	const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
 	struct task_struct *t = current;
 	struct rseq_ids ids;
@@ -289,7 +293,7 @@ static void rseq_slowpath_update_usr(str
 	}
 }

-void __rseq_handle_notify_resume(struct pt_regs *regs)
+void __rseq_handle_slowpath(struct pt_regs *regs)
 {
 	/*
 	 * If invoked from hypervisors before entering the guest via
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 29/31] entry: Split up exit_to_user_mode_prepare() References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:45:21 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" exit_to_user_mode_prepare() is used for both interrupts and syscalls, but there is extra rseq work, which is only required for in the interrupt exit case. Split up the function and provide wrappers for syscalls and interrupts, which allows to separate the rseq exit work in the next step. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/entry-common.h | 2 - include/linux/irq-entry-common.h | 42 ++++++++++++++++++++++++++++++++++= ----- 2 files changed, 38 insertions(+), 6 deletions(-) --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -156,7 +156,7 @@ static __always_inline void syscall_exit if (unlikely(work & SYSCALL_WORK_EXIT)) syscall_exit_work(regs, work); local_irq_disable_exit_to_user(); - exit_to_user_mode_prepare(regs); + syscall_exit_to_user_mode_prepare(regs); } =20 /** --- a/include/linux/irq-entry-common.h +++ b/include/linux/irq-entry-common.h @@ -201,7 +201,7 @@ void arch_do_signal_or_restart(struct pt unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long t= i_work); =20 /** - * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required + * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required * @regs: Pointer to pt_regs on entry stack * * 1) check that interrupts are disabled @@ -209,8 +209,10 @@ unsigned long exit_to_user_mode_loop(str * 3) call exit_to_user_mode_loop() if any flags from * EXIT_TO_USER_MODE_WORK are set * 4) check that interrupts are still disabled + * + * Don't invoke directly, use the syscall/irqentry_ prefixed variants below */ -static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs) +static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *re= gs) { unsigned long ti_work; =20 @@ -224,15 +226,45 @@ static __always_inline void exit_to_user ti_work =3D exit_to_user_mode_loop(regs, ti_work); =20 arch_exit_to_user_mode_prepare(regs, ti_work); +} =20 - rseq_exit_to_user_mode(); - +static __always_inline void __exit_to_user_mode_validate(void) +{ /* Ensure that kernel state is sane for a return to userspace */ kmap_assert_nomap(); lockdep_assert_irqs_disabled(); lockdep_sys_exit(); } =20 + +/** + * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if re= quired + * @regs: Pointer to pt_regs on entry stack + * + * Wrapper around __exit_to_user_mode_prepare() to separate the exit work = for + * syscalls and interrupts. + */ +static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_re= gs *regs) +{ + __exit_to_user_mode_prepare(regs); + rseq_exit_to_user_mode(); + __exit_to_user_mode_validate(); +} + +/** + * irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if r= equired + * @regs: Pointer to pt_regs on entry stack + * + * Wrapper around __exit_to_user_mode_prepare() to separate the exit work = for + * syscalls and interrupts. 
+ */
+static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	__exit_to_user_mode_prepare(regs);
+	rseq_exit_to_user_mode();
+	__exit_to_user_mode_validate();
+}
+
 /**
  * exit_to_user_mode - Fixup state when exiting to user mode
  *
@@ -297,7 +329,7 @@ static __always_inline void irqentry_ent
 static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
 	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
+	irqentry_exit_to_user_mode_prepare(regs);
 	instrumentation_end();
 	exit_to_user_mode();
 }
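The resulting call structure, with the rseq exit work still shared between
both paths, looks like this (a condensed view of the wrappers above; the
split of rseq_exit_to_user_mode() itself happens in the next patch):

	/* syscall return */
	syscall_exit_to_user_mode_work()
	  syscall_exit_to_user_mode_prepare()
	    __exit_to_user_mode_prepare()
	    rseq_exit_to_user_mode()
	    __exit_to_user_mode_validate()

	/* interrupt return */
	irqentry_exit_to_user_mode()
	  irqentry_exit_to_user_mode_prepare()
	    __exit_to_user_mode_prepare()
	    rseq_exit_to_user_mode()
	    __exit_to_user_mode_validate()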
From nobody Thu Dec 18 04:27:55 2025
Message-ID: <20251027084307.842785700@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
 "Paul E. McKenney", x86@kernel.org, Sean Christopherson, Wei Liu
Subject: [patch V6 30/31] rseq: Split up rseq_exit_to_user_mode()
References: <20251027084220.785525188@linutronix.de>
Date: Mon, 27 Oct 2025 09:45:24 +0100 (CET)

Separate the interrupt and syscall exit handling. Syscall exit does not
require clearing the user_irq bit, as it cannot be set there. On interrupt
exit it can be set when the interrupt did not result in a scheduling event
and therefore the return path did not invoke the TIF work handling, which
would have cleared it.

The debug check for the event state is also not really required even when
debug mode is enabled via the static key. Debug mode largely aids user
space by enabling a larger set of validation checks, which cause a
segfault when a malformed critical section is detected. In production mode
the critical section handling takes the content mostly as is and lets user
space keep the pieces when it screwed up.

For kernel changes in that area the state check is useful, but that can be
done when lockdep is enabled, which is anyway a required test scenario for
fundamental changes.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/irq-entry-common.h |    4 ++--
 include/linux/rseq_entry.h       |   21 +++++++++++++++++----
 2 files changed, 19 insertions(+), 6 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -247,7 +247,7 @@ static __always_inline void __exit_to_us
 static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
 	__exit_to_user_mode_prepare(regs);
-	rseq_exit_to_user_mode();
+	rseq_syscall_exit_to_user_mode();
 	__exit_to_user_mode_validate();
 }

@@ -261,7 +261,7 @@ static __always_inline void syscall_exit
 static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
 	__exit_to_user_mode_prepare(regs);
-	rseq_exit_to_user_mode();
+	rseq_irqentry_exit_to_user_mode();
 	__exit_to_user_mode_validate();
 }

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -520,19 +520,31 @@ static __always_inline bool rseq_exit_to

 #endif /* CONFIG_GENERIC_ENTRY */

-static __always_inline void rseq_exit_to_user_mode(void)
+static __always_inline void rseq_syscall_exit_to_user_mode(void)
 {
 	struct rseq_event *ev = &current->rseq.event;

 	rseq_stat_inc(rseq_stats.exit);

-	if (static_branch_unlikely(&rseq_debug_enabled))
+	/* Needed to remove the store for the !lockdep case */
+	if (IS_ENABLED(CONFIG_LOCKDEP)) {
 		WARN_ON_ONCE(ev->sched_switch);
+		ev->events = 0;
+	}
+}
+
+static __always_inline void rseq_irqentry_exit_to_user_mode(void)
+{
+	struct rseq_event *ev = &current->rseq.event;
+
+	rseq_stat_inc(rseq_stats.exit);
+
+	lockdep_assert_once(!ev->sched_switch);

 	/*
 	 * Ensure that event (especially user_irq) is cleared when the
 	 * interrupt did not result in a schedule and therefore the
-	 * rseq processing did not clear it.
+	 * rseq processing could not clear it.
 	 */
 	ev->events = 0;
 }
@@ -550,7 +562,8 @@ static inline bool rseq_exit_to_user_mod
 {
 	return false;
 }
-static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_syscall_exit_to_user_mode(void) { }
+static inline void rseq_irqentry_exit_to_user_mode(void) { }
 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
 #endif /* !CONFIG_RSEQ */
McKenney" , x86@kernel.org, Sean Christopherson , Wei Liu Subject: [patch V6 31/31] rseq: Switch to TIF_RSEQ if supported References: <20251027084220.785525188@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Mon, 27 Oct 2025 09:45:26 +0100 (CET) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal especially with the RSEQ fast path depending on it, but not really handling it. Define a seperate TIF_RSEQ in the generic TIF space and enable the full seperation of fast and slow path for architectures which utilize that. That avoids the hassle with invocations of resume_user_mode_work() from hypervisors, which clear TIF_NOTIFY_RESUME. It makes the therefore required re-evaluation at the end of vcpu_run() a NOOP on architectures which utilize the generic TIF space and have a seperate TIF_RSEQ. The hypervisor TIF handling does not include the seperate TIF_RSEQ as there is no point in doing so. The guest does neither know nor care about the VMM host applications RSEQ state. That state is only relevant when the ioctl() returns to user space. The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure handling, but this only happens within exit_to_user_mode_loop(), so arguably the hypervisor ioctl() code is long done when this happens. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- V4: Adjust it to the new outer loop mechanism V3: Updated the comment for rseq_virt_userspace_exit() - Sean Added a static assert for TIF_RSEQ !=3D TIF_NOTIFY_RESUME - Sean --- include/asm-generic/thread_info_tif.h | 3 +++ include/linux/irq-entry-common.h | 2 +- include/linux/rseq.h | 22 +++++++++++++++------- include/linux/rseq_entry.h | 27 +++++++++++++++++++++++++-- include/linux/thread_info.h | 5 +++++ kernel/entry/common.c | 10 ++++++++-- 6 files changed, 57 insertions(+), 12 deletions(-) --- a/include/asm-generic/thread_info_tif.h +++ b/include/asm-generic/thread_info_tif.h @@ -45,4 +45,7 @@ # define _TIF_RESTORE_SIGMASK BIT(TIF_RESTORE_SIGMASK) #endif =20 +#define TIF_RSEQ 11 // Run RSEQ fast path +#define _TIF_RSEQ BIT(TIF_RSEQ) + #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */ --- a/include/linux/irq-entry-common.h +++ b/include/linux/irq-entry-common.h @@ -30,7 +30,7 @@ #define EXIT_TO_USER_MODE_WORK \ (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \ - _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \ + _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ | \ ARCH_EXIT_TO_USER_MODE_WORK) =20 /** --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -42,7 +42,7 @@ static inline void rseq_signal_deliver(s =20 static inline void rseq_raise_notify_resume(struct task_struct *t) { - set_tsk_thread_flag(t, TIF_NOTIFY_RESUME); + set_tsk_thread_flag(t, TIF_RSEQ); } =20 /* Invoked from context switch to force evaluation on exit to user */ @@ -114,17 +114,25 @@ static inline void rseq_force_update(voi =20 /* * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode, - * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in - * that case just to do it eventually again before returning to user space, - * the entry resume_user_mode_work() invocation is ignored as the register - * argument is NULL. 
Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
V4: Adjust it to the new outer loop mechanism
V3: Updated the comment for rseq_virt_userspace_exit() - Sean
    Added a static assert for TIF_RSEQ != TIF_NOTIFY_RESUME - Sean
---
 include/asm-generic/thread_info_tif.h |    3 +++
 include/linux/irq-entry-common.h      |    2 +-
 include/linux/rseq.h                  |   22 +++++++++++++++-------
 include/linux/rseq_entry.h            |   27 +++++++++++++++++++++++++--
 include/linux/thread_info.h           |    5 +++++
 kernel/entry/common.c                 |   10 ++++++++--
 6 files changed, 57 insertions(+), 12 deletions(-)

--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -45,4 +45,7 @@
 # define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
 #endif

+#define TIF_RSEQ		11	// Run RSEQ fast path
+#define _TIF_RSEQ		BIT(TIF_RSEQ)
+
 #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -30,7 +30,7 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY |			\
-	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |			\
+	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ |		\
 	 ARCH_EXIT_TO_USER_MODE_WORK)

 /**
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -42,7 +42,7 @@ static inline void rseq_signal_deliver(s

 static inline void rseq_raise_notify_resume(struct task_struct *t)
 {
-	set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	set_tsk_thread_flag(t, TIF_RSEQ);
 }

 /* Invoked from context switch to force evaluation on exit to user */
@@ -114,17 +114,25 @@ static inline void rseq_force_update(voi

 /*
  * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
- * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
- * that case just to do it eventually again before returning to user space,
- * the entry resume_user_mode_work() invocation is ignored as the register
- * argument is NULL.
+ * which clears TIF_NOTIFY_RESUME on architectures that don't use the
+ * generic TIF bits and therefore can't provide a separate TIF_RSEQ flag.
  *
- * After returning from guest mode, they have to invoke this function to
- * re-raise TIF_NOTIFY_RESUME if necessary.
+ * To avoid updating user space RSEQ in that case just to do it eventually
+ * again before returning to user space, __rseq_handle_slowpath() does
+ * nothing when invoked with NULL register state.
+ *
+ * After returning from guest mode, before exiting to userspace, hypervisors
+ * must invoke this function to re-raise TIF_NOTIFY_RESUME if necessary.
  */
 static inline void rseq_virt_userspace_exit(void)
 {
-	if (current->rseq.event.sched_switch)
+	/*
+	 * The generic optimization for deferring RSEQ updates until the next
+	 * exit relies on having a dedicated TIF_RSEQ.
+	 */
+	if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) &&
+	    current->rseq.event.sched_switch)
 		rseq_raise_notify_resume(current);
 }

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -507,13 +507,36 @@ static __always_inline bool __rseq_exit_
 	return false;
 }

-static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+/* Required to allow conversion to GENERIC_ENTRY w/o GENERIC_TIF_BITS */
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+static __always_inline bool test_tif_rseq(unsigned long ti_work)
 {
+	return ti_work & _TIF_RSEQ;
+}
+
+static __always_inline void clear_tif_rseq(void)
+{
+	static_assert(TIF_RSEQ != TIF_NOTIFY_RESUME);
+	clear_thread_flag(TIF_RSEQ);
+}
+#else
+static __always_inline bool test_tif_rseq(unsigned long ti_work) { return true; }
+static __always_inline void clear_tif_rseq(void) { }
+#endif
+
+static __always_inline bool
+rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
+{
+	if (likely(!test_tif_rseq(ti_work)))
+		return false;
+
 	if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
 		current->rseq.event.slowpath = true;
 		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
 		return true;
 	}
+
+	clear_tif_rseq();
 	return false;
 }

@@ -557,7 +580,7 @@ static inline void rseq_debug_syscall_re
 }
 #else /* CONFIG_RSEQ */
 static inline void rseq_note_user_irq_entry(void) { }
-static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
 {
 	return false;
 }
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -67,6 +67,11 @@ enum syscall_work_bit {
 #define _TIF_NEED_RESCHED_LAZY	_TIF_NEED_RESCHED
 #endif

+#ifndef TIF_RSEQ
+# define TIF_RSEQ	TIF_NOTIFY_RESUME
+# define _TIF_RSEQ	_TIF_NOTIFY_RESUME
+#endif
+
 #ifdef __KERNEL__

 #ifndef arch_set_restart_data
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,6 +11,12 @@
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }

+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+#define EXIT_TO_USER_MODE_WORK_LOOP	(EXIT_TO_USER_MODE_WORK & ~_TIF_RSEQ)
+#else
+#define EXIT_TO_USER_MODE_WORK_LOOP	(EXIT_TO_USER_MODE_WORK)
+#endif
+
 static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
							       unsigned long ti_work)
 {
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
-	while (ti_work & EXIT_TO_USER_MODE_WORK) {
+	while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {

 		local_irq_enable_exit_to_user(ti_work);

@@ -68,7 +74,7 @@ static __always_inline unsigned long __e
 	for (;;) {
 		ti_work = __exit_to_user_mode_loop(regs, ti_work);

-		if (likely(!rseq_exit_to_user_mode_restart(regs)))
+		if (likely(!rseq_exit_to_user_mode_restart(regs, ti_work)))
 			return ti_work;
 		ti_work = read_thread_flags();
 	}
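Taken together with the previous patches, the fully separated exit path
boils down to the following. This is a condensed sketch with the helpers
from the diffs above inlined; the real control flow lives in
exit_to_user_mode_loop() and rseq_exit_to_user_mode_restart():

	for (;;) {
		/* TIF_RSEQ is masked out of the work loop on generic TIF archs */
		ti_work = __exit_to_user_mode_loop(regs, ti_work);

		if (!test_tif_rseq(ti_work))
			break;				/* no rseq work pending */

		if (!__rseq_exit_to_user_mode_restart(regs)) {
			clear_tif_rseq();		/* fast path succeeded */
			break;
		}

		/* Fault or invalid data: defer to the slow path */
		current->rseq.event.slowpath = true;
		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
		ti_work = read_thread_flags();
	}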