From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161653.322198601@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney",
 Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui,
 x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
 Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 01/37] rseq: Avoid pointless evaluation in __rseq_notify_resume()
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:39:07 +0200 (CEST)

From: Thomas Gleixner

The RSEQ critical section mechanism only clears the event mask when a
critical section is registered, otherwise it is stale and collects bits.
That means once a critical section is installed, the first invocation of
that code when TIF_NOTIFY_RESUME is set will abort the critical section,
even when the TIF bit was not raised by the rseq preempt/migrate/signal
helpers.

This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is utilized by quite some infrastructure. That
means every invocation of __rseq_notify_resume() goes unconditionally
through the heavy lifting of user space access and consistency checks even
if there is no reason to do so.

Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.

Avoid this by reading and clearing the event mask before doing the user
space critical section access with interrupts or preemption disabled,
which ensures that the read and clear operation is CPU-local atomic versus
scheduling and the membarrier IPI. This is correct as after re-enabling
interrupts/preemption any relevant event will set the bit again and raise
TIF_NOTIFY_RESUME, which makes the user space exit code take another round
of TIF bit clearing.

If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.

Add an exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.

While at it, reword the convoluted comment about why the pt_regs pointer
can be NULL under certain circumstances.

Signed-off-by: Thomas Gleixner
Cc: Mathieu Desnoyers
Cc: Peter Zijlstra
Cc: "Paul E. McKenney"
Cc: Boqun Feng
Reviewed-by: Mathieu Desnoyers
---
 include/linux/irq-entry-common.h |  7 ++--
 include/linux/rseq.h             | 10 +++++
 kernel/rseq.c                    | 66 +++++++++++++++++++++++-----------
 3 files changed, 58 insertions(+), 25 deletions(-)
---
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -2,11 +2,12 @@
 #ifndef __LINUX_IRQENTRYCOMMON_H
 #define __LINUX_IRQENTRYCOMMON_H
 
+#include
+#include
+#include
 #include
 #include
-#include
 #include
-#include
 #include
 
 #include
@@ -226,6 +227,8 @@ static __always_inline void exit_to_user
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
+	rseq_exit_to_user_mode();
+
 	/* Ensure that kernel state is sane for a return to userspace */
 	kmap_assert_nomap();
 	lockdep_assert_irqs_disabled();
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -66,6 +66,14 @@ static inline void rseq_migrate(struct t
 	rseq_set_notify_resume(t);
 }
 
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
+		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
+			current->rseq_event_mask = 0;
+	}
+}
+
 /*
  * If parent process has a registered restartable sequences area, the
  * child inherits. Unregister rseq for a clone with CLONE_VM set.
@@ -118,7 +126,7 @@ static inline void rseq_fork(struct task
 static inline void rseq_execve(struct task_struct *t)
 {
 }
-
+static inline void rseq_exit_to_user_mode(void) { }
 #endif
 
 #ifdef CONFIG_DEBUG_RSEQ
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *
 	return true;
 }
 
-static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
+static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
 {
-	u32 flags, event_mask;
+	u32 flags;
 	int ret;
 
 	if (rseq_warn_flags("rseq_cs", cs_flags))
@@ -339,17 +339,7 @@ static int rseq_need_restart(struct task
 
 	if (rseq_warn_flags("rseq", flags))
 		return -EINVAL;
-
-	/*
-	 * Load and clear event mask atomically with respect to
-	 * scheduler preemption and membarrier IPIs.
-	 */
-	scoped_guard(RSEQ_EVENT_GUARD) {
-		event_mask = t->rseq_event_mask;
-		t->rseq_event_mask = 0;
-	}
-
-	return !!event_mask;
+	return 0;
 }
 
 static int clear_rseq_cs(struct rseq __user *rseq)
@@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip,
 	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
 }
 
-static int rseq_ip_fixup(struct pt_regs *regs)
+static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
 {
 	unsigned long ip = instruction_pointer(regs);
 	struct task_struct *t = current;
@@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs
 	 */
 	if (!in_rseq_cs(ip, &rseq_cs))
 		return clear_rseq_cs(t->rseq);
-	ret = rseq_need_restart(t, rseq_cs.flags);
-	if (ret <= 0)
+	ret = rseq_check_flags(t, rseq_cs.flags);
+	if (ret < 0)
 		return ret;
+	if (!abort)
+		return 0;
 	ret = clear_rseq_cs(t->rseq);
 	if (ret)
 		return ret;
@@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct
 		return;
 
 	/*
-	 * regs is NULL if and only if the caller is in a syscall path. Skip
-	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
-	 * kill a misbehaving userspace on debug kernels.
+	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
+	 * pointer, so fixup cannot be done. If the syscall which led to
+	 * this invocation was invoked inside a critical section, then it
+	 * will either end up in this code again or a possible violation of
+	 * a syscall inside a critical region can only be detected by the
+	 * debug code in rseq_syscall() in a debug enabled kernel.
 	 */
 	if (regs) {
-		ret = rseq_ip_fixup(regs);
-		if (unlikely(ret < 0))
-			goto error;
+		/*
+		 * Read and clear the event mask first. If the task was not
+		 * preempted or migrated or a signal is on the way, there
+		 * is no point in doing any of the heavy lifting here on
+		 * production kernels. In that case TIF_NOTIFY_RESUME was
+		 * raised by some other functionality.
+		 *
+		 * This is correct because the read/clear operation is
+		 * guarded against scheduler preemption, which makes it CPU
+		 * local atomic. If the task is preempted right after
+		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
+		 * again and this function is invoked another time _before_
+		 * the task is able to return to user mode.
+		 *
+		 * On a debug kernel, invoke the fixup code unconditionally
+		 * with the result handed in to allow the detection of
+		 * inconsistencies.
+		 */
+		u32 event_mask;
+
+		scoped_guard(RSEQ_EVENT_GUARD) {
+			event_mask = t->rseq_event_mask;
+			t->rseq_event_mask = 0;
+		}
+
+		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
+			ret = rseq_ip_fixup(regs, !!event_mask);
+			if (unlikely(ret < 0))
+				goto error;
+		}
 	}
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
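[ Editor's illustration. The essence of the change is the read-and-clear of
  the event mask inside a guard that disables preemption (or interrupts when
  membarrier is enabled), which makes the operation CPU-local atomic: any
  event arriving after the guard is dropped re-raises TIF_NOTIFY_RESUME. A
  minimal user space model of that pattern, with signal blocking standing in
  for the preemption/IRQ guard and a signal handler standing in for the rseq
  event helpers; all names below are illustrative, not kernel API: ]

#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t event_pending;	/* models t->rseq_event_mask */

static void event_source(int sig)
{
	(void)sig;
	event_pending = 1;	/* models the preempt/migrate/signal helpers */
}

static bool read_and_clear_event(void)
{
	sigset_t all, old;
	bool event;

	/* Stand-in for the RSEQ_EVENT_GUARD irq/preempt scope */
	sigfillset(&all);
	sigprocmask(SIG_BLOCK, &all, &old);
	event = event_pending;
	event_pending = 0;
	sigprocmask(SIG_SETMASK, &old, NULL);
	/* An event arriving after this point simply re-raises the flag */
	return event;
}

int main(void)
{
	signal(SIGALRM, event_source);
	alarm(1);
	for (int i = 0; i < 3; i++) {
		if (read_and_clear_event())
			puts("event pending: run the fixup slow path");
		else
			puts("no event: skip the user space access entirely");
		sleep(1);
	}
	return 0;
}

[ The design point is that an event cannot be lost: it is either observed
  by this read, or it arrives later and re-arms the flag before the task
  can return to user mode. ]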
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161653.387469844@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney",
 Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui,
 x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
 Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 02/37] rseq: Condense the inline stubs
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:39:09 +0200 (CEST)

From: Thomas Gleixner

Scrolling over tons of pointless { } lines to find the actual code is
annoying at best.

Signed-off-by: Thomas Gleixner
Cc: Mathieu Desnoyers
Cc: Peter Zijlstra
Cc: "Paul E. McKenney"
Cc: Boqun Feng
Reviewed-by: Mathieu Desnoyers
---
 include/linux/rseq.h | 47 ++++++++++++-----------------------------------
 1 file changed, 12 insertions(+), 35 deletions(-)
---
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -101,44 +101,21 @@ static inline void rseq_execve(struct ta
 	t->rseq_event_mask = 0;
 }
 
-#else
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-}
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
-					     struct pt_regs *regs)
-{
-}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
-				       struct pt_regs *regs)
-{
-}
-static inline void rseq_preempt(struct task_struct *t)
-{
-}
-static inline void rseq_migrate(struct task_struct *t)
-{
-}
-static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
-{
-}
-static inline void rseq_execve(struct task_struct *t)
-{
-}
+#else /* CONFIG_RSEQ */
+static inline void rseq_set_notify_resume(struct task_struct *t) { }
+static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_preempt(struct task_struct *t) { }
+static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
+static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
-#endif
+#endif /* !CONFIG_RSEQ */
 
 #ifdef CONFIG_DEBUG_RSEQ
-
 void rseq_syscall(struct pt_regs *regs);
-
-#else
-
-static inline void rseq_syscall(struct pt_regs *regs)
-{
-}
-
-#endif
+#else /* CONFIG_DEBUG_RSEQ */
+static inline void rseq_syscall(struct pt_regs *regs) { }
+#endif /* !CONFIG_DEBUG_RSEQ */
 
 #endif /* _LINUX_RSEQ_H */
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161653.452303254@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney",
 Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui,
 x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
 Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 03/37] rseq: Move algorithm comment to top
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:39:12 +0200 (CEST)

Move the comment which documents the RSEQ algorithm to the top of the
file, so it does not create horrible diffs later when the actual
implementation is fed into the mincer.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 kernel/rseq.c | 119 +++++++++++++++++++++++++++------------------------
 1 file changed, 59 insertions(+), 60 deletions(-)
---
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -8,6 +8,65 @@
  * Mathieu Desnoyers
  */
 
+/*
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * It allows user-space to perform update operations on per-cpu data
+ * without requiring heavy-weight atomic operations.
+ *
+ * Detailed algorithm of rseq user-space assembly sequences:
+ *
+ *                     init(rseq_cs)
+ *                     cpu = TLS->rseq::cpu_id_start
+ *   [1]               TLS->rseq::rseq_cs = rseq_cs
+ *   [start_ip]        ----------------------------
+ *   [2]               if (cpu != TLS->rseq::cpu_id)
+ *                             goto abort_ip;
+ *   [3]               <last_instruction_in_cs>
+ *   [post_commit_ip]  ----------------------------
+ *
+ *   The address of jump target abort_ip must be outside the critical
+ *   region, i.e.:
+ *
+ *     [abort_ip] < [start_ip]  || [abort_ip] >= [post_commit_ip]
+ *
+ *   Steps [2]-[3] (inclusive) need to be a sequence of instructions in
+ *   userspace that can handle being interrupted between any of those
+ *   instructions, and then resumed to the abort_ip.
+ *
+ *   1.  Userspace stores the address of the struct rseq_cs assembly
+ *       block descriptor into the rseq_cs field of the registered
+ *       struct rseq TLS area. This update is performed through a single
+ *       store within the inline assembly instruction sequence.
+ *       [start_ip]
+ *
+ *   2.  Userspace tests to check whether the current cpu_id field match
+ *       the cpu number loaded before start_ip, branching to abort_ip
+ *       in case of a mismatch.
+ *
+ *       If the sequence is preempted or interrupted by a signal
+ *       at or after start_ip and before post_commit_ip, then the kernel
+ *       clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
+ *       ip to abort_ip before returning to user-space, so the preempted
+ *       execution resumes at abort_ip.
+ *
+ *   3.  Userspace critical section final instruction before
+ *       post_commit_ip is the commit. The critical section is
+ *       self-terminating.
+ *       [post_commit_ip]
+ *
+ *   4.  <success>
+ *
+ *   On failure at [2], or if interrupted by preempt or signal delivery
+ *   between [1] and [3]:
+ *
+ *       [abort_ip]
+ *   F1. <failure>
+ */
+
 #include
 #include
 #include
@@ -98,66 +157,6 @@ static int rseq_validate_ro_fields(struc
 	unsafe_put_user(value, &t->rseq->field, error_label)
 #endif
 
-/*
- *
- * Restartable sequences are a lightweight interface that allows
- * user-level code to be executed atomically relative to scheduler
- * preemption and signal delivery. Typically used for implementing
- * per-cpu operations.
- *
- * It allows user-space to perform update operations on per-cpu data
- * without requiring heavy-weight atomic operations.
- *
- * Detailed algorithm of rseq user-space assembly sequences:
- *
- *                     init(rseq_cs)
- *                     cpu = TLS->rseq::cpu_id_start
- *   [1]               TLS->rseq::rseq_cs = rseq_cs
- *   [start_ip]        ----------------------------
- *   [2]               if (cpu != TLS->rseq::cpu_id)
- *                             goto abort_ip;
- *   [3]               <last_instruction_in_cs>
- *   [post_commit_ip]  ----------------------------
- *
- * The address of jump target abort_ip must be outside the critical
- * region, i.e.:
- *
- *   [abort_ip] < [start_ip]  || [abort_ip] >= [post_commit_ip]
- *
- * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
- * userspace that can handle being interrupted between any of those
- * instructions, and then resumed to the abort_ip.
- *
- * 1.  Userspace stores the address of the struct rseq_cs assembly
- *     block descriptor into the rseq_cs field of the registered
- *     struct rseq TLS area. This update is performed through a single
- *     store within the inline assembly instruction sequence.
- *     [start_ip]
- *
- * 2.  Userspace tests to check whether the current cpu_id field match
- *     the cpu number loaded before start_ip, branching to abort_ip
- *     in case of a mismatch.
- *
- *     If the sequence is preempted or interrupted by a signal
- *     at or after start_ip and before post_commit_ip, then the kernel
- *     clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
- *     ip to abort_ip before returning to user-space, so the preempted
- *     execution resumes at abort_ip.
- *
- * 3.  Userspace critical section final instruction before
- *     post_commit_ip is the commit. The critical section is
- *     self-terminating.
- *     [post_commit_ip]
- *
- * 4.  <success>
- *
- * On failure at [2], or if interrupted by preempt or signal delivery
- * between [1] and [3]:
- *
- *     [abort_ip]
- * F1. <failure>
- */
-
 static int rseq_update_cpu_node_id(struct task_struct *t)
 {
 	struct rseq __user *rseq = t->rseq;
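[ Editor's illustration. The kernel-side range check which decides whether
  the interrupted instruction pointer sits inside the registered critical
  section (in_rseq_cs() in this file) relies on a single unsigned
  comparison. A self-contained sketch of that trick, with made-up
  descriptor values: ]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct rseq_cs_desc {
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/*
 * Mirrors "ip - rseq_cs->start_ip < rseq_cs->post_commit_offset":
 * an ip below start_ip underflows to a huge unsigned value, so one
 * subtract-and-compare covers both boundaries of the region.
 */
static bool ip_in_cs(uint64_t ip, const struct rseq_cs_desc *cs)
{
	return ip - cs->start_ip < cs->post_commit_offset;
}

int main(void)
{
	struct rseq_cs_desc cs = {
		.start_ip = 0x1000,
		.post_commit_offset = 0x40,	/* post_commit_ip = 0x1040 */
		.abort_ip = 0x2000,		/* must be outside [0x1000, 0x1040) */
	};

	/* Prints "0 1 0": before start_ip, inside, at post_commit_ip */
	printf("%d %d %d\n", ip_in_cs(0x0fff, &cs), ip_in_cs(0x1000, &cs),
	       ip_in_cs(0x1040, &cs));
	return 0;
}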
McKenney" , Boqun Feng , Paolo Bonzini , Sean Christopherson , Wei Liu , Dexuan Cui , x86@kernel.org, Arnd Bergmann , Heiko Carstens , Christian Borntraeger , Sven Schnelle , Huacai Chen , Paul Walmsley , Palmer Dabbelt Subject: [patch V2 04/37] rseq: Remove the ksig argument from rseq_handle_notify_resume() References: <20250823161326.635281786@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Sat, 23 Aug 2025 18:39:14 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" There is no point for this being visible in the resume_to_user_mode() handling. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/resume_user_mode.h | 2 +- include/linux/rseq.h | 13 +++++++------ 2 files changed, 8 insertions(+), 7 deletions(-) --- a/include/linux/resume_user_mode.h +++ b/include/linux/resume_user_mode.h @@ -59,7 +59,7 @@ static inline void resume_user_mode_work mem_cgroup_handle_over_high(GFP_KERNEL); blkcg_maybe_throttle_current(); =20 - rseq_handle_notify_resume(NULL, regs); + rseq_handle_notify_resume(regs); } =20 #endif /* LINUX_RESUME_USER_MODE_H */ --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -37,19 +37,20 @@ static inline void rseq_set_notify_resum =20 void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs= ); =20 -static inline void rseq_handle_notify_resume(struct ksignal *ksig, - struct pt_regs *regs) +static inline void rseq_handle_notify_resume(struct pt_regs *regs) { if (current->rseq) - __rseq_handle_notify_resume(ksig, regs); + __rseq_handle_notify_resume(NULL, regs); } =20 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { - scoped_guard(RSEQ_EVENT_GUARD) - __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask); - rseq_handle_notify_resume(ksig, regs); + if (current->rseq) { + scoped_guard(RSEQ_EVENT_GUARD) + __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask); + __rseq_handle_notify_resume(ksig, regs); + } } =20 /* rseq_preempt() requires preemption to be disabled. 
*/ From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0CF9525C807 for ; Sat, 23 Aug 2025 16:39:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967163; cv=none; b=jIzXK4OLyB2IvW1WR8NulED5nwyDDQxrfldH+pDIpEmcZ1j0g8+CxqC8fbiDIRuJyVWNkkj/bfq62TiUP/BqvaiHPG16FBOxYaEmSyRVhltsT00jj43RCqbD6ViBucEyxY5ipZa7j911SDEdvmqNuPkcQkGIFnR5Wi9sLXoGaG4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967163; c=relaxed/simple; bh=vTJWIftlzUS6KOLWF7NYgAJL2AOBE+7TbvnS0fsCGRw=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=u34DxsbWFHMI0T6A+/XVMADtdUK0cXoitZl2fsiRH2IJ+uV3zKuUR66HNUIGeCXGfucvhbtZsB+sSFbdretzJMZ2IeLv/smXNCM3gm0X6jFrfOjlyf+ItDvqLIrrhgF4Ar+6qr3PuWA0VmzTyzWu6W1RrrpC8MbEUD3mhNI4aLQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=0fALuubN; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=DkFYUpu4; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="0fALuubN"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="DkFYUpu4" Message-ID: <20250823161653.580926530@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1755967158; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=HYnZPOxWRd8jkfi41zon+VDT5SBjHGx9ihBaem0vnNM=; b=0fALuubNu8QtNJevJehktDyEv2N3tZwCFxqH66UPekmm81B3hzTAGFiBZRnC9/T8FoPzSk U0Ve4eyfSSi6lpTVQGTEIbV8W48hUjZx43lf8SslY2NaAVS9WSr7Gdj9MGt1JW4MbpOUdx RtAPBCwHs4ac1XyxwG+WhB9f8nf7dlKtZenR8cye3tXx8qoKZOxo/5UIMKaSV2cgNMzDbY /bfeX3MQYlYl2qFPg4pbvohLs7E15G3AU60eodqhqPH3oRSb6k7SMWx7pmQgBan3+N0pHl glIrrvLhBSXWn6dmYXAzMgVlNJfVVJlb8y0tMbNQ1C2DCtqG48ktcX/G4EnsFw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1755967158; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=HYnZPOxWRd8jkfi41zon+VDT5SBjHGx9ihBaem0vnNM=; b=DkFYUpu4gAh/phExCwu714k43366vfz82bjUBT/O9uYCdhFJImN+NlA0prTC5uFOd8Uj4t 6/SFK9pCqG9gPiBw== From: Thomas Gleixner To: LKML Cc: Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. 
McKenney" , Boqun Feng , Paolo Bonzini , Sean Christopherson , Wei Liu , Dexuan Cui , x86@kernel.org, Arnd Bergmann , Heiko Carstens , Christian Borntraeger , Sven Schnelle , Huacai Chen , Paul Walmsley , Palmer Dabbelt Subject: [patch V2 05/37] rseq: Simplify registration References: <20250823161326.635281786@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Sat, 23 Aug 2025 18:39:17 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" There is no point to read the critical section element in the newly registered user space RSEQ struct first in order to clear it. Just clear it and be done with it. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- kernel/rseq.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -492,11 +492,9 @@ void rseq_syscall(struct pt_regs *regs) /* * sys_rseq - setup restartable sequences for caller thread. */ -SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, - int, flags, u32, sig) +SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flag= s, u32, sig) { int ret; - u64 rseq_cs; =20 if (flags & RSEQ_FLAG_UNREGISTER) { if (flags & ~RSEQ_FLAG_UNREGISTER) @@ -557,11 +555,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user * avoid a potential segfault on return to user-space. The proper thing * to do would have been to fail the registration but this would break * older libcs that reuse the rseq area for new threads without - * clearing the fields. + * clearing the fields. Don't bother reading it, just reset it. */ - if (rseq_get_rseq_cs_ptr_val(rseq, &rseq_cs)) - return -EFAULT; - if (rseq_cs && clear_rseq_cs(rseq)) + if (put_user_masked_u64(0UL, &rseq->rseq_cs)) return -EFAULT; =20 #ifdef CONFIG_DEBUG_RSEQ From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7C68326C39E for ; Sat, 23 Aug 2025 16:39:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967164; cv=none; b=qt3eE2H5mH/ofKvWD4aL+Y/hRdgG8FR4FfM/Z6PM9XWwUfVpzGc7RTWunAAo/q7RpO3jSVEiYkXcB9eONnlj7tZwMx8T9c3xzzFmDiOrlIJlv6Gh4uRMkPjFk6+rD3i64WhnKgsDF5bDHVuYfsc4WdESn49rsCFQgIZ7yzWaGIM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967164; c=relaxed/simple; bh=NQgAMDZG9bJ1vSYuVqvxbYPnQ4JyZOSW+hXCf2oYBBs=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=gkeue13zAgyeVO+dD5jPBC9vUv/837KWjYjM6c/gpO+Ffxz+FXnVg+X3A+8GQMFZmsuT1UsPFb2UBQd7NW3X8LNN8wtyypnZtUEfmLc4emM/Tqf7RBE4n8+Sj5YTiAv1XeP9LgsalL7h2m6lgeDms2fkKnDlNt+VYPa8CX0Az3o= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=jot4fIVy; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=mn7C5sKI; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de 
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161653.644902433@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Peter Zijlstra, Mathieu Desnoyers, "Paul E. McKenney",
 Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui,
 x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
 Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 06/37] rseq: Simplify the event notification
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:39:19 +0200 (CEST)

Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless and just extra
work in terms of setting them individually.

Aside from that, the only relevant point where an event has to be raised
is context switch. Neither the CPU nor the MM CID can change without going
through a context switch.

Collapse them all into a single boolean, which simplifies the code a lot,
and remove the pointless invocations which have been sprinkled all over
the place for no value.

Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Mathieu Desnoyers
Cc: "Paul E. McKenney"
Cc: Boqun Feng
---
V2: Reduce it to the sched switch event.
---
 fs/exec.c                 |  2 -
 include/linux/rseq.h      | 66 +++++++++------------------------------
 include/linux/sched.h     | 10 +++---
 include/uapi/linux/rseq.h | 21 ++++----------
 kernel/rseq.c             | 28 +++++++++++--------
 kernel/sched/core.c       |  5 ---
 kernel/sched/membarrier.c |  8 ++---
 7 files changed, 48 insertions(+), 92 deletions(-)
---
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp
 		force_fatal_sig(SIGSEGV);
 
 	sched_mm_cid_after_execve(current);
-	rseq_set_notify_resume(current);
+	rseq_sched_switch_event(current);
 	current->in_execve = 0;
 
 	return retval;
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -3,38 +3,8 @@
 #define _LINUX_RSEQ_H
 
 #ifdef CONFIG_RSEQ
-
-#include
 #include
 
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD	irq
-#else
-# define RSEQ_EVENT_GUARD	preempt
-#endif
-
-/*
- * Map the event mask on the user-space ABI enum rseq_cs_flags
- * for direct mask checks.
- */
-enum rseq_event_mask_bits {
-	RSEQ_EVENT_PREEMPT_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
-	RSEQ_EVENT_SIGNAL_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
-	RSEQ_EVENT_MIGRATE_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
-};
-
-enum rseq_event_mask {
-	RSEQ_EVENT_PREEMPT	= (1U << RSEQ_EVENT_PREEMPT_BIT),
-	RSEQ_EVENT_SIGNAL	= (1U << RSEQ_EVENT_SIGNAL_BIT),
-	RSEQ_EVENT_MIGRATE	= (1U << RSEQ_EVENT_MIGRATE_BIT),
-};
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-	if (t->rseq)
-		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
-}
-
 void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -43,35 +13,27 @@ static inline void rseq_handle_notify_re
 		__rseq_handle_notify_resume(NULL, regs);
 }
 
-static inline void rseq_signal_deliver(struct ksignal *ksig,
-				       struct pt_regs *regs)
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
 	if (current->rseq) {
-		scoped_guard(RSEQ_EVENT_GUARD)
-			__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
+		current->rseq_event_pending = true;
 		__rseq_handle_notify_resume(ksig, regs);
 	}
 }
 
-/* rseq_preempt() requires preemption to be disabled. */
-static inline void rseq_preempt(struct task_struct *t)
+static inline void rseq_sched_switch_event(struct task_struct *t)
 {
-	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
-	rseq_set_notify_resume(t);
-}
-
-/* rseq_migrate() requires preemption to be disabled. */
-static inline void rseq_migrate(struct task_struct *t)
-{
-	__set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
-	rseq_set_notify_resume(t);
+	if (t->rseq) {
+		t->rseq_event_pending = true;
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	}
 }
 
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
-		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
-			current->rseq_event_mask = 0;
+		if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
+			current->rseq_event_pending = false;
 	}
 }
 
@@ -85,12 +47,12 @@ static inline void rseq_fork(struct task
 		t->rseq = NULL;
 		t->rseq_len = 0;
 		t->rseq_sig = 0;
-		t->rseq_event_mask = 0;
+		t->rseq_event_pending = false;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
-		t->rseq_event_mask = current->rseq_event_mask;
+		t->rseq_event_pending = current->rseq_event_pending;
 	}
 }
 
@@ -99,15 +61,13 @@ static inline void rseq_execve(struct ta
 	t->rseq = NULL;
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
-	t->rseq_event_mask = 0;
+	t->rseq_event_pending = false;
 }
 
 #else /* CONFIG_RSEQ */
-static inline void rseq_set_notify_resume(struct task_struct *t) { }
 static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
-static inline void rseq_preempt(struct task_struct *t) { }
-static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_sched_switch_event(struct task_struct *t) { }
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
 static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1401,14 +1401,14 @@ struct task_struct {
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_RSEQ
-	struct rseq __user *rseq;
-	u32 rseq_len;
-	u32 rseq_sig;
+	struct rseq __user		*rseq;
+	u32				rseq_len;
+	u32				rseq_sig;
 	/*
-	 * RmW on rseq_event_mask must be performed atomically
+	 * RmW on rseq_event_pending must be performed atomically
 	 * with respect to preemption.
 	 */
-	unsigned long rseq_event_mask;
+	bool				rseq_event_pending;
 # ifdef CONFIG_DEBUG_RSEQ
 	/*
 	 * This is a place holder to save a copy of the rseq fields for
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -114,20 +114,13 @@ struct rseq {
 	/*
 	 * Restartable sequences flags field.
 	 *
-	 * This field should only be updated by the thread which
-	 * registered this data structure. Read by the kernel.
-	 * Mainly used for single-stepping through rseq critical sections
-	 * with debuggers.
-	 *
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
-	 *     Inhibit instruction sequence block restart on preemption
-	 *     for this thread.
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
-	 *     Inhibit instruction sequence block restart on signal
-	 *     delivery for this thread.
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
-	 *     Inhibit instruction sequence block restart on migration for
-	 *     this thread.
+	 * This field was initially intended to allow event masking for
+	 * single-stepping through rseq critical sections with debuggers.
+	 * The kernel does not support this anymore and the relevant bits
+	 * are checked for being always false:
+	 *   - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+	 *   - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+	 *   - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
 	 */
 	__u32 flags;
 
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -78,6 +78,12 @@
 #define CREATE_TRACE_POINTS
 #include
 
+#ifdef CONFIG_MEMBARRIER
+# define RSEQ_EVENT_GUARD	irq
+#else
+# define RSEQ_EVENT_GUARD	preempt
+#endif
+
 /* The original rseq structure size (including padding) is 32 bytes. */
 #define ORIG_RSEQ_SIZE		32
 
@@ -430,11 +436,11 @@ void __rseq_handle_notify_resume(struct
 	 */
 	if (regs) {
 		/*
-		 * Read and clear the event mask first. If the task was not
-		 * preempted or migrated or a signal is on the way, there
-		 * is no point in doing any of the heavy lifting here on
-		 * production kernels. In that case TIF_NOTIFY_RESUME was
-		 * raised by some other functionality.
+		 * Read and clear the event pending bit first. If the task
+		 * was not preempted or migrated or a signal is on the way,
+		 * there is no point in doing any of the heavy lifting here
+		 * on production kernels. In that case TIF_NOTIFY_RESUME
+		 * was raised by some other functionality.
 		 *
 		 * This is correct because the read/clear operation is
 		 * guarded against scheduler preemption, which makes it CPU
@@ -447,15 +453,15 @@ void __rseq_handle_notify_resume(struct
 		 * with the result handed in to allow the detection of
 		 * inconsistencies.
 		 */
-		u32 event_mask;
+		bool event;
 
 		scoped_guard(RSEQ_EVENT_GUARD) {
-			event_mask = t->rseq_event_mask;
-			t->rseq_event_mask = 0;
+			event = t->rseq_event_pending;
+			t->rseq_event_pending = false;
 		}
 
-		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
-			ret = rseq_ip_fixup(regs, !!event_mask);
+		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+			ret = rseq_ip_fixup(regs, event);
 			if (unlikely(ret < 0))
 				goto error;
 		}
@@ -584,7 +590,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	 * registered, ensure the cpu_id_start and cpu_id fields
 	 * are updated before returning to user-space.
 	 */
-	rseq_set_notify_resume(current);
+	rseq_sched_switch_event(current);
 
 	return 0;
 }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3364,7 +3364,6 @@ void set_task_cpu(struct task_struct *p,
 		if (p->sched_class->migrate_task_rq)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
-		rseq_migrate(p);
 		sched_mm_cid_migrate_from(p);
 		perf_event_task_migrate(p);
 	}
@@ -4795,7 +4794,6 @@ int sched_cgroup_fork(struct task_struct
 		p->sched_task_group = tg;
 	}
 #endif
-	rseq_migrate(p);
 	/*
 	 * We're setting the CPU for the first time, we don't migrate,
 	 * so use __set_task_cpu().
@@ -4859,7 +4857,6 @@ void wake_up_new_task(struct task_struct
 	 * as we're not fully set-up yet.
 	 */
 	p->recent_used_cpu = task_cpu(p);
-	rseq_migrate(p);
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
 	rq = __task_rq_lock(p, &rf);
 	update_rq_clock(rq);
@@ -5153,7 +5150,7 @@ prepare_task_switch(struct rq *rq, struc
 	kcov_prepare_switch(prev);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
-	rseq_preempt(prev);
+	rseq_sched_switch_event(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -199,7 +199,7 @@ static void ipi_rseq(void *info)
 	 * is negligible.
 	 */
 	smp_mb();
-	rseq_preempt(current);
+	rseq_sched_switch_event(current);
 }
 
 static void ipi_sync_rq_state(void *info)
@@ -407,9 +407,9 @@ static int membarrier_private_expedited(
 	 * membarrier, we will end up with some thread in the mm
 	 * running without a core sync.
 	 *
-	 * For RSEQ, don't rseq_preempt() the caller. User code
-	 * is not supposed to issue syscalls at all from inside an
-	 * rseq critical section.
+	 * For RSEQ, don't invoke rseq_sched_switch_event() on the
+	 * caller. User code is not supposed to issue syscalls at
+	 * all from inside an rseq critical section.
 	 */
 	if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
 		preempt_disable();
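[ Editor's illustration. The claim that a single sched-switch event
  suffices is observable from user space: a task can never see its CPU
  (and therefore its MM CID) change without having been switched out in
  between, which is exactly the one place that now sets the flag. A small,
  purely observational sketch with no rseq involvement; it may print
  nothing on an idle machine: ]

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	int last = sched_getcpu();

	for (long i = 0; i < 50000000; i++) {
		int cpu = sched_getcpu();

		if (cpu != last) {
			/*
			 * A CPU change implies a context switch happened,
			 * the single point where the kernel now marks the
			 * rseq event as pending.
			 */
			printf("migrated %d -> %d at i=%ld\n", last, cpu, i);
			last = cpu;
		}
	}
	return 0;
}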
From: Thomas Gleixner
To: LKML
Subject: [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()
Date: Sat, 23 Aug 2025 18:39:22 +0200 (CEST)

Hypervisors invoke resume_user_mode_work() before entering the guest,
which clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no
user space context available to them, so the rseq notify handler skips
inspecting the critical section, but updates the CPU/MM CID values
unconditionally so that an eventually pending rseq event is not lost on
the way to user space.

This is a pointless exercise as the task might be rescheduled before
actually returning to user space and it creates unnecessary work in the
vcpu_run() loops.

It's way more efficient to ignore that invocation based on @regs == NULL
and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from
the vcpu_run() loop before returning from the ioctl(). This ensures that
a pending RSEQ update is not lost and the IDs are updated before
returning to user space.

Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns
into a NOOP.

Signed-off-by: Thomas Gleixner
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Wei Liu
Cc: Dexuan Cui
---
 drivers/hv/mshv_root_main.c |    2 +
 include/linux/rseq.h        |   17 +++++++++
 kernel/rseq.c               |   76 +++++++++++++++++++++++---------------------
 virt/kvm/kvm_main.c         |    3 +
 4 files changed, 62 insertions(+), 36 deletions(-)

--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -585,6 +585,8 @@ static long mshv_run_vp_with_root_schedu
 		}
 	} while (!vp->run.flags.intercept_suspend);
 
+	rseq_virt_userspace_exit();
+
 	return ret;
 }
 
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to
 }
 
 /*
+ * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
+ * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
+ * that case just to do it eventually again before returning to user space,
+ * the entry resume_user_mode_work() invocation is ignored as the register
+ * argument is NULL.
+ *
+ * After returning from guest mode, they have to invoke this function to
+ * re-raise TIF_NOTIFY_RESUME if necessary.
+ */
+static inline void rseq_virt_userspace_exit(void)
+{
+	if (current->rseq_event_pending)
+		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+}
+
+/*
  * If parent process has a registered restartable sequences area, the
  * child inherits. Unregister rseq for a clone with CLONE_VM set.
  */
@@ -68,6 +84,7 @@ static inline void rseq_execve(struct ta
 static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_virt_userspace_exit(void) { }
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
 static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct
 {
 	struct task_struct *t = current;
 	int ret, sig;
+	bool event;
+
+	/*
+	 * If invoked from hypervisors before entering the guest via
+	 * resume_user_mode_work(), then @regs is a NULL pointer.
+	 *
+	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+	 * it before returning from the ioctl() to user space when
+	 * rseq_event.sched_switch is set.
+	 *
+	 * So it's safe to ignore here instead of pointlessly updating it
+	 * in the vcpu_run() loop.
+	 */
+	if (!regs)
+		return;
 
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
 	/*
-	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
-	 * pointer, so fixup cannot be done. If the syscall which led to
-	 * this invocation was invoked inside a critical section, then it
-	 * will either end up in this code again or a possible violation of
-	 * a syscall inside a critical region can only be detected by the
-	 * debug code in rseq_syscall() in a debug enabled kernel.
+	 * Read and clear the event pending bit first. If the task
+	 * was not preempted or migrated or a signal is on the way,
+	 * there is no point in doing any of the heavy lifting here
+	 * on production kernels. In that case TIF_NOTIFY_RESUME
+	 * was raised by some other functionality.
+	 *
+	 * This is correct because the read/clear operation is
+	 * guarded against scheduler preemption, which makes it CPU
+	 * local atomic. If the task is preempted right after
+	 * re-enabling preemption then TIF_NOTIFY_RESUME is set
+	 * again and this function is invoked another time _before_
+	 * the task is able to return to user mode.
+	 *
+	 * On a debug kernel, invoke the fixup code unconditionally
+	 * with the result handed in to allow the detection of
+	 * inconsistencies.
 	 */
-	if (regs) {
-		/*
-		 * Read and clear the event pending bit first. If the task
-		 * was not preempted or migrated or a signal is on the way,
-		 * there is no point in doing any of the heavy lifting here
-		 * on production kernels. In that case TIF_NOTIFY_RESUME
-		 * was raised by some other functionality.
-		 *
-		 * This is correct because the read/clear operation is
-		 * guarded against scheduler preemption, which makes it CPU
-		 * local atomic. If the task is preempted right after
-		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
-		 * again and this function is invoked another time _before_
-		 * the task is able to return to user mode.
-		 *
-		 * On a debug kernel, invoke the fixup code unconditionally
-		 * with the result handed in to allow the detection of
-		 * inconsistencies.
-		 */
-		bool event;
-
-		scoped_guard(RSEQ_EVENT_GUARD) {
-			event = t->rseq_event_pending;
-			t->rseq_event_pending = false;
-		}
+	scoped_guard(RSEQ_EVENT_GUARD) {
+		event = t->rseq_event_pending;
+		t->rseq_event_pending = false;
+	}
 
-		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
-			ret = rseq_ip_fixup(regs, event);
-			if (unlikely(ret < 0))
-				goto error;
-		}
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+		ret = rseq_ip_fixup(regs, event);
+		if (unlikely(ret < 0))
+			goto error;
 	}
+
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
 	return;
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -4466,6 +4467,8 @@ static long kvm_vcpu_ioctl(struct file *
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		vcpu->wants_to_run = false;
 
+		rseq_virt_userspace_exit();
+
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;
 	}
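The resulting pattern in a hypervisor's run loop is sketched below. Only
rseq_virt_userspace_exit() and TIF_NOTIFY_RESUME come from the patch; the
vcpu type and the loop helpers are illustrative placeholders, not actual
KVM or MSHV code:

	/* Minimal sketch of a vcpu ioctl path after this change */
	static long vcpu_run_ioctl(struct vcpu *vcpu)
	{
		long ret;

		do {
			/*
			 * resume_user_mode_work() may run in here with a
			 * NULL pt_regs pointer; __rseq_handle_notify_resume()
			 * now ignores that invocation entirely.
			 */
			ret = enter_guest(vcpu);
		} while (keep_running(vcpu, ret));

		/*
		 * Re-raise TIF_NOTIFY_RESUME if a sched_switch event is
		 * still pending, so the real exit to user space path
		 * performs the deferred rseq work.
		 */
		rseq_virt_userspace_exit();
		return ret;
	}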
From nobody Fri Oct 3 21:53:25 2025
From: Thomas Gleixner
To: LKML
Subject: [patch V2 08/37] rseq: Avoid CPU/MM CID updates when no event pending
Date: Sat, 23 Aug 2025 18:39:24 +0200 (CEST)

There is no need to update these values unconditionally if there is no
event pending.

Signed-off-by: Thomas Gleixner
---
 kernel/rseq.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -464,11 +464,12 @@ void __rseq_handle_notify_resume(struct
 		t->rseq_event_pending = false;
 	}
 
-	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
-		ret = rseq_ip_fixup(regs, event);
-		if (unlikely(ret < 0))
-			goto error;
-	}
+	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+		return;
+
+	ret = rseq_ip_fixup(regs, event);
+	if (unlikely(ret < 0))
+		goto error;
 
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
From nobody Fri Oct 3 21:53:25 2025
From: Thomas Gleixner
To: LKML
Subject: [patch V2 09/37] rseq: Introduce struct rseq_event
Date: Sat, 23 Aug 2025 18:39:27 +0200 (CEST)

In preparation for a major rewrite of this code, provide a data structure
for event management.

Put the sched_switch event and an indicator for RSEQ on a task into it as
a start. That uses a union, which allows masking and clearing the whole
lot efficiently.

The indicators are explicitly not a bit field, as bit fields generate
abysmal code. The boolean members are defined as u8 because that actually
guarantees that they fit; there seem to be strange architecture ABIs
which need more than 8 bits for a boolean.

The has_rseq member is redundant vs. task::rseq, but it turns out that
boolean operations and quick checks on the union generate better code
than fiddling with separate entities and data types.

This struct will be extended over time to carry more information.
Signed-off-by: Thomas Gleixner
---
 include/linux/rseq.h       |   23 ++++++++++++-----------
 include/linux/rseq_types.h |   30 ++++++++++++++++++++++++++++++
 include/linux/sched.h      |    7 ++-----
 kernel/rseq.c              |    6 ++++--
 4 files changed, 48 insertions(+), 18 deletions(-)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -9,22 +9,22 @@ void __rseq_handle_notify_resume(struct
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
-	if (current->rseq)
+	if (current->rseq_event.has_rseq)
 		__rseq_handle_notify_resume(NULL, regs);
 }
 
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
-	if (current->rseq) {
-		current->rseq_event_pending = true;
+	if (current->rseq_event.has_rseq) {
+		current->rseq_event.sched_switch = true;
 		__rseq_handle_notify_resume(ksig, regs);
 	}
 }
 
 static inline void rseq_sched_switch_event(struct task_struct *t)
 {
-	if (t->rseq) {
-		t->rseq_event_pending = true;
+	if (t->rseq_event.has_rseq) {
+		t->rseq_event.sched_switch = true;
 		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
 	}
 }
@@ -32,8 +32,9 @@ static inline void rseq_sched_switch_eve
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
-		if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
-			current->rseq_event_pending = false;
+		if (WARN_ON_ONCE(current->rseq_event.has_rseq &&
+				 current->rseq_event.events))
+			current->rseq_event.events = 0;
 	}
 }
 
@@ -49,7 +50,7 @@ static __always_inline void rseq_exit_to
  */
 static inline void rseq_virt_userspace_exit(void)
 {
-	if (current->rseq_event_pending)
+	if (current->rseq_event.sched_switch)
 		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
 }
 
@@ -63,12 +64,12 @@ static inline void rseq_fork(struct task
 		t->rseq = NULL;
 		t->rseq_len = 0;
 		t->rseq_sig = 0;
-		t->rseq_event_pending = false;
+		t->rseq_event.all = 0;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
-		t->rseq_event_pending = current->rseq_event_pending;
+		t->rseq_event = current->rseq_event;
 	}
 }
 
@@ -77,7 +78,7 @@ static inline void rseq_execve(struct ta
 	t->rseq = NULL;
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
-	t->rseq_event_pending = false;
+	t->rseq_event.all = 0;
 }
 
 #else /* CONFIG_RSEQ */
--- /dev/null
+++ b/include/linux/rseq_types.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_TYPES_H
+#define _LINUX_RSEQ_TYPES_H
+
+#include
+
+/*
+ * struct rseq_event - Storage for rseq related event management
+ * @all:		Compound to initialize and clear the data efficiently
+ * @events:		Compund to access events with a single load/store
+ * @sched_switch:	True if the task was scheduled out
+ * @has_rseq:		True if the task has a rseq pointer installed
+ */
+struct rseq_event {
+	union {
+		u32	all;
+		struct {
+			union {
+				u16	events;
+				struct {
+					u8	sched_switch;
+				};
+			};
+
+			u8	has_rseq;
+		};
+	};
+};
+
+#endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -41,6 +41,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1404,11 +1405,7 @@ struct task_struct {
 	struct rseq __user *rseq;
 	u32 rseq_len;
 	u32 rseq_sig;
-	/*
-	 * RmW on rseq_event_pending must be performed atomically
-	 * with respect to preemption.
-	 */
-	bool rseq_event_pending;
+	struct rseq_event rseq_event;
 # ifdef CONFIG_DEBUG_RSEQ
 	/*
 	 * This is a place holder to save a copy of the rseq fields for
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -460,8 +460,8 @@ void __rseq_handle_notify_resume(struct
 	 * inconsistencies.
 	 */
 	scoped_guard(RSEQ_EVENT_GUARD) {
-		event = t->rseq_event_pending;
-		t->rseq_event_pending = false;
+		event = t->rseq_event.sched_switch;
+		t->rseq_event.sched_switch = false;
 	}
 
 	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -523,6 +523,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 		current->rseq = NULL;
 		current->rseq_sig = 0;
 		current->rseq_len = 0;
+		current->rseq_event.all = 0;
 		return 0;
 	}
 
@@ -595,6 +596,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	 * registered, ensure the cpu_id_start and cpu_id fields
 	 * are updated before returning to user-space.
 	 */
+	current->rseq_event.has_rseq = true;
 	rseq_sched_switch_event(current);
 
 	return 0;
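The layout above can be exercised as a standalone user space model to see
what the compound members buy. This is a sketch for illustration only,
not kernel code:

	#include <stdint.h>
	#include <stdio.h>

	struct rseq_event_model {
		union {
			uint32_t all;
			struct {
				union {
					uint16_t events;
					struct {
						uint8_t sched_switch;
					};
				};
				uint8_t has_rseq;
			};
		};
	};

	int main(void)
	{
		struct rseq_event_model ev = { .all = 0 };

		ev.has_rseq = 1;
		ev.sched_switch = 1;

		/* One 16-bit load tests all event members at once */
		if (ev.events)
			printf("event pending, has_rseq=%u\n",
			       (unsigned int)ev.has_rseq);

		/* One 32-bit store clears events and has_rseq together */
		ev.all = 0;
		return 0;
	}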
From nobody Fri Oct 3 21:53:25 2025
From: Thomas Gleixner
To: LKML
Subject: [patch V2 10/37] entry: Cleanup header
Date: Sat, 23 Aug 2025 18:39:29 +0200 (CEST)

Clean up the include ordering, kernel-doc and other trivialities before
making further changes.

Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Reviewed-by: Mathieu Desnoyers
---
 include/linux/entry-common.h     |    8 ++++----
 include/linux/irq-entry-common.h |    2 ++
 2 files changed, 6 insertions(+), 4 deletions(-)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -3,11 +3,11 @@
 #define __LINUX_ENTRYCOMMON_H
 
 #include
+#include
 #include
+#include
 #include
 #include
-#include
-#include
 
 #include
 #include
@@ -37,6 +37,7 @@
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
 				 ARCH_SYSCALL_WORK_ENTER)
+
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
@@ -61,8 +62,7 @@
  */
 void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
 
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
-			 unsigned long work);
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
 
 /**
  * syscall_enter_from_user_mode_work - Check and handle work before invoking
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -68,6 +68,7 @@ static __always_inline bool arch_in_rcu_
 
 /**
  * enter_from_user_mode - Establish state when coming from user mode
+ * @regs:	Pointer to currents pt_regs
  *
  * Syscall/interrupt entry disables interrupts, but user mode is traced as
  * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@@ -357,6 +358,7 @@ irqentry_state_t noinstr irqentry_enter(
  * Conditional reschedule with additional sanity checks.
  */
 void raw_irqentry_exit_cond_resched(void);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
McKenney" , Boqun Feng , Paolo Bonzini , Sean Christopherson , Wei Liu , Dexuan Cui , Arnd Bergmann , Heiko Carstens , Christian Borntraeger , Sven Schnelle , Huacai Chen , Paul Walmsley , Palmer Dabbelt Subject: [patch V2 11/37] entry: Remove syscall_enter_from_user_mode_prepare() References: <20250823161326.635281786@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Sat, 23 Aug 2025 18:39:32 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Open code the only user in the x86 syscall code and reduce the zoo of functions. Signed-off-by: Thomas Gleixner Cc: x86@kernel.org --- arch/x86/entry/syscall_32.c | 3 ++- include/linux/entry-common.h | 26 +++++--------------------- kernel/entry/syscall-common.c | 8 -------- 3 files changed, 7 insertions(+), 30 deletions(-) --- a/arch/x86/entry/syscall_32.c +++ b/arch/x86/entry/syscall_32.c @@ -274,9 +274,10 @@ static noinstr bool __do_fast_syscall_32 * fetch EBP before invoking any of the syscall entry work * functions. */ - syscall_enter_from_user_mode_prepare(regs); + enter_from_user_mode(regs); =20 instrumentation_begin(); + local_irq_enable(); /* Fetch EBP from where the vDSO stashed it. */ if (IS_ENABLED(CONFIG_X86_64)) { /* --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -45,23 +45,6 @@ SYSCALL_WORK_SYSCALL_EXIT_TRAP | \ ARCH_SYSCALL_WORK_EXIT) =20 -/** - * syscall_enter_from_user_mode_prepare - Establish state and enable inter= rupts - * @regs: Pointer to currents pt_regs - * - * Invoked from architecture specific syscall entry code with interrupts - * disabled. The calling code has to be non-instrumentable. When the - * function returns all state is correct, interrupts are enabled and the - * subsequent functions can be instrumented. - * - * This handles lockdep, RCU (context tracking) and tracing state, i.e. - * the functionality provided by enter_from_user_mode(). - * - * This is invoked when there is extra architecture specific functionality - * to be done between establishing state and handling user mode entry work. - */ -void syscall_enter_from_user_mode_prepare(struct pt_regs *regs); - long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long= work); =20 /** @@ -71,8 +54,8 @@ long syscall_trace_enter(struct pt_regs * @syscall: The syscall number * * Invoked from architecture specific syscall entry code with interrupts - * enabled after invoking syscall_enter_from_user_mode_prepare() and extra - * architecture specific work. + * enabled after invoking enter_from_user_mode(), enabling interrupts and + * extra architecture specific work. * * Returns: The original or a modified syscall number * @@ -108,8 +91,9 @@ static __always_inline long syscall_ente * function returns all state is correct, interrupts are enabled and the * subsequent functions can be instrumented. * - * This is combination of syscall_enter_from_user_mode_prepare() and - * syscall_enter_from_user_mode_work(). + * This is the combination of enter_from_user_mode() and + * syscall_enter_from_user_mode_work() to be used when there is no + * architecture specific work to be done between the two. * * Returns: The original or a modified syscall number. See * syscall_enter_from_user_mode_work() for further explanation. --- a/kernel/entry/syscall-common.c +++ b/kernel/entry/syscall-common.c @@ -63,14 +63,6 @@ long syscall_trace_enter(struct pt_regs return ret ? 
: syscall; } =20 -noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) -{ - enter_from_user_mode(regs); - instrumentation_begin(); - local_irq_enable(); - instrumentation_end(); -} - /* * If SYSCALL_EMU is set, then the only reason to report is when * SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 555272ED15A for ; Sat, 23 Aug 2025 16:39:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967178; cv=none; b=nMztt7Jr3TkPpz7t8dYpCMOTLT+mOJwKYW75C8G41WsdASWjmfjYwMunZRQ+h+rovNErOkSJxWnpel3hD19u2NgT/lfHGu+CTPyBd6Lg8xJeQRizjOywOXXrAeHWUlmlvvGo5yArElX6AJGkNvdQVorbOXOKoTSaef7Fk4xZQLM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967178; c=relaxed/simple; bh=McxjK3YF4Tk0aC4xps1tzuK+1uuQYOoVjRa/x0Onvgk=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=oi8k0WTP0r9gXULwIWTWx0zlCJ/uCf1vicfZXIAP7rmn6cQm+j5N80YWrXLSS3xo/tGik/xuzS+xCSlM2fLfA350iYkReLy4/ONYNU8tvC3dnjNdGbsnIcyWzHXjWAgzTbjXo4VYcdwzH0oxC5//W7cVOXnjwyeBp3W5ECnAfbc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=0gCHuE4R; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=mY3IBPWP; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="0gCHuE4R"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="mY3IBPWP" Message-ID: <20250823161654.038904706@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1755967175; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=bXk/MLZ59A3tXGHauPbjPR/7KVfuJVMFZfmg9sxAdmE=; b=0gCHuE4RZRRREvcViajQ433BhvQ+p+oNSmPk2JaWLnCVf5pcWl9C0x5SToXS1w5T9v5tcF GhD18hqX/7+mikc3VU8zKav9rXFhv60UvfHZSZUeHK9g7arOKWxZyN7WEGA2T1Tb//KmEp 4I56rdEmdRN6vSRVuOeyfEe9gvkFY3Hdkfvj+R3Teha/FZAm4F0GFPPimT0kPFTh57Rb1d PvliT2WK5+gopTnRtjiYbUxnnf/2KdXu3dR7Ty8WpTWfWikc145m8OchwmovzTfcLjaiqB 8oS89R/fi/6/+5KpHTaLMFTOwEt27w2pd1QIsTORAqJUv6kLGZ8SCzz+ugudkA== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1755967175; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=bXk/MLZ59A3tXGHauPbjPR/7KVfuJVMFZfmg9sxAdmE=; b=mY3IBPWPV5HkX56rM1yOw76MQoU2yjzCo56V53UnksfApNVPC9rAsRQhubiBqi0djzEvU9 nVFVv2DdhWWoESAA== From: Thomas Gleixner To: LKML Cc: Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. 
McKenney" , Boqun Feng , Paolo Bonzini , Sean Christopherson , Wei Liu , Dexuan Cui , x86@kernel.org, Arnd Bergmann , Heiko Carstens , Christian Borntraeger , Sven Schnelle , Huacai Chen , Paul Walmsley , Palmer Dabbelt Subject: [patch V2 12/37] entry: Inline irqentry_enter/exit_from/to_user_mode() References: <20250823161326.635281786@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Sat, 23 Aug 2025 18:39:34 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" There is no point to have this as a function which just inlines enter_from_user_mode(). The function call overhead is larger than the function itself. Signed-off-by: Thomas Gleixner --- include/linux/irq-entry-common.h | 13 +++++++++++-- kernel/entry/common.c | 13 ------------- 2 files changed, 11 insertions(+), 15 deletions(-) --- a/include/linux/irq-entry-common.h +++ b/include/linux/irq-entry-common.h @@ -278,7 +278,10 @@ static __always_inline void exit_to_user * * The function establishes state (lockdep, RCU (context tracking), tracin= g) */ -void irqentry_enter_from_user_mode(struct pt_regs *regs); +static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *= regs) +{ + enter_from_user_mode(regs); +} =20 /** * irqentry_exit_to_user_mode - Interrupt exit work @@ -293,7 +296,13 @@ void irqentry_enter_from_user_mode(struc * Interrupt exit is not invoking #1 which is the syscall specific one time * work. */ -void irqentry_exit_to_user_mode(struct pt_regs *regs); +static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *reg= s) +{ + instrumentation_begin(); + exit_to_user_mode_prepare(regs); + instrumentation_end(); + exit_to_user_mode(); +} =20 #ifndef irqentry_state /** --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -62,19 +62,6 @@ void __weak arch_do_signal_or_restart(st return ti_work; } =20 -noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) -{ - enter_from_user_mode(regs); -} - -noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) -{ - instrumentation_begin(); - exit_to_user_mode_prepare(regs); - instrumentation_end(); - exit_to_user_mode(); -} - noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) { irqentry_state_t ret =3D { From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2083E2ED86F for ; Sat, 23 Aug 2025 16:39:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967182; cv=none; b=J9zUePtByDvpFCisAtagkHU0XGTx7WaVQMMN4kPHBMXWnBhZKM07DDN2C5lY1JeJERYaZqEPsdnaKN8dCExDdu/4Z1ADK3OlsTkIHnrxetKnus4XpbYTuUlja1+mCd9ksmbUPOZ1Wska7xY7mKhwnHbjhCFOjBLSB4ED8EsLUjM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967182; c=relaxed/simple; bh=zWogiLI32y56lyIiiKjVJfTGiLOGPR+uyhBbkbNIWZU=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=B9n3FBbVd7uYdQm6dtEqlKQO14jEeexnrqq/fFv2BIir+WowmMLI/o/Eym2CjViCmsdy4cWuBuYyKqhZi5rwDcdefxwdbA+uglKl16TCcak/z4TxgyZr9LplxryDLyr/OM0DkX2rRwmUAhs8AEjNvvUTlotAnPz4xCEhrCxDUmQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; 
From nobody Fri Oct 3 21:53:25 2025
From: Thomas Gleixner
To: LKML
Subject: [patch V2 13/37] sched: Move MM CID related functions to sched.h
Date: Sat, 23 Aug 2025 18:39:37 +0200 (CEST)

There is nothing mm specific in that and including mm.h can cause header
recursion hell.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/mm.h    |   25 -------------------------
 include/linux/sched.h |   26 ++++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 25 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2310,31 +2310,6 @@ struct zap_details {
 /* Set in unmap_vmas() to indicate a final unmap call. Only used by hugetlb */
 #define ZAP_FLAG_UNMAP		((__force zap_flags_t) BIT(1))
 
-#ifdef CONFIG_SCHED_MM_CID
-void sched_mm_cid_before_execve(struct task_struct *t);
-void sched_mm_cid_after_execve(struct task_struct *t);
-void sched_mm_cid_fork(struct task_struct *t);
-void sched_mm_cid_exit_signals(struct task_struct *t);
-static inline int task_mm_cid(struct task_struct *t)
-{
-	return t->mm_cid;
-}
-#else
-static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_fork(struct task_struct *t) { }
-static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
-static inline int task_mm_cid(struct task_struct *t)
-{
-	/*
-	 * Use the processor id as a fall-back when the mm cid feature is
-	 * disabled. This provides functional per-cpu data structure accesses
-	 * in user-space, althrough it won't provide the memory usage benefits.
-	 */
-	return raw_smp_processor_id();
-}
-#endif
-
 #ifdef CONFIG_MMU
 extern bool can_do_mlock(void);
 #else
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2309,4 +2309,30 @@ static __always_inline void alloc_tag_re
 #define alloc_tag_restore(_tag, _old) do {} while (0)
 #endif
 
+/* Avoids recursive inclusion hell */
+#ifdef CONFIG_SCHED_MM_CID
+void sched_mm_cid_before_execve(struct task_struct *t);
+void sched_mm_cid_after_execve(struct task_struct *t);
+void sched_mm_cid_fork(struct task_struct *t);
+void sched_mm_cid_exit_signals(struct task_struct *t);
+static inline int task_mm_cid(struct task_struct *t)
+{
+	return t->mm_cid;
+}
+#else
+static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_fork(struct task_struct *t) { }
+static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
+static inline int task_mm_cid(struct task_struct *t)
+{
+	/*
+	 * Use the processor id as a fall-back when the mm cid feature is
+	 * disabled. This provides functional per-cpu data structure accesses
+	 * in user-space, althrough it won't provide the memory usage benefits.
+	 */
+	return task_cpu(t);
+}
+#endif
+
 #endif
McKenney" , Boqun Feng , Paolo Bonzini , Sean Christopherson , Wei Liu , Dexuan Cui , x86@kernel.org, Arnd Bergmann , Heiko Carstens , Christian Borntraeger , Sven Schnelle , Huacai Chen , Paul Walmsley , Palmer Dabbelt Subject: [patch V2 14/37] rseq: Cache CPU ID and MM CID values References: <20250823161326.635281786@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Sat, 23 Aug 2025 18:39:39 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" In preparation for rewriting RSEQ exit to user space handling provide storage to cache the CPU ID and MM CID values which were written to user space. That prepares for a quick check, which avoids the update when nothing changed. Signed-off-by: Thomas Gleixner --- include/linux/rseq.h | 3 +++ include/linux/rseq_types.h | 19 +++++++++++++++++++ include/linux/sched.h | 1 + include/trace/events/rseq.h | 4 ++-- kernel/rseq.c | 4 ++++ 5 files changed, 29 insertions(+), 2 deletions(-) --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -64,11 +64,13 @@ static inline void rseq_fork(struct task t->rseq =3D NULL; t->rseq_len =3D 0; t->rseq_sig =3D 0; + t->rseq_ids.cpu_cid =3D ~0ULL; t->rseq_event.all =3D 0; } else { t->rseq =3D current->rseq; t->rseq_len =3D current->rseq_len; t->rseq_sig =3D current->rseq_sig; + t->rseq_ids.cpu_cid =3D ~0ULL; t->rseq_event =3D current->rseq_event; } } @@ -78,6 +80,7 @@ static inline void rseq_execve(struct ta t->rseq =3D NULL; t->rseq_len =3D 0; t->rseq_sig =3D 0; + t->rseq_ids.cpu_cid =3D ~0ULL; t->rseq_event.all =3D 0; } =20 --- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -27,4 +27,23 @@ struct rseq_event { }; }; =20 +/* + * struct rseq_ids - Cache for ids, which need to be updated + * @cpu_cid: Compound of @cpu_id and @mm_cid to make the + * compiler emit a single compare on 64-bit + * @cpu_id: The CPU ID which was written last to user space + * @mm_cid: The MM CID which was written last to user space + * + * @cpu_id and @mm_cid are updated when the data is written to user space. 
+ */ +struct rseq_ids { + union { + u64 cpu_cid; + struct { + u32 cpu_id; + u32 mm_cid; + }; + }; +}; + #endif --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1406,6 +1406,7 @@ struct task_struct { u32 rseq_len; u32 rseq_sig; struct rseq_event rseq_event; + struct rseq_ids rseq_ids; # ifdef CONFIG_DEBUG_RSEQ /* * This is a place holder to save a copy of the rseq fields for --- a/include/trace/events/rseq.h +++ b/include/trace/events/rseq.h @@ -21,9 +21,9 @@ TRACE_EVENT(rseq_update, ), =20 TP_fast_assign( - __entry->cpu_id =3D raw_smp_processor_id(); + __entry->cpu_id =3D t->rseq_ids.cpu_id; __entry->node_id =3D cpu_to_node(__entry->cpu_id); - __entry->mm_cid =3D task_mm_cid(t); + __entry->mm_cid =3D t->rseq_ids.mm_cid; ), =20 TP_printk("cpu_id=3D%d node_id=3D%d mm_cid=3D%d", __entry->cpu_id, --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -184,6 +184,10 @@ static int rseq_update_cpu_node_id(struc rseq_unsafe_put_user(t, node_id, node_id, efault_end); rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end); =20 + /* Cache the user space values */ + t->rseq_ids.cpu_id =3D cpu_id; + t->rseq_ids.mm_cid =3D mm_cid; + /* * Additional feature fields added after ORIG_RSEQ_SIZE * need to be conditionally updated only if From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 566082EE601 for ; Sat, 23 Aug 2025 16:39:45 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967186; cv=none; b=qdBVKOlM3sVDfFB9WudsAZfRbn7M9HzAM0h3c5j7GSsw4Y0VykwwbR0f6S1SvobugBQWC7WNmi7UEzhO2UFh9sVx6nLRY9VCNPdio4/cxsLcWXTxB2EkOkp8bVabxBABxDcCAkLeG60GdsrQ9hvMjU/QwbZOYfczR0UqkCRVPrs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967186; c=relaxed/simple; bh=WLNIqpOPg/1O65BOvdLaOKdkw3oEkWVH+0B2zEhNcxg=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=WfFbSRBCX0CsW2caJyhXHpMWg8QyUruM96rROdiWTVRUPhbIVTEqTeRns7yBpyOGoT66rV99TYkP+Z7FIIivjXKL6l7k1VEzSukaT12K9V7yrPshJI7Ht/ipOE6Y5ylXHTg3baINfm0lF8ZhTk3/7OL/lt8BcqqP7Oyq2pY2s3Q= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=VjlB2m4f; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=X2uSI2XR; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="VjlB2m4f"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="X2uSI2XR" Message-ID: <20250823161654.228227253@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1755967183; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=hhHH3KXKzGJtzk15V3M02LNMhmwtvsKIBEuNiLXnbWk=; 
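The changelog only introduces the storage; the quick check it prepares
for is presumably along these lines (an assumed sketch, not part of this
patch):

	/* Sketch: detect a change of both IDs with a single compare */
	static inline bool rseq_ids_changed(struct task_struct *t,
					    u32 cpu_id, u32 mm_cid)
	{
		struct rseq_ids new_ids = { .cpu_id = cpu_id, .mm_cid = mm_cid };

		/* One 64-bit compare instead of two 32-bit compares */
		return t->rseq_ids.cpu_cid != new_ids.cpu_cid;
	}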
From nobody Fri Oct 3 21:53:25 2025
From: Thomas Gleixner
To: LKML
Subject: [patch V2 15/37] rseq: Record interrupt from user space
Date: Sat, 23 Aug 2025 18:39:42 +0200 (CEST)

For RSEQ the only relevant reason to inspect and eventually fixup (abort)
user space critical sections is when user space was interrupted and the
task was scheduled out. If the user to kernel entry was from a syscall no
fixup is required. If user space invokes a syscall from a critical
section it can keep the pieces as documented.

This is only supported on architectures which utilize the generic entry
code. If your architecture does not use it, bad luck.

Signed-off-by: Thomas Gleixner
---
 include/linux/irq-entry-common.h |    3 ++-
 include/linux/rseq.h             |   16 +++++++++++-----
 include/linux/rseq_entry.h       |   18 ++++++++++++++++++
 include/linux/rseq_types.h       |    2 ++
 4 files changed, 33 insertions(+), 6 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -4,7 +4,7 @@
 
 #include
 #include
-#include
+#include
 #include
 #include
 #include
@@ -281,6 +281,7 @@ static __always_inline void exit_to_user
 static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
 {
 	enter_from_user_mode(regs);
+	rseq_note_user_irq_entry();
 }
 
 /**
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -31,11 +31,17 @@ static inline void rseq_sched_switch_eve
 
 static __always_inline void rseq_exit_to_user_mode(void)
 {
-	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
-		if (WARN_ON_ONCE(current->rseq_event.has_rseq &&
-				 current->rseq_event.events))
-			current->rseq_event.events = 0;
-	}
+	struct rseq_event *ev = &current->rseq_event;
+
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+		WARN_ON_ONCE(ev->sched_switch);
+
+	/*
+	 * Ensure that event (especially user_irq) is cleared when the
+	 * interrupt did not result in a schedule and therefore the
+	 * rseq processing did not clear it.
+	 */
+	ev->events = 0;
 }
 
 /*
--- /dev/null
+++ b/include/linux/rseq_entry.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_ENTRY_H
+#define _LINUX_RSEQ_ENTRY_H
+
+#ifdef CONFIG_RSEQ
+#include
+
+static __always_inline void rseq_note_user_irq_entry(void)
+{
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
+		current->rseq_event.user_irq = true;
+}
+
+#else /* CONFIG_RSEQ */
+static inline void rseq_note_user_irq_entry(void) { }
+#endif /* !CONFIG_RSEQ */
+
+#endif /* _LINUX_RSEQ_ENTRY_H */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -9,6 +9,7 @@
  * @all:		Compound to initialize and clear the data efficiently
  * @events:		Compund to access events with a single load/store
  * @sched_switch:	True if the task was scheduled out
+ * @user_irq:		True on interrupt entry from user mode
  * @has_rseq:		True if the task has a rseq pointer installed
  */
 struct rseq_event {
@@ -19,6 +20,7 @@ struct rseq_event {
 				u16	events;
 				struct {
 					u8	sched_switch;
+					u8	user_irq;
 				};
 			};
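The intended consumer of the new bit is only described in the changelog;
the check it enables should look roughly like this (assumed shape, the
actual user arrives later in the series):

	/*
	 * Sketch: critical section fixup is only needed when the task
	 * entered the kernel from user space via an interrupt and was
	 * scheduled out. A syscall issued from inside a critical
	 * section gets no fixup, as documented.
	 */
	static inline bool rseq_needs_cs_fixup(struct task_struct *t)
	{
		return t->rseq_event.user_irq && t->rseq_event.sched_switch;
	}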
From nobody Fri Oct 3 21:53:25 2025
From: Thomas Gleixner
To: LKML
Subject: [patch V2 16/37] rseq: Provide tracepoint wrappers for inline code
Date: Sat, 23 Aug 2025 18:39:45 +0200 (CEST)

Provide tracepoint wrappers for the upcoming RSEQ exit to user space
inline fast path, so that the header can be safely included by code which
defines actual trace points.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/rseq_entry.h |   30 ++++++++++++++++++++++++++++++
 kernel/rseq.c              |   17 +++++++++++++++++
 2 files changed, 47 insertions(+)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -5,6 +5,36 @@
 #ifdef CONFIG_RSEQ
 #include
 
+#include
+
+#ifdef CONFIG_TRACEPOINTS
+DECLARE_TRACEPOINT(rseq_update);
+DECLARE_TRACEPOINT(rseq_ip_fixup);
+void __rseq_trace_update(struct task_struct *t);
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+			   unsigned long offset, unsigned long abort_ip);
+
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids)
+{
+	if (tracepoint_enabled(rseq_update)) {
+		if (ids)
+			__rseq_trace_update(t);
+	}
+}
+
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+				       unsigned long offset, unsigned long abort_ip)
+{
+	if (tracepoint_enabled(rseq_ip_fixup))
+		__rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+
+#else /* CONFIG_TRACEPOINT */
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids) { }
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+				       unsigned long offset, unsigned long abort_ip) { }
+#endif /* !CONFIG_TRACEPOINT */
+
 static __always_inline void rseq_note_user_irq_entry(void)
 {
 	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -91,6 +91,23 @@
 				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL |	\
 				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
 
+#ifdef CONFIG_TRACEPOINTS
+/*
+ * Out of line, so the actual update functions can be in a header to be
+ * inlined into the exit to user code.
+ */
+void __rseq_trace_update(struct task_struct *t)
+{
+	trace_rseq_update(t);
+}
+
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+			   unsigned long offset, unsigned long abort_ip)
+{
+	trace_rseq_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+#endif /* CONFIG_TRACEPOINTS */
+
 #ifdef CONFIG_DEBUG_RSEQ
 static struct rseq *rseq_kernel_fields(struct task_struct *t)
 {
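A sketch of the call-site pattern this enables: the static-key test via
tracepoint_enabled() stays inline in the fast path, and the out-of-line
call only happens when the tracepoint is actually enabled. The
surrounding function is a placeholder, not code from this series:

	static __always_inline void rseq_set_ids_sketch(struct task_struct *t,
							struct rseq_ids *ids)
	{
		t->rseq_ids = *ids;
		/* Compiles down to a patched-out branch when tracing is off */
		rseq_trace_update(t, ids);
	}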
McKenney" , Boqun Feng , Paolo Bonzini , Sean Christopherson , Wei Liu , Dexuan Cui , x86@kernel.org, Arnd Bergmann , Heiko Carstens , Christian Borntraeger , Sven Schnelle , Huacai Chen , Paul Walmsley , Palmer Dabbelt Subject: [patch V2 17/37] rseq: Expose lightweight statistics in debugfs References: <20250823161326.635281786@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Sat, 23 Aug 2025 18:39:48 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Analyzing the call frequency without actually using tracing is helpful for analysis of this infrastructure. The overhead is minimal as it just increments a per CPU counter associated to each operation. The debugfs readout provides a racy sum of all counters. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/rseq.h | 16 --------- include/linux/rseq_entry.h | 49 +++++++++++++++++++++++++++ init/Kconfig | 12 ++++++ kernel/rseq.c | 79 ++++++++++++++++++++++++++++++++++++++++= +---- 4 files changed, 133 insertions(+), 23 deletions(-) --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -29,21 +29,6 @@ static inline void rseq_sched_switch_eve } } =20 -static __always_inline void rseq_exit_to_user_mode(void) -{ - struct rseq_event *ev =3D ¤t->rseq_event; - - if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) - WARN_ON_ONCE(ev->sched_switch); - - /* - * Ensure that event (especially user_irq) is cleared when the - * interrupt did not result in a schedule and therefore the - * rseq processing did not clear it. - */ - ev->events =3D 0; -} - /* * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode, * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in @@ -97,7 +82,6 @@ static inline void rseq_sched_switch_eve static inline void rseq_virt_userspace_exit(void) { } static inline void rseq_fork(struct task_struct *t, unsigned long clone_fl= ags) { } static inline void rseq_execve(struct task_struct *t) { } -static inline void rseq_exit_to_user_mode(void) { } #endif /* !CONFIG_RSEQ */ =20 #ifdef CONFIG_DEBUG_RSEQ --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -2,6 +2,37 @@ #ifndef _LINUX_RSEQ_ENTRY_H #define _LINUX_RSEQ_ENTRY_H =20 +/* Must be outside the CONFIG_RSEQ guard to resolve the stubs */ +#ifdef CONFIG_RSEQ_STATS +#include + +struct rseq_stats { + unsigned long exit; + unsigned long signal; + unsigned long slowpath; + unsigned long ids; + unsigned long cs; + unsigned long clear; + unsigned long fixup; +}; + +DECLARE_PER_CPU(struct rseq_stats, rseq_stats); + +/* + * Slow path has interrupts and preemption enabled, but the fast path + * runs with interrupts disabled so there is no point in having the + * preemption checks implied in __this_cpu_inc() for every operation. 
+ */ +#ifdef RSEQ_BUILD_SLOW_PATH +#define rseq_stat_inc(which) this_cpu_inc((which)) +#else +#define rseq_stat_inc(which) raw_cpu_inc((which)) +#endif + +#else /* CONFIG_RSEQ_STATS */ +#define rseq_stat_inc(x) do { } while (0) +#endif /* !CONFIG_RSEQ_STATS */ + #ifdef CONFIG_RSEQ #include =20 @@ -41,8 +72,26 @@ static __always_inline void rseq_note_us current->rseq_event.user_irq =3D true; } =20 +static __always_inline void rseq_exit_to_user_mode(void) +{ + struct rseq_event *ev =3D ¤t->rseq_event; + + rseq_stat_inc(rseq_stats.exit); + + if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) + WARN_ON_ONCE(ev->sched_switch); + + /* + * Ensure that event (especially user_irq) is cleared when the + * interrupt did not result in a schedule and therefore the + * rseq processing did not clear it. + */ + ev->events =3D 0; +} + #else /* CONFIG_RSEQ */ static inline void rseq_note_user_irq_entry(void) { } +static inline void rseq_exit_to_user_mode(void) { } #endif /* !CONFIG_RSEQ */ =20 #endif /* _LINUX_RSEQ_ENTRY_H */ --- a/init/Kconfig +++ b/init/Kconfig @@ -1883,6 +1883,18 @@ config RSEQ =20 If unsure, say Y. =20 +config RSEQ_STATS + default n + bool "Enable lightweight statistics of restartable sequences" if EXPERT + depends on RSEQ && DEBUG_FS + help + Enable lightweight counters which expose information about the + frequency of RSEQ operations via debugfs. Mostly interesting for + kernel debugging or performance analysis. While lightweight it's + still adding code into the user/kernel mode transitions. + + If unsure, say N. + config DEBUG_RSEQ default n bool "Enable debugging of rseq() system call" if EXPERT --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -67,12 +67,16 @@ * F1. */ =20 +/* Required to select the proper per_cpu ops for rseq_stats_inc() */ +#define RSEQ_BUILD_SLOW_PATH + +#include +#include +#include #include -#include #include -#include +#include #include -#include #include =20 #define CREATE_TRACE_POINTS @@ -108,6 +112,56 @@ void __rseq_trace_ip_fixup(unsigned long } #endif /* CONFIG_TRACEPOINTS */ =20 +#ifdef CONFIG_RSEQ_STATS +DEFINE_PER_CPU(struct rseq_stats, rseq_stats); + +static int rseq_debug_show(struct seq_file *m, void *p) +{ + struct rseq_stats stats =3D { }; + unsigned int cpu; + + for_each_possible_cpu(cpu) { + stats.exit +=3D data_race(per_cpu(rseq_stats.exit, cpu)); + stats.signal +=3D data_race(per_cpu(rseq_stats.signal, cpu)); + stats.slowpath +=3D data_race(per_cpu(rseq_stats.slowpath, cpu)); + stats.ids +=3D data_race(per_cpu(rseq_stats.ids, cpu)); + stats.cs +=3D data_race(per_cpu(rseq_stats.cs, cpu)); + stats.clear +=3D data_race(per_cpu(rseq_stats.clear, cpu)); + stats.fixup +=3D data_race(per_cpu(rseq_stats.fixup, cpu)); + } + + seq_printf(m, "exit: %16lu\n", stats.exit); + seq_printf(m, "signal: %16lu\n", stats.signal); + seq_printf(m, "slowp: %16lu\n", stats.slowpath); + seq_printf(m, "ids: %16lu\n", stats.ids); + seq_printf(m, "cs: %16lu\n", stats.cs); + seq_printf(m, "clear: %16lu\n", stats.clear); + seq_printf(m, "fixup: %16lu\n", stats.fixup); + return 0; +} + +static int rseq_debug_open(struct inode *inode, struct file *file) +{ + return single_open(file, rseq_debug_show, inode->i_private); +} + +static const struct file_operations dfs_ops =3D { + .open =3D rseq_debug_open, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D single_release, +}; + +static int __init rseq_debugfs_init(void) +{ + struct dentry *root_dir =3D debugfs_create_dir("rseq", NULL); + + debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops); + return 0; +} 
+__initcall(rseq_debugfs_init); +#endif /* CONFIG_RSEQ_STATS */ + #ifdef CONFIG_DEBUG_RSEQ static struct rseq *rseq_kernel_fields(struct task_struct *t) { @@ -187,12 +241,13 @@ static int rseq_update_cpu_node_id(struc u32 node_id =3D cpu_to_node(cpu_id); u32 mm_cid =3D task_mm_cid(t); =20 - /* - * Validate read-only rseq fields. - */ + rseq_stat_inc(rseq_stats.ids); + + /* Validate read-only rseq fields on debug kernels */ if (rseq_validate_ro_fields(t)) goto efault; WARN_ON_ONCE((int) mm_cid < 0); + if (!user_write_access_begin(rseq, t->rseq_len)) goto efault; =20 @@ -403,6 +458,8 @@ static int rseq_ip_fixup(struct pt_regs struct rseq_cs rseq_cs; int ret; =20 + rseq_stat_inc(rseq_stats.cs); + ret =3D rseq_get_rseq_cs(t, &rseq_cs); if (ret) return ret; @@ -412,8 +469,10 @@ static int rseq_ip_fixup(struct pt_regs * If not nested over a rseq critical section, restart is useless. * Clear the rseq_cs pointer and return. */ - if (!in_rseq_cs(ip, &rseq_cs)) + if (!in_rseq_cs(ip, &rseq_cs)) { + rseq_stat_inc(rseq_stats.clear); return clear_rseq_cs(t->rseq); + } ret =3D rseq_check_flags(t, rseq_cs.flags); if (ret < 0) return ret; @@ -422,6 +481,7 @@ static int rseq_ip_fixup(struct pt_regs ret =3D clear_rseq_cs(t->rseq); if (ret) return ret; + rseq_stat_inc(rseq_stats.fixup); trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset, rseq_cs.abort_ip); instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip); @@ -462,6 +522,11 @@ void __rseq_handle_notify_resume(struct if (unlikely(t->flags & PF_EXITING)) return; =20 + if (ksig) + rseq_stat_inc(rseq_stats.signal); + else + rseq_stat_inc(rseq_stats.slowpath); + /* * Read and clear the event pending bit first. If the task * was not preempted or migrated or a signal is on the way, From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D997C2EFDBB for ; Sat, 23 Aug 2025 16:39:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967195; cv=none; b=ScDVJ08oRIBC8UXFZn4DJ1DD7dwz9UISOGEq+/rzZz+ZYO4Wi2Rbdq6X63z4Czf8NIIfDjD4oq9ah8xdErjwUx+Q6jWS1lTccsbb6cw4hRhVTvljipp7R/HsPHFuia9Uk9f5srtrdUp8Ru3ym+chuF7exjTv2hXHKegxAumY5Rs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967195; c=relaxed/simple; bh=ib147srNtRaGfMjNJX6VP8Nrd59NNxlfgAx9sJK6GmM=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=kNQfvGmlMo8VlTNkQ7hfGeA8XGky6b6vHUVrHKarBCDWtm34aBLcUcnBgwyMtxHb3KOepqxz0+15mwCnKeIz2+QRYl4TK4lu0oMKeaVURiVLxGBRGJpGR9G0wtJNtoymyzd429Ts3ia+952hJoET7CVEkren+mCxhLDAXvCb3dA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=2ztJn4vo; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=ua3Gw0jp; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de 
header.b="2ztJn4vo"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="ua3Gw0jp" Message-ID: <20250823161654.421576400@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1755967192; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=GN7N/WjuobM3/hdkgPG84P+YkkU3yzsHrdvAdHmio98=; b=2ztJn4vomKVRfle7FKcGlnOJiijZNyTamAaoWz7sBv2UQMgoI3Q7jXaCCCV/CbO+NMLQHz ll5l62uxG4I+mh2w3qSSNR/YxPXdMObH7BakpMFUUc9CJPP8zWst9SLusk/5FYW8B/nVzo 8dd0t4ESiJAaxj5jNVkHEx8Y5kVRTTNRMJQKV1I/pKrXppqLokH4Fyc0kt5+eUQekCA7Gq qoHDxh2NrFGTWHWJ0wbWCYrkNt1wf5/2Lxl1Qxs9x9F0+KdMD8heg65zNSfdt/J6hvQovi zvzz2lJFs5FDHlMMp35mNfoUvBbe4pg8vz0PoKQnCzcZqh7lBHohsMR1zYfpvQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1755967192; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=GN7N/WjuobM3/hdkgPG84P+YkkU3yzsHrdvAdHmio98=; b=ua3Gw0jpMOxUfC3Cu4w1bUhpHhBj0jR67V/b2m9X5swCzHQmhsP1jUyA7eFYdhDm+aCxri P+f29T/zeFcPzSDw== From: Thomas Gleixner To: LKML Cc: Jens Axboe , Peter Zijlstra , Mathieu Desnoyers , "Paul E. McKenney" , Boqun Feng , Paolo Bonzini , Sean Christopherson , Wei Liu , Dexuan Cui , x86@kernel.org, Arnd Bergmann , Heiko Carstens , Christian Borntraeger , Sven Schnelle , Huacai Chen , Paul Walmsley , Palmer Dabbelt Subject: [patch V2 18/37] rseq: Provide static branch for runtime debugging References: <20250823161326.635281786@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Sat, 23 Aug 2025 18:39:50 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Config based debug is rarely turned on and is not available easily when things go wrong. Provide a static branch to allow permanent integration of debug mechanisms along with the usual toggles in Kconfig, command line and debugfs. Requested-by: Peter Zijlstra Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- Documentation/admin-guide/kernel-parameters.txt | 4 + include/linux/rseq_entry.h | 3=20 init/Kconfig | 14 ++++ kernel/rseq.c | 73 +++++++++++++++++++= +++-- 4 files changed, 90 insertions(+), 4 deletions(-) --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -6443,6 +6443,10 @@ Memory area to be used by remote processor image, managed by CMA. =20 + rseq_debug=3D [KNL] Enable or disable restartable sequence + debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE. + Format: + rt_group_sched=3D [KNL] Enable or disable SCHED_RR/FIFO group scheduling when CONFIG_RT_GROUP_SCHED=3Dy. Defaults to !CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED. 
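[Background on the mechanism this patch relies on; a minimal sketch with hypothetical key and function names, not part of the patch. DEFINE_STATIC_KEY_MAYBE() picks the key's initial state from the given Kconfig symbol, and static_branch_unlikely() compiles to a patchable jump, so the disabled case costs a NOP in the hot path rather than a memory load and conditional:]

	#include <linux/jump_label.h>

	/* Initial state follows the (hypothetical) Kconfig default */
	DEFINE_STATIC_KEY_MAYBE(CONFIG_MY_DEBUG_DEFAULT, my_debug_enabled);

	static void my_expensive_checks(void) { /* debug-only work */ }

	static void my_hot_path(void)
	{
		/* Patched to a NOP/jump at runtime when the key is toggled */
		if (static_branch_unlikely(&my_debug_enabled))
			my_expensive_checks();
	}

	/* Toggled from a __setup() or debugfs write handler, as the patch does */
	static void my_debug_set(bool on)
	{
		if (on)
			static_branch_enable(&my_debug_enabled);
		else
			static_branch_disable(&my_debug_enabled);
	}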
--- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -34,6 +34,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_ #endif /* !CONFIG_RSEQ_STATS */ =20 #ifdef CONFIG_RSEQ +#include #include =20 #include @@ -66,6 +67,8 @@ static inline void rseq_trace_ip_fixup(u unsigned long offset, unsigned long abort_ip) { } #endif /* !CONFIG_TRACEPOINT */ =20 +DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enab= led); + static __always_inline void rseq_note_user_irq_entry(void) { if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) --- a/init/Kconfig +++ b/init/Kconfig @@ -1893,10 +1893,24 @@ config RSEQ_STATS =20 If unsure, say N. =20 +config RSEQ_DEBUG_DEFAULT_ENABLE + default n + bool "Enable restartable sequences debug mode by default" if EXPERT + depends on RSEQ + help + This enables the static branch for debug mode of restartable + sequences. + + This also can be controlled on the kernel command line via the + command line parameter "rseq_debug=3D0/1" and through debugfs. + + If unsure, say N. + config DEBUG_RSEQ default n bool "Enable debugging of rseq() system call" if EXPERT depends on RSEQ && DEBUG_KERNEL + select RSEQ_DEBUG_DEFAULT_ENABLE help Enable extra debugging checks for the rseq system call. =20 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -95,6 +95,27 @@ RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \ RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE) =20 +DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabl= ed); + +static inline void rseq_control_debug(bool on) +{ + if (on) + static_branch_enable(&rseq_debug_enabled); + else + static_branch_disable(&rseq_debug_enabled); +} + +static int __init rseq_setup_debug(char *str) +{ + bool on; + + if (kstrtobool(str, &on)) + return -EINVAL; + rseq_control_debug(on); + return 0; +} +__setup("rseq_debug=3D", rseq_setup_debug); + #ifdef CONFIG_TRACEPOINTS /* * Out of line, so the actual update functions can be in a header to be @@ -112,10 +133,11 @@ void __rseq_trace_ip_fixup(unsigned long } #endif /* CONFIG_TRACEPOINTS */ =20 +#ifdef CONFIG_DEBUG_FS #ifdef CONFIG_RSEQ_STATS DEFINE_PER_CPU(struct rseq_stats, rseq_stats); =20 -static int rseq_debug_show(struct seq_file *m, void *p) +static int rseq_stats_show(struct seq_file *m, void *p) { struct rseq_stats stats =3D { }; unsigned int cpu; @@ -140,14 +162,56 @@ static int rseq_debug_show(struct seq_fi return 0; } =20 +static int rseq_stats_open(struct inode *inode, struct file *file) +{ + return single_open(file, rseq_stats_show, inode->i_private); +} + +static const struct file_operations stat_ops =3D { + .open =3D rseq_stats_open, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D single_release, +}; + +static int __init rseq_stats_init(struct dentry *root_dir) +{ + debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops); + return 0; +} +#else +static inline void rseq_stats_init(struct dentry *root_dir) { } +#endif /* CONFIG_RSEQ_STATS */ + +static int rseq_debug_show(struct seq_file *m, void *p) +{ + bool on =3D static_branch_unlikely(&rseq_debug_enabled); + + seq_printf(m, "%d\n", on); + return 0; +} + +static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf, + size_t count, loff_t *ppos) +{ + bool on; + + if (kstrtobool_from_user(ubuf, count, &on)) + return -EINVAL; + + rseq_control_debug(on); + return count; +} + static int rseq_debug_open(struct inode *inode, struct file *file) { return single_open(file, rseq_debug_show, inode->i_private); } =20 -static const struct file_operations dfs_ops =3D { +static const struct file_operations 
debug_ops =3D { .open =3D rseq_debug_open, .read =3D seq_read, + .write =3D rseq_debug_write, .llseek =3D seq_lseek, .release =3D single_release, }; @@ -156,11 +220,12 @@ static int __init rseq_debugfs_init(void { struct dentry *root_dir =3D debugfs_create_dir("rseq", NULL); =20 - debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops); + debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops); + rseq_stats_init(root_dir); return 0; } __initcall(rseq_debugfs_init); -#endif /* CONFIG_RSEQ_STATS */ +#endif /* CONFIG_DEBUG_FS */ =20 #ifdef CONFIG_DEBUG_RSEQ static struct rseq *rseq_kernel_fields(struct task_struct *t) From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D30342F0693 for ; Sat, 23 Aug 2025 16:39:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967199; cv=none; b=BJ1XvH6le41t7VP6DDNDMUPinWk8Vu7zH587Xy03c85yXNAbytlJ4y/qhJ15y+rCIxHVwdECgWnfYmxiZE+6O8WrtjOAbjAjyafjKnD3SeV985ed7mzBIX0VllTdz6RtbHbrIvWuMBqJsB1ICdScKWen4UuCCEoUQV5Pu19UFpc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967199; c=relaxed/simple; bh=Saea/lUEebGbpIlJsRI6zr571NOidClqZ9Qtwfy62Q0=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=SmQQ4CzQK2Mb/hSpdJsB/XA31sfdtDr9iWSFhvzlDaeIBb9HtCJYu9mwyTJGps0+xOZ763tbn4l51WS2tljE3mBBIoFcKOMOKC2d1vW1L8YLMYYWJzs9ipSV5UhjZ4uY2sHh4zdXkvFl1aQMf7WV9F2+V6qGb2hREBcnBK6341I= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=ogoNHNLN; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=ls9VTPnZ; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="ogoNHNLN"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="ls9VTPnZ" Message-ID: <20250823161654.485063556@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1755967195; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=6wU70L6aiH6anSVUcj0JFrwe/8y9sXXw0wXa0Km/v3I=; b=ogoNHNLNhIERodVAnL7CvwIpuseoU00IuGhnCLrFKtLfm1MVgFUS3SpcfsNFwmlYh2G4F8 L15xu017QKIpDzm6sItoAfo/3szPJTuaih+7B/zJu7q2tSHDP0IxyS1QjW+JNzmr9fnEHd F1KySzJnkfKTQfFwPOiCUhfLemC7659za4kx/tITJV7Qmiw2WqUsaB8Cp1TQzHXrWMnXrt QNII1zIIYZO3EB3OYK5geHyG0zaXvT3MrFfUBPeefVrMPdLioG1i5aR/9SBTNspelAR/nq Vq+pWx49EwCJXKSnRUqljqm/SykJkNUfy/f0ZG6yzfwQpJ4EYOIChCgYEXtJHw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1755967195; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=6wU70L6aiH6anSVUcj0JFrwe/8y9sXXw0wXa0Km/v3I=; 
b=ls9VTPnZURR26/1eP5ePS0vov2EkzpZ0/9IDuW64GD0rKdjf72VWSzVUvjmCOGjFfjkvJB fABzW6eJOWJBF9AA==
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 19/37] rseq: Provide and use rseq_update_user_cs()
References: <20250823161326.635281786@linutronix.de>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Date: Sat, 23 Aug 2025 18:39:53 +0200 (CEST)
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Provide a straightforward implementation to check for and eventually clear or fix up critical sections in user space. The non-debug version does not do any sanity checks and aims for efficiency.

The only attack vector which needs to be reliably prevented is an abort IP in the kernel address space. That would cause at least x86 to return to kernel space via IRET. Instead of a check, just mask the address and be done with it.

The magic signature check along with its obscure "possible attack" printk is just voodoo security. If an attacker manages to manipulate the abort_ip member in the critical section descriptor, then it can equally manipulate any other indirection in the application. If user space truly cares about the security of the critical section descriptors, then it should set them up once and map the descriptor memory read-only. There is no justification for voodoo security in the kernel fast path to encourage user space to be careless under a completely nonsensical "security" claim. If the section descriptors are invalid, then the resulting misbehaviour of the user space application is not the kernel's problem.

The kernel provides a run-time switchable debug slow path, which implements the full zoo of checks (except the silly attack message), including termination of the task when one of the gazillion conditions is not met. Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will be replaced and removed in a subsequent step.

Signed-off-by: Thomas Gleixner
---
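To make the masking argument concrete, a small user space demonstration (illustration only, not part of the patch; the address below is just a typical x86_64 kernel text value). Clearing the top bit of the native long is all that is needed to keep a forged abort IP out of the kernel half of the address space; at worst the masked address faults like any other bad user pointer:

	#include <limits.h>
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* Forged abort IP pointing at kernel text */
		uint64_t abort_ip = 0xffffffff81000000ULL;

		/* The same mask the fast path applies: abort_ip &= (u64)LONG_MAX */
		abort_ip &= (uint64_t)LONG_MAX;

		/* Prints 0x7fffffff81000000: top bit cleared, IRET cannot reach kernel text */
		printf("%#llx\n", (unsigned long long)abort_ip);
		return 0;
	}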
 include/linux/rseq_entry.h | 194 ++++++++++++++++++++++++++++++++++++
 include/linux/rseq_types.h |  11 +-
 kernel/rseq.c              | 238 +++++++++++++----------------------------
 3 files changed, 273 insertions(+), 170 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -36,6 +36,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
 #ifdef CONFIG_RSEQ
 #include
 #include
+#include

 #include

@@ -69,12 +70,205 @@ static inline void rseq_trace_ip_fixup(u

 DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);

+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_inline
+#else
+#define rseq_inline __always_inline
+#endif
+
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+
 static __always_inline void rseq_note_user_irq_entry(void)
 {
 	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
 		current->rseq_event.user_irq = true;
 }

+/*
+ * Check whether there is a valid critical section and whether the
+ * instruction pointer in @regs is inside the critical section.
+ *
+ * - If the critical section is invalid, terminate the task.
+ *
+ * - If valid and the instruction pointer is inside, set it to the abort IP
+ *
+ * - If valid and the instruction pointer is outside, clear the critical
+ *   section address.
+ *
+ * Returns true if the section was valid and either fixup or clear was
+ * done, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when an invalid
+ * section was found. It is cleared when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+
+#ifdef RSEQ_BUILD_SLOW_PATH
+/*
+ * The debug version is put out of line, but kept here so the code stays
+ * together.
+ *
+ * @csaddr has already been checked by the caller to be in user space
+ */
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
+{
+	struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+	u64 start_ip, abort_ip, offset, cs_end, head, tasksize = TASK_SIZE;
+	unsigned long ip = instruction_pointer(regs);
+	u64 __user *uc_head = (u64 __user *) ucs;
+	u32 usig, __user *uc_sig;
+
+	if (!user_rw_masked_begin(ucs))
+		return false;
+
+	/*
+	 * Evaluate the user pile and exit if one of the conditions is not
+	 * fulfilled.
+	 */
+	unsafe_get_user(start_ip, &ucs->start_ip, fail);
+	if (unlikely(start_ip >= tasksize))
+		goto die;
+	/* If outside, just clear the critical section. */
+	if (ip < start_ip)
+		goto clear;
+
+	unsafe_get_user(offset, &ucs->post_commit_offset, fail);
+	cs_end = start_ip + offset;
+	/* Check for overflow and wraparound */
+	if (unlikely(cs_end >= tasksize || cs_end < start_ip))
+		goto die;
+
+	/* If not inside, clear it. */
+	if (ip >= cs_end)
+		goto clear;
+
+	unsafe_get_user(abort_ip, &ucs->abort_ip, fail);
+	/* Ensure it's "valid" */
+	if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
+		goto die;
+	/* Validate that the abort IP is not in the critical section */
+	if (unlikely(abort_ip - start_ip < offset))
+		goto die;
+
+	/*
+	 * Check version and flags for 0. No point in emitting deprecated
+	 * warnings before dying. That could be done in the slow path
+	 * eventually, but *shrug*.
+	 */
+	unsafe_get_user(head, uc_head, fail);
+	if (unlikely(head))
+		goto die;
+
+	/* abort_ip - 4 is >= 0.
See abort_ip check above */ + uc_sig =3D (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig)); + unsafe_get_user(usig, uc_sig, fail); + if (unlikely(usig !=3D t->rseq_sig)) + goto die; + + /* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=3Dy */ + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + /* If not in interrupt from user context, let it die */ + if (unlikely(!t->rseq_event.user_irq)) + goto die; + } + + unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail); + user_access_end(); + + instruction_pointer_set(regs, (unsigned long)abort_ip); + + rseq_stat_inc(rseq_stats.fixup); + rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip); + return true; +clear: + unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail); + user_access_end(); + rseq_stat_inc(rseq_stats.clear); + return true; +die: + t->rseq_event.fatal =3D true; +fail: + user_access_end(); + return false; +} +#endif /* RSEQ_BUILD_SLOW_PATH */ + +/* + * This only ensures that abort_ip is in the user address space by masking= it. + * No other sanity checks are done here, that's what the debug code is for. + */ +static rseq_inline bool +rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned = long csaddr) +{ + struct rseq_cs __user *ucs =3D (struct rseq_cs __user *)(unsigned long)cs= addr; + unsigned long ip =3D instruction_pointer(regs); + u64 start_ip, abort_ip, offset; + + rseq_stat_inc(rseq_stats.cs); + + if (unlikely(csaddr >=3D TASK_SIZE)) { + t->rseq_event.fatal =3D true; + return false; + } + + if (static_branch_unlikely(&rseq_debug_enabled)) + return rseq_debug_update_user_cs(t, regs, csaddr); + + if (!user_rw_masked_begin(ucs)) + return false; + + unsafe_get_user(start_ip, &ucs->start_ip, fail); + unsafe_get_user(offset, &ucs->post_commit_offset, fail); + unsafe_get_user(abort_ip, &ucs->abort_ip, fail); + + /* + * No sanity checks. If user space screwed it up, it can + * keep the pieces. That's what debug code is for. + * + * If outside, just clear the critical section. + */ + if (ip - start_ip >=3D offset) + goto clear; + + /* + * Force it to be in user space as x86 IRET would happily return to + * the kernel. Can't use TASK_SIZE as a mask because that's not + * necessarily a power of two. Just make sure it's in the user + * address space. Let the pagefault handler sort it out. + * + * Use LONG_MAX and not LLONG_MAX to keep it correct for 32 and 64 + * bit architectures. 
+ */ + abort_ip &=3D (u64)LONG_MAX; + + /* Invalidate the critical section */ + unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail); + user_access_end(); + + /* Update the instruction pointer */ + instruction_pointer_set(regs, (unsigned long)abort_ip); + + rseq_stat_inc(rseq_stats.fixup); + rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip); + return true; +clear: + unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail); + user_access_end(); + rseq_stat_inc(rseq_stats.clear); + return true; + +fail: + user_access_end(); + return false; +} + static __always_inline void rseq_exit_to_user_mode(void) { struct rseq_event *ev =3D ¤t->rseq_event; --- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -11,10 +11,12 @@ * @sched_switch: True if the task was scheduled out * @user_irq: True on interrupt entry from user mode * @has_rseq: True if the task has a rseq pointer installed + * @error: Compound error code for the slow path to analyze + * @fatal: User space data corrupted or invalid */ struct rseq_event { union { - u32 all; + u64 all; struct { union { u16 events; @@ -25,6 +27,13 @@ struct rseq_event { }; =20 u8 has_rseq; + u8 __pad; + union { + u16 error; + struct { + u8 fatal; + }; + }; }; }; }; --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -382,175 +382,15 @@ static int rseq_reset_rseq_cpu_node_id(s return -EFAULT; } =20 -/* - * Get the user-space pointer value stored in the 'rseq_cs' field. - */ -static int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs) -{ - if (!rseq_cs) - return -EFAULT; - -#ifdef CONFIG_64BIT - if (get_user(*rseq_cs, &rseq->rseq_cs)) - return -EFAULT; -#else - if (copy_from_user(rseq_cs, &rseq->rseq_cs, sizeof(*rseq_cs))) - return -EFAULT; -#endif - - return 0; -} - -/* - * If the rseq_cs field of 'struct rseq' contains a valid pointer to - * user-space, copy 'struct rseq_cs' from user-space and validate its fiel= ds. - */ -static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) -{ - struct rseq_cs __user *urseq_cs; - u64 ptr; - u32 __user *usig; - u32 sig; - int ret; - - ret =3D rseq_get_rseq_cs_ptr_val(t->rseq, &ptr); - if (ret) - return ret; - - /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */ - if (!ptr) { - memset(rseq_cs, 0, sizeof(*rseq_cs)); - return 0; - } - /* Check that the pointer value fits in the user-space process space. */ - if (ptr >=3D TASK_SIZE) - return -EINVAL; - urseq_cs =3D (struct rseq_cs __user *)(unsigned long)ptr; - if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) - return -EFAULT; - - if (rseq_cs->start_ip >=3D TASK_SIZE || - rseq_cs->start_ip + rseq_cs->post_commit_offset >=3D TASK_SIZE || - rseq_cs->abort_ip >=3D TASK_SIZE || - rseq_cs->version > 0) - return -EINVAL; - /* Check for overflow. */ - if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip) - return -EINVAL; - /* Ensure that abort_ip is not in the critical section. */ - if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset) - return -EINVAL; - - usig =3D (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32)); - ret =3D get_user(sig, usig); - if (ret) - return ret; - - if (current->rseq_sig !=3D sig) { - printk_ratelimited(KERN_WARNING - "Possible attack attempt. 
Unexpected rseq signature 0x%x, expecting 0x%= x (pid=3D%d, addr=3D%p).\n", - sig, current->rseq_sig, current->pid, usig); - return -EINVAL; - } - return 0; -} - -static bool rseq_warn_flags(const char *str, u32 flags) +static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs) { - u32 test_flags; + u64 csaddr; =20 - if (!flags) + if (get_user_masked_u64(&csaddr, &t->rseq->rseq_cs)) return false; - test_flags =3D flags & RSEQ_CS_NO_RESTART_FLAGS; - if (test_flags) - pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, st= r); - test_flags =3D flags & ~RSEQ_CS_NO_RESTART_FLAGS; - if (test_flags) - pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str); - return true; -} - -static int rseq_check_flags(struct task_struct *t, u32 cs_flags) -{ - u32 flags; - int ret; - - if (rseq_warn_flags("rseq_cs", cs_flags)) - return -EINVAL; - - /* Get thread flags. */ - ret =3D get_user(flags, &t->rseq->flags); - if (ret) - return ret; - - if (rseq_warn_flags("rseq", flags)) - return -EINVAL; - return 0; -} - -static int clear_rseq_cs(struct rseq __user *rseq) -{ - /* - * The rseq_cs field is set to NULL on preemption or signal - * delivery on top of rseq assembly block, as well as on top - * of code outside of the rseq assembly block. This performs - * a lazy clear of the rseq_cs field. - * - * Set rseq_cs to NULL. - */ -#ifdef CONFIG_64BIT - return put_user(0UL, &rseq->rseq_cs); -#else - if (clear_user(&rseq->rseq_cs, sizeof(rseq->rseq_cs))) - return -EFAULT; - return 0; -#endif -} - -/* - * Unsigned comparison will be true when ip >=3D start_ip, and when - * ip < start_ip + post_commit_offset. - */ -static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs) -{ - return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset; -} - -static int rseq_ip_fixup(struct pt_regs *regs, bool abort) -{ - unsigned long ip =3D instruction_pointer(regs); - struct task_struct *t =3D current; - struct rseq_cs rseq_cs; - int ret; - - rseq_stat_inc(rseq_stats.cs); - - ret =3D rseq_get_rseq_cs(t, &rseq_cs); - if (ret) - return ret; - - /* - * Handle potentially not being within a critical section. - * If not nested over a rseq critical section, restart is useless. - * Clear the rseq_cs pointer and return. 
- */ - if (!in_rseq_cs(ip, &rseq_cs)) { - rseq_stat_inc(rseq_stats.clear); - return clear_rseq_cs(t->rseq); - } - ret =3D rseq_check_flags(t, rseq_cs.flags); - if (ret < 0) - return ret; - if (!abort) - return 0; - ret =3D clear_rseq_cs(t->rseq); - if (ret) - return ret; - rseq_stat_inc(rseq_stats.fixup); - trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset, - rseq_cs.abort_ip); - instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip); - return 0; + if (likely(!csaddr)) + return true; + return rseq_update_user_cs(t, regs, csaddr); } =20 /* @@ -567,8 +407,8 @@ static int rseq_ip_fixup(struct pt_regs void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *reg= s) { struct task_struct *t =3D current; - int ret, sig; bool event; + int sig; =20 /* * If invoked from hypervisors before entering the guest via @@ -618,8 +458,7 @@ void __rseq_handle_notify_resume(struct if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event) return; =20 - ret =3D rseq_ip_fixup(regs, event); - if (unlikely(ret < 0)) + if (!rseq_handle_cs(t, regs)) goto error; =20 if (unlikely(rseq_update_cpu_node_id(t))) @@ -632,6 +471,67 @@ void __rseq_handle_notify_resume(struct } =20 #ifdef CONFIG_DEBUG_RSEQ +/* + * Unsigned comparison will be true when ip >=3D start_ip, and when + * ip < start_ip + post_commit_offset. + */ +static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs) +{ + return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset; +} + +/* + * If the rseq_cs field of 'struct rseq' contains a valid pointer to + * user-space, copy 'struct rseq_cs' from user-space and validate its fiel= ds. + */ +static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) +{ + struct rseq_cs __user *urseq_cs; + u64 ptr; + u32 __user *usig; + u32 sig; + int ret; + + if (get_user_masked_u64(&ptr, &t->rseq->rseq_cs)) + return -EFAULT; + + /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */ + if (!ptr) { + memset(rseq_cs, 0, sizeof(*rseq_cs)); + return 0; + } + /* Check that the pointer value fits in the user-space process space. */ + if (ptr >=3D TASK_SIZE) + return -EINVAL; + urseq_cs =3D (struct rseq_cs __user *)(unsigned long)ptr; + if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) + return -EFAULT; + + if (rseq_cs->start_ip >=3D TASK_SIZE || + rseq_cs->start_ip + rseq_cs->post_commit_offset >=3D TASK_SIZE || + rseq_cs->abort_ip >=3D TASK_SIZE || + rseq_cs->version > 0) + return -EINVAL; + /* Check for overflow. */ + if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip) + return -EINVAL; + /* Ensure that abort_ip is not in the critical section. */ + if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset) + return -EINVAL; + + usig =3D (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32)); + ret =3D get_user(sig, usig); + if (ret) + return ret; + + if (current->rseq_sig !=3D sig) { + printk_ratelimited(KERN_WARNING + "Possible attack attempt. 
Unexpected rseq signature 0x%x, expecting 0x%= x (pid=3D%d, addr=3D%p).\n", + sig, current->rseq_sig, current->pid, usig); + return -EINVAL; + } + return 0; +} =20 /* * Terminate the process if a syscall is issued within a restartable From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 46F722F0C50 for ; Sat, 23 Aug 2025 16:39:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967200; cv=none; b=svkn3wbJpxf7e4fhq/EBac+bNma7MUYw7ux4geXMYrJMTEpSXPPiGXq4miT87V78fOTxZ11ftyKauSSYKfl9NveMYuy9Kayq0J5PrMZPpbwNbgJ/sW3Tsdpg7QvxaIg0i3ozA8yJ+ZB5EUN87Xvbr3gQ1a4qrLFaU9/zbJLkICY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967200; c=relaxed/simple; bh=jxWfc4ZGF7gMUH+9j9atldUYMU8wuTOf7zgFXjl82tc=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=Uz8jWn7kJBR6yFH6JIWnOoEop9joVx8AT5PdV+z2ZNJScxP6QHPOJqhHIVthHzR8D00Ns2jUSul+XfUSTaH2XK6IlMdFLzL6ZZXfEd07v3Vmz97WHZ7CHc+8VuvSCfURgahsagyc6YwLR3JS/PJRvJ2u7wKS5+F5aX3RqR9VgiU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=1OwiNDf3; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=5U5uDl7G; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="1OwiNDf3"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="5U5uDl7G" Message-ID: <20250823161654.547589855@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1755967197; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=rq8NUAnUcbySGYLgiYD+HlneHsCvVdZ5Dsg0Eo8YCp4=; b=1OwiNDf3Slrb38P5fclYHsVg2ljS9fO7udIQcwUeTU7+4wx2kqOcRlvxE8XRR2T9hPCLhl 79e2XyAnhxf8t9SAmLH9aMi6VWIUPJW5NDQZ3vlLAqLEto5t9JIe/Un4X6evYIGiqCp7gF DLdiJi7yT/2WZVZDl/bnqx5i5UgCNYXpmZ+4BxohXAUquY8hYv149dsUG7lAfCofUJiMl6 Sz+h2jwaRuN6INRg1nHIYnKDci3kM/y+QGDnyQhvklDc6r1apeRz0YMv4NUgJoIqBxutUY WjvxTFxHA7eheqoF143WNuH1Zpamn3XrV1BzPSxcFQij6ay8MaCe8AP+B37kxQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1755967197; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=rq8NUAnUcbySGYLgiYD+HlneHsCvVdZ5Dsg0Eo8YCp4=; b=5U5uDl7G8XP7f2QUEaQaD8Pe922doZJIG48oLdUAgh0lsMuMn06Wdg1OYAw2kmJ9V7HcVN hhOAwhp6bxrsv6AA== From: Thomas Gleixner To: LKML Cc: Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. 
McKenney" , Boqun Feng , Paolo Bonzini , Sean Christopherson , Wei Liu , Dexuan Cui , x86@kernel.org, Arnd Bergmann , Heiko Carstens , Christian Borntraeger , Sven Schnelle , Huacai Chen , Paul Walmsley , Palmer Dabbelt Subject: [patch V2 20/37] rseq: Replace the debug crud References: <20250823161326.635281786@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Sat, 23 Aug 2025 18:39:56 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Just utilize the new infrastructure and put the original one to rest. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- kernel/rseq.c | 80 ++++++++---------------------------------------------= ----- 1 file changed, 12 insertions(+), 68 deletions(-) --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -472,83 +472,27 @@ void __rseq_handle_notify_resume(struct =20 #ifdef CONFIG_DEBUG_RSEQ /* - * Unsigned comparison will be true when ip >=3D start_ip, and when - * ip < start_ip + post_commit_offset. - */ -static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs) -{ - return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset; -} - -/* - * If the rseq_cs field of 'struct rseq' contains a valid pointer to - * user-space, copy 'struct rseq_cs' from user-space and validate its fiel= ds. - */ -static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) -{ - struct rseq_cs __user *urseq_cs; - u64 ptr; - u32 __user *usig; - u32 sig; - int ret; - - if (get_user_masked_u64(&ptr, &t->rseq->rseq_cs)) - return -EFAULT; - - /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */ - if (!ptr) { - memset(rseq_cs, 0, sizeof(*rseq_cs)); - return 0; - } - /* Check that the pointer value fits in the user-space process space. */ - if (ptr >=3D TASK_SIZE) - return -EINVAL; - urseq_cs =3D (struct rseq_cs __user *)(unsigned long)ptr; - if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) - return -EFAULT; - - if (rseq_cs->start_ip >=3D TASK_SIZE || - rseq_cs->start_ip + rseq_cs->post_commit_offset >=3D TASK_SIZE || - rseq_cs->abort_ip >=3D TASK_SIZE || - rseq_cs->version > 0) - return -EINVAL; - /* Check for overflow. */ - if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip) - return -EINVAL; - /* Ensure that abort_ip is not in the critical section. */ - if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset) - return -EINVAL; - - usig =3D (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32)); - ret =3D get_user(sig, usig); - if (ret) - return ret; - - if (current->rseq_sig !=3D sig) { - printk_ratelimited(KERN_WARNING - "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%= x (pid=3D%d, addr=3D%p).\n", - sig, current->rseq_sig, current->pid, usig); - return -EINVAL; - } - return 0; -} - -/* * Terminate the process if a syscall is issued within a restartable * sequence. 
*/ void rseq_syscall(struct pt_regs *regs) { - unsigned long ip =3D instruction_pointer(regs); struct task_struct *t =3D current; - struct rseq_cs rseq_cs; + u64 csaddr; =20 - if (!t->rseq) + if (!t->rseq_event.has_rseq) + return; + if (get_user_masked_u64(&csaddr, &t->rseq->rseq_cs)) + goto fail; + if (likely(!csaddr)) return; - if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) - force_sig(SIGSEGV); + if (unlikely(csaddr >=3D TASK_SIZE)) + goto fail; + if (rseq_debug_update_user_cs(t, regs, csaddr)) + return; +fail: + force_sig(SIGSEGV); } - #endif =20 /* From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E2E222F0C6D for ; Sat, 23 Aug 2025 16:40:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967203; cv=none; b=hRiBKHM5j3j8bxlFaVzgUyuyy/jyOjKOFT66SXPSjSyUodyETFYkfh36UKI2OY4lQN7EC0wVkhU95RbGerUMx1NhgWfoXBZI+1QkhHmnT0x2f6q/GiXTkslmbFL5+2RGSesUJp9XzR+oevtr8dAbmdngUaGyz+Kx8A8jOFQGqp8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967203; c=relaxed/simple; bh=nMlNVmuLCaPER50+OebglV5Z2zMqBLfEuJ8hNIcC/7A=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=grcDEe4WUthphARII0ZfMqBR6BFdeOR5o3e0qvBxug7+DOSFn1EngR2GNSr4h+bOGGWysj7ix7r92AYh5n6r03F092tkWkz7lq04sH3V/FaJoZKw/4fLjddbvus4aco3iYGTlu3ajtlaxWhBDr84OI+Woo/z3Kgyj33G5pcg8BA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=OYiTH1CB; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=h4ZSn6v+; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="OYiTH1CB"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="h4ZSn6v+" Message-ID: <20250823161654.612526581@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1755967200; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=IvFox5Rkn87y9XhjhzrcVRW95fbzbaB+KfxmN/l1TiU=; b=OYiTH1CBWadZF4m3+aVx96g2mTd2YCh1ZsGFpMhhDGVRJltlmKH+fwCQyPH/gBdxEAtao9 /7HPrcqqwJ9jgksK4fYriCJO9dcgu5wViLwrXaVVP4ecBHpk40M4i3dXr5Gvj/Dv3lW5X4 kSv7MLAprYjff66AMgsV5CjprkV5U+yBB3DC12qarWj+314STBBtn0sWJGRUiaOa7UnK7E 9sYborRyif+4NJS0mN5oMThs/GCsWmgr/O4i4niNXhaxXZSwpIzPrJ+R2QanWsTnHgF6zA tc+gHZOowCdkPu5EhILMy/oV7AU89iHbeO6CaK51J5pNG15WUVNVbh2iaTb30A== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1755967200; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=IvFox5Rkn87y9XhjhzrcVRW95fbzbaB+KfxmN/l1TiU=; b=h4ZSn6v+IvTOeIPBx/cyKAF0E1AVkgzKVpgXLSiWvvvaVCMKPuEMXqnpCarENc9RT6S8OY 
stKLgXl1X/F8KqDw== From: Thomas Gleixner To: LKML Cc: Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , Boqun Feng , Paolo Bonzini , Sean Christopherson , Wei Liu , Dexuan Cui , x86@kernel.org, Arnd Bergmann , Heiko Carstens , Christian Borntraeger , Sven Schnelle , Huacai Chen , Paul Walmsley , Palmer Dabbelt Subject: [patch V2 21/37] rseq: Make exit debugging static branch based References: <20250823161326.635281786@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Sat, 23 Aug 2025 18:39:59 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Disconnect it from the config switch and use the static debug branch. This is a temporary measure for validating the rework. At the end this check needs to be hidden behind lockdep as it has nothing to do with the other debug infrastructure, which mainly aids user space debugging by enabling a zoo of checks which terminate misbehaving tasks instead of letting them keep the hard to diagnose pieces. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/rseq_entry.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -275,7 +275,7 @@ static __always_inline void rseq_exit_to =20 rseq_stat_inc(rseq_stats.exit); =20 - if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) + if (static_branch_unlikely(&rseq_debug_enabled)) WARN_ON_ONCE(ev->sched_switch); =20 /* From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9F1F82F0C50 for ; Sat, 23 Aug 2025 16:40:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967206; cv=none; b=a5OXhnsi7dY9eU04aopEGSSDvig6RJJig60iQNQLQmnXLNCNwqHzhVssHVFjqoZ53fKfCePd2SxVfg/b3riZv/iPAoXoRwCHvs+ajloMYeceGctoxl+IIVVIoM+9J1touLygWGSuCS99v/kd4P+I/ZQEpljToUoGuVbHtlsRBNA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967206; c=relaxed/simple; bh=aHyYhrngWqwek+zL2LXHM/m7IygoiX4dvSyBLnV+pro=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=lXBmVYaufN+Otzy2SZMc1StPNq+7kc/ZXnlDXGaB9rU4QSxgpkFGoAEeNV7np4UGvvEs70XoqGh70c7IkHp7REk80QLjdmWhOt9ofHEew7CxiuiIoifZGnYMrZZDMYlVXwY0UVIX53to4K1CeD3OBhu9w+PD2r/re3oRqZi1L0U= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=JlVmszf6; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=BH9vQR+Q; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="JlVmszf6"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="BH9vQR+Q" Message-ID: <20250823161654.677961303@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/relaxed; d=linutronix.de; s=2020; t=1755967202; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=w6iXRrtymKNRLhlRnwTuItMwaWaIGSiBnDPpk0Weh9Q=; b=JlVmszf6zuLa558wwGwXfsGer34qAWJgk4Io4gI4Hi7UQHBaySGe7ETrrzvPeD3tV22SOw 9aQMn7iLlyTBMzB4t9Wl+WiW8XrLAQmNBx/8jkygSYHpbs2xwdfC16/QLAEPzQA/0xPTUO QG9tUVGpIv3nDqo1BKJz0ytFoHZ0IonotzgD2KbtiLjs1SH0Yw1XsZUj1KQ/xia5j0tmGu sQkKs9MDeOO3NRfmWeAdfKAL1pnxkHv+aFyKCXV5xRneybb1T574STRPspaxx/+aME59lt KxpQzSk1Ddri4SGXC+cjnDlcgNp3NTZg9eaNRlJOQo4CorE/U6JoDTseFOYHfw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1755967202; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=w6iXRrtymKNRLhlRnwTuItMwaWaIGSiBnDPpk0Weh9Q=; b=BH9vQR+Qa4g0mmwNxUL2Axhqar7ElTUzcJmwtQoKv5dlRRO9dq4RqjAhCJzqKh+khhjw1V tblihcYN0GmnkGDw== From: Thomas Gleixner To: LKML Cc: Jens Axboe , Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , Boqun Feng , Paolo Bonzini , Sean Christopherson , Wei Liu , Dexuan Cui , x86@kernel.org, Arnd Bergmann , Heiko Carstens , Christian Borntraeger , Sven Schnelle , Huacai Chen , Paul Walmsley , Palmer Dabbelt Subject: [patch V2 22/37] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y References: <20250823161326.635281786@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Sat, 23 Aug 2025 18:40:01 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Make the syscall exit debug mechanism available via the static branch on architectures which utilize the generic entry code. Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/entry-common.h | 2 +- include/linux/rseq_entry.h | 9 +++++++++ kernel/rseq.c | 19 +++++++++++++------ 3 files changed, 23 insertions(+), 7 deletions(-) --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -146,7 +146,7 @@ static __always_inline void syscall_exit local_irq_enable(); } =20 - rseq_syscall(regs); + rseq_debug_syscall_return(regs); =20 /* * Do one-time syscall specific work. If these work items are --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -286,9 +286,18 @@ static __always_inline void rseq_exit_to ev->events =3D 0; } =20 +void __rseq_debug_syscall_return(struct pt_regs *regs); + +static inline void rseq_debug_syscall_return(struct pt_regs *regs) +{ + if (static_branch_unlikely(&rseq_debug_enabled)) + __rseq_debug_syscall_return(regs); +} + #else /* CONFIG_RSEQ */ static inline void rseq_note_user_irq_entry(void) { } static inline void rseq_exit_to_user_mode(void) { } +static inline void rseq_debug_syscall_return(struct pt_regs *regs) { } #endif /* !CONFIG_RSEQ */ =20 #endif /* _LINUX_RSEQ_ENTRY_H */ --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -470,12 +470,7 @@ void __rseq_handle_notify_resume(struct force_sigsegv(sig); } =20 -#ifdef CONFIG_DEBUG_RSEQ -/* - * Terminate the process if a syscall is issued within a restartable - * sequence. 
- */ -void rseq_syscall(struct pt_regs *regs) +void __rseq_debug_syscall_return(struct pt_regs *regs) { struct task_struct *t =3D current; u64 csaddr; @@ -493,6 +488,18 @@ void rseq_syscall(struct pt_regs *regs) fail: force_sig(SIGSEGV); } + +#ifdef CONFIG_DEBUG_RSEQ +/* + * Kept around to keep GENERIC_ENTRY=3Dn architectures supported. + * + * Terminate the process if a syscall is issued within a restartable + * sequence. + */ +void rseq_syscall(struct pt_regs *regs) +{ + __rseq_debug_syscall_return(regs); +} #endif =20 /* From nobody Fri Oct 3 21:53:25 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AFF072F28F6 for ; Sat, 23 Aug 2025 16:40:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967209; cv=none; b=bMNaoII2FKhG0YFhLLvdO3ql6KoJvoAc7L+e7yqogykRpOhd6Hlctp0m9RxcF7GSxX9QSOsGlWgoYuxAxvWRipuJa2TmyMERbJbw8dG79Pv1FlOlNm2Nb5wDxUA3ZWcD3LJiQ/mZfCodkyINQzZdpwzUJrsLYn+2TgQXJmQ2hSM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755967209; c=relaxed/simple; bh=3vcMTFnzdFo7iLS+NmZzXu6jyhCw3oUQa9sL2gREZDg=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=ulhQnyCfu3sz1K4Ocwx15oV5lXpsQU+0GIdFN9BJp+arcD1y/NyMNduolei046Xvpz8QRutVRtzXJ5v4tKWv44dmi9zcSkF8ik6iuRIy5f0YLdrjqy7kuR/4EP6hVL8kZ7ZmNK6FE/cl0qaB9KxbtaAQNCy7NPraulx8r6k3km8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=EWwSX5lh; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=O+6ImgiV; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="EWwSX5lh"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="O+6ImgiV" Message-ID: <20250823161654.741798449@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1755967205; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=Cq5YKBhli0VY0E2wrTbqGwUFfuY1eQ9zjTaie1zvtx8=; b=EWwSX5lhpcACPKKEzgIU3XmSFCl9XS0424+Tck91zpRZEI941qrexNR2mWSs8s0YvPILkL KaPR/mYw5F1lR3WBchcfxzniC+dCTx80XyL6ChbXFv+N5QEYl3jV+CvLqu10VsTlKpkLDJ HAwoYSd5+wyq86uXmSa2Eq7zprAdc7aynx+ezeSJuWgt2J0ZCvCs8j12WHW/gdCWnQDz71 p0M00nD+XYNZIUUzEKFvSalA5JNAKmrHmlPWLBxUX8cUrZiiuAJxMGrYE4ZxzI0CxyE0xF dwwoTG4N77nU+/lTV6HExqLM3l4yi8tjmgNh+0tou7E7uk+S2xYaVHOyvvksxw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1755967205; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=Cq5YKBhli0VY0E2wrTbqGwUFfuY1eQ9zjTaie1zvtx8=; b=O+6ImgiVEkzrEJFzx3OsUowwznArRTWf3slwuXlqVa6lMERvkrhAcltU7vWDPOorIr7Mkz mFKkZdlTDWu0juAQ== From: Thomas Gleixner To: 
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161654.741798449@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 23/37] rseq: Provide and use rseq_set_uids()
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:04 +0200 (CEST)

Provide a new and straightforward implementation to set the IDs (CPU ID,
Node ID and MM CID), which can later be inlined into the fast path. It does
all operations in one user_rw_masked_begin() section and also retrieves the
critical section member (rseq::rseq_cs) from user space to avoid another
user*begin/end() pair. This is in preparation for optimizing the fast path
to avoid extra work when not required.

Use it to replace the whole related zoo in rseq.c.

Signed-off-by: Thomas Gleixner
---
 fs/binfmt_elf.c            |   2
 include/linux/rseq_entry.h |  95 ++++++++++++++++++++
 include/linux/rseq_types.h |   2
 include/linux/sched.h      |  10 --
 kernel/rseq.c              | 208 ++++++-----------------------------------
 5 files changed, 130 insertions(+), 187 deletions(-)

--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -46,7 +46,7 @@
 #include
 #include
 #include
-#include
+#include
 #include
 #include
 
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -38,6 +38,8 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
 #include
 #include
 
+#include
+
 #include
 
 #ifdef CONFIG_TRACEPOINTS
@@ -77,6 +79,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
 #endif
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+bool rseq_debug_validate_uids(struct task_struct *t);
 
 static __always_inline void rseq_note_user_irq_entry(void)
 {
@@ -198,6 +201,44 @@ bool rseq_debug_update_user_cs(struct ta
 	user_access_end();
 	return false;
 }
+
+/*
+ * On debug kernels validate that user space did not mess with it if
+ * DEBUG_RSEQ is enabled, but don't on the first exit to user space. In
+ * that case cpu_cid is ~0. See fork/execve.
+ */
+bool rseq_debug_validate_uids(struct task_struct *t)
+{
+	u32 cpu_id, uval, node_id = cpu_to_node(task_cpu(t));
+	struct rseq __user *rseq = t->rseq;
+
+	if (t->rseq_ids.cpu_cid == ~0)
+		return true;
+
+	if (!user_read_masked_begin(rseq))
+		return false;
+
+	unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
+	if (cpu_id != t->rseq_ids.cpu_id)
+		goto die;
+	unsafe_get_user(uval, &rseq->cpu_id, efault);
+	if (uval != cpu_id)
+		goto die;
+	unsafe_get_user(uval, &rseq->node_id, efault);
+	if (uval != node_id)
+		goto die;
+	unsafe_get_user(uval, &rseq->mm_cid, efault);
+	if (uval != t->rseq_ids.mm_cid)
+		goto die;
+	user_access_end();
+	return true;
+die:
+	t->rseq_event.fatal = true;
+efault:
+	user_access_end();
+	return false;
+}
+
 #endif /* RSEQ_BUILD_SLOW_PATH */
 
 /*
@@ -268,6 +309,60 @@ rseq_update_user_cs(struct task_struct *
 	user_access_end();
 	return false;
 }
+
+/*
+ * Updates CPU ID, Node ID and MM CID and reads the critical section
+ * address, when @csaddr != NULL. This allows to put the ID update and the
+ * read under the same uaccess region to spare a separate begin/end.
+ *
+ * As this is either invoked from a C wrapper with @csaddr = NULL or from
+ * the fast path code with a valid pointer, a clever compiler should be
+ * able to optimize the read out. Spares a duplicate implementation.
+ *
+ * Returns true, if the operation was successful, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when invalid data
+ * was found on debug kernels. It's clear when the failure was an
+ * unresolved page fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+static rseq_inline
+bool rseq_set_uids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
+			      u32 node_id, u64 *csaddr)
+{
+	struct rseq __user *rseq = t->rseq;
+
+	if (static_branch_unlikely(&rseq_debug_enabled)) {
+		if (!rseq_debug_validate_uids(t))
+			return false;
+	}
+
+	if (!user_rw_masked_begin(rseq))
+		return false;
+
+	unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
+	unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
+	unsafe_put_user(node_id, &rseq->node_id, efault);
+	unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
+	if (csaddr)
+		unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+	user_access_end();
+
+	/* Cache the new values */
+	t->rseq_ids.cpu_cid = ids->cpu_cid;
+	rseq_stat_inc(rseq_stats.ids);
+	rseq_trace_update(t, ids);
+	return true;
+efault:
+	user_access_end();
+	return false;
+}
 
 static __always_inline void rseq_exit_to_user_mode(void)
 {

--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -3,6 +3,8 @@
 #define _LINUX_RSEQ_TYPES_H
 
 #include
+/* Forward declaration for sched.h */
+struct rseq;
 
 /*
  * struct rseq_event - Storage for rseq related event management

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -42,7 +42,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -1407,15 +1406,6 @@ struct task_struct {
 	u32			rseq_sig;
 	struct rseq_event	rseq_event;
 	struct rseq_ids		rseq_ids;
-# ifdef CONFIG_DEBUG_RSEQ
-	/*
-	 * This is a place holder to save a copy of the rseq fields for
-	 * validation of read-only fields. The struct rseq has a
-	 * variable-length array at the end, so it cannot be used
-	 * directly. Reserve a size large enough for the known fields.
-	 */
-	char			rseq_fields[sizeof(struct rseq)];
-# endif
 #endif
 
 #ifdef CONFIG_SCHED_MM_CID

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -88,13 +88,6 @@
 # define RSEQ_EVENT_GUARD	preempt
 #endif
 
-/* The original rseq structure size (including padding) is 32 bytes.
- */
-#define ORIG_RSEQ_SIZE	32
-
-#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
-				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
-				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
-
 DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
 
 static inline void rseq_control_debug(bool on)
@@ -227,159 +220,9 @@ static int __init rseq_debugfs_init(void
 __initcall(rseq_debugfs_init);
 #endif /* CONFIG_DEBUG_FS */
 
-#ifdef CONFIG_DEBUG_RSEQ
-static struct rseq *rseq_kernel_fields(struct task_struct *t)
-{
-	return (struct rseq *) t->rseq_fields;
-}
-
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
-	static DEFINE_RATELIMIT_STATE(_rs,
-				      DEFAULT_RATELIMIT_INTERVAL,
-				      DEFAULT_RATELIMIT_BURST);
-	u32 cpu_id_start, cpu_id, node_id, mm_cid;
-	struct rseq __user *rseq = t->rseq;
-
-	/*
-	 * Validate fields which are required to be read-only by
-	 * user-space.
-	 */
-	if (!user_read_access_begin(rseq, t->rseq_len))
-		goto efault;
-	unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
-	unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
-	unsafe_get_user(node_id, &rseq->node_id, efault_end);
-	unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end);
-	user_read_access_end();
-
-	if ((cpu_id_start != rseq_kernel_fields(t)->cpu_id_start ||
-	     cpu_id != rseq_kernel_fields(t)->cpu_id ||
-	     node_id != rseq_kernel_fields(t)->node_id ||
-	     mm_cid != rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) {
-
-		pr_warn("Detected rseq corruption for pid: %d, name: %s\n"
-			"\tcpu_id_start: %u ?= %u\n"
-			"\tcpu_id: %u ?= %u\n"
-			"\tnode_id: %u ?= %u\n"
-			"\tmm_cid: %u ?= %u\n",
-			t->pid, t->comm,
-			cpu_id_start, rseq_kernel_fields(t)->cpu_id_start,
-			cpu_id, rseq_kernel_fields(t)->cpu_id,
-			node_id, rseq_kernel_fields(t)->node_id,
-			mm_cid, rseq_kernel_fields(t)->mm_cid);
-	}
-
-	/* For now, only print a console warning on mismatch. */
-	return 0;
-
-efault_end:
-	user_read_access_end();
-efault:
-	return -EFAULT;
-}
-
-/*
- * Update an rseq field and its in-kernel copy in lock-step to keep a coherent
- * state.
- */
-#define rseq_unsafe_put_user(t, value, field, error_label)		\
-	do {								\
-		unsafe_put_user(value, &t->rseq->field, error_label);	\
-		rseq_kernel_fields(t)->field = value;			\
-	} while (0)
-
-#else
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
-	return 0;
-}
-
-#define rseq_unsafe_put_user(t, value, field, error_label)		\
-	unsafe_put_user(value, &t->rseq->field, error_label)
-#endif
-
-static int rseq_update_cpu_node_id(struct task_struct *t)
-{
-	struct rseq __user *rseq = t->rseq;
-	u32 cpu_id = raw_smp_processor_id();
-	u32 node_id = cpu_to_node(cpu_id);
-	u32 mm_cid = task_mm_cid(t);
-
-	rseq_stat_inc(rseq_stats.ids);
-
-	/* Validate read-only rseq fields on debug kernels */
-	if (rseq_validate_ro_fields(t))
-		goto efault;
-	WARN_ON_ONCE((int) mm_cid < 0);
-
-	if (!user_write_access_begin(rseq, t->rseq_len))
-		goto efault;
-
-	rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
-	rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
-	rseq_unsafe_put_user(t, node_id, node_id, efault_end);
-	rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
-	/* Cache the user space values */
-	t->rseq_ids.cpu_id = cpu_id;
-	t->rseq_ids.mm_cid = mm_cid;
-
-	/*
-	 * Additional feature fields added after ORIG_RSEQ_SIZE
-	 * need to be conditionally updated only if
-	 * t->rseq_len != ORIG_RSEQ_SIZE.
-	 */
-	user_write_access_end();
-	trace_rseq_update(t);
-	return 0;
-
-efault_end:
-	user_write_access_end();
-efault:
-	return -EFAULT;
-}
-
-static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
+static bool rseq_set_uids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
 {
-	struct rseq __user *rseq = t->rseq;
-	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
-	    mm_cid = 0;
-
-	/*
-	 * Validate read-only rseq fields.
-	 */
-	if (rseq_validate_ro_fields(t))
-		goto efault;
-
-	if (!user_write_access_begin(rseq, t->rseq_len))
-		goto efault;
-
-	/*
-	 * Reset all fields to their initial state.
-	 *
-	 * All fields have an initial state of 0 except cpu_id which is set to
-	 * RSEQ_CPU_ID_UNINITIALIZED, so that any user coming in after
-	 * unregistration can figure out that rseq needs to be registered
-	 * again.
-	 */
-	rseq_unsafe_put_user(t, cpu_id_start, cpu_id_start, efault_end);
-	rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
-	rseq_unsafe_put_user(t, node_id, node_id, efault_end);
-	rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
-	/*
-	 * Additional feature fields added after ORIG_RSEQ_SIZE
-	 * need to be conditionally reset only if
-	 * t->rseq_len != ORIG_RSEQ_SIZE.
-	 */
-	user_write_access_end();
-	return 0;
-
-efault_end:
-	user_write_access_end();
-efault:
-	return -EFAULT;
+	return rseq_set_uids_get_csaddr(t, ids, node_id, NULL);
 }
 
 static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
@@ -407,6 +250,8 @@ static bool rseq_handle_cs(struct task_s
 void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 {
 	struct task_struct *t = current;
+	struct rseq_ids ids;
+	u32 node_id;
 	bool event;
 	int sig;
 
@@ -453,6 +298,8 @@ void __rseq_handle_notify_resume(struct
 	scoped_guard(RSEQ_EVENT_GUARD) {
 		event = t->rseq_event.sched_switch;
 		t->rseq_event.sched_switch = false;
+		ids.cpu_id = task_cpu(t);
+		ids.mm_cid = task_mm_cid(t);
 	}
 
 	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -461,7 +308,8 @@ void __rseq_handle_notify_resume(struct
 	if (!rseq_handle_cs(t, regs))
 		goto error;
 
-	if (unlikely(rseq_update_cpu_node_id(t)))
+	node_id = cpu_to_node(ids.cpu_id);
+	if (!rseq_set_uids(t, &ids, node_id))
 		goto error;
 	return;
 
@@ -502,13 +350,33 @@ void rseq_syscall(struct pt_regs *regs)
 }
 #endif
 
+static bool rseq_reset_ids(void)
+{
+	struct rseq_ids ids = {
+		.cpu_id	= RSEQ_CPU_ID_UNINITIALIZED,
+		.mm_cid	= 0,
+	};
+
+	/*
+	 * If this fails, terminate it because this leaves the kernel in
+	 * stupid state as exit to user space will try to fixup the ids
+	 * again.
+	 */
+	if (rseq_set_uids(current, &ids, 0))
+		return true;
+
+	force_sig(SIGSEGV);
+	return false;
+}
+
+/* The original rseq structure size (including padding) is 32 bytes. */
+#define ORIG_RSEQ_SIZE	32
+
 /*
  * sys_rseq - setup restartable sequences for caller thread.
 */
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags,
		u32, sig)
{
-	int ret;
-
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
 			return -EINVAL;
@@ -519,9 +387,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
-		ret = rseq_reset_rseq_cpu_node_id(current);
-		if (ret)
-			return ret;
+		if (!rseq_reset_ids())
+			return -EFAULT;
 		current->rseq = NULL;
 		current->rseq_sig = 0;
 		current->rseq_len = 0;
@@ -574,17 +441,6 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	if (put_user_masked_u64(0UL, &rseq->rseq_cs))
 		return -EFAULT;
 
-#ifdef CONFIG_DEBUG_RSEQ
-	/*
-	 * Initialize the in-kernel rseq fields copy for validation of
-	 * read-only fields.
-	 */
-	if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
-	    get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
-	    get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
-	    get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
-		return -EFAULT;
-#endif
 	/*
 	 * Activate the registration by setting the rseq area address, length
	 * and signature in the task struct.
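[As a reference point for what these stores mean on the user side: with
glibc 2.35 or later the registered rseq area sits at __rseq_offset from the
thread pointer, so the IDs the kernel writes above can be read back as
plain TLS loads. A rough sketch, assuming <sys/rseq.h> pulling in the uapi
struct, kernel headers new enough to define node_id/mm_cid, and a compiler
providing __builtin_thread_pointer():]

#include <stdio.h>
#include <sys/rseq.h>

static struct rseq *thread_rseq(void)
{
	/* glibc exports the offset of the per-thread rseq area */
	return (struct rseq *)((char *)__builtin_thread_pointer() + __rseq_offset);
}

int main(void)
{
	struct rseq *rs = thread_rseq();

	if (!__rseq_size) {
		puts("rseq not registered by libc");
		return 1;
	}
	printf("cpu_id %u node_id %u mm_cid %u\n",
	       rs->cpu_id, rs->node_id, rs->mm_cid);
	return 0;
}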
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161654.805274429@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 24/37] rseq: Separate the signal delivery path
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:06 +0200 (CEST)

Completely separate the signal delivery path from the notify handler as
they have different semantics versus the event handling.

The signal delivery only needs to ensure that the interrupted user context
was not in a critical section, or that the section is aborted before it
switches to the signal frame context. The signal frame context does not
have the original instruction pointer anymore, so that can't be handled on
exit to user space.

No point in updating the CPU/CID ids as they might change again before the
task returns to user space for real.

The fast path optimization, which checks for the 'entry from user via
interrupt' condition, is only available for architectures which use the
generic entry code.

Signed-off-by: Thomas Gleixner
---
 include/linux/rseq.h       | 21 ++++++++++++++++-----
 include/linux/rseq_entry.h | 29 +++++++++++++++++++++++++++++
 kernel/rseq.c              | 30 ++++++++++++++++++++--------
 3 files changed, 67 insertions(+), 13 deletions(-)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -5,22 +5,33 @@
 #ifdef CONFIG_RSEQ
 #include
 
-void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
+void __rseq_handle_notify_resume(struct pt_regs *regs);
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
 	if (current->rseq_event.has_rseq)
-		__rseq_handle_notify_resume(NULL, regs);
+		__rseq_handle_notify_resume(regs);
 }
 
+void __rseq_signal_deliver(int sig, struct pt_regs *regs);
+
+/*
+ * Invoked from signal delivery to fixup based on the register context
+ * before switching to the signal delivery context.
+ */
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
-	if (current->rseq_event.has_rseq) {
-		current->rseq_event.sched_switch = true;
-		__rseq_handle_notify_resume(ksig, regs);
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+		/* '&' is intentional to spare one conditional branch */
+		if (current->rseq_event.has_rseq & current->rseq_event.user_irq)
+			__rseq_signal_deliver(ksig->sig, regs);
+	} else {
+		if (current->rseq_event.has_rseq)
+			__rseq_signal_deliver(ksig->sig, regs);
 	}
 }
 
+/* Raised from context switch and execve to force evaluation on exit to user */
 static inline void rseq_sched_switch_event(struct task_struct *t)
 {
 	if (t->rseq_event.has_rseq) {

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -364,6 +364,35 @@ bool rseq_set_uids_get_csaddr(struct tas
 	return false;
 }
 
+/*
+ * Update user space with new IDs and conditionally check whether the task
+ * is in a critical section.
+ */
+static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *regs,
+					struct rseq_ids *ids, u32 node_id)
+{
+	u64 csaddr;
+
+	if (!rseq_set_uids_get_csaddr(t, ids, node_id, &csaddr))
+		return false;
+
+	/*
+	 * On architectures which utilize the generic entry code this
+	 * allows to skip the critical section when the entry was not from
+	 * a user space interrupt, unless debug mode is enabled.
+	 */
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+		if (!static_branch_unlikely(&rseq_debug_enabled)) {
+			if (likely(!t->rseq_event.user_irq))
+				return true;
+		}
+	}
+	if (likely(!csaddr))
+		return true;
+	/* Sigh, this really needs to do work */
+	return rseq_update_user_cs(t, regs, csaddr);
+}
+
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	struct rseq_event *ev = &current->rseq_event;

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -247,13 +247,12 @@ static bool rseq_handle_cs(struct task_s
  * respect to other threads scheduled on the same CPU, and with respect
  * to signal handlers.
  */
-void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
+void __rseq_handle_notify_resume(struct pt_regs *regs)
 {
 	struct task_struct *t = current;
 	struct rseq_ids ids;
 	u32 node_id;
 	bool event;
-	int sig;
 
 	/*
 	 * If invoked from hypervisors before entering the guest via
@@ -272,10 +271,7 @@ void __rseq_handle_notify_resume(struct
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
-	if (ksig)
-		rseq_stat_inc(rseq_stats.signal);
-	else
-		rseq_stat_inc(rseq_stats.slowpath);
+	rseq_stat_inc(rseq_stats.slowpath);
 
 	/*
	 * Read and clear the event pending bit first. If the task
@@ -314,8 +310,26 @@ void __rseq_handle_notify_resume(struct
 	return;
 
 error:
-	sig = ksig ? ksig->sig : 0;
-	force_sigsegv(sig);
+	force_sig(SIGSEGV);
+}
+
+void __rseq_signal_deliver(int sig, struct pt_regs *regs)
+{
+	rseq_stat_inc(rseq_stats.signal);
+	/*
+	 * Don't update IDs, they are handled on exit to user if
+	 * necessary. The important thing is to abort a critical section of
+	 * the interrupted context as after this point the instruction
+	 * pointer in @regs points to the signal handler.
+	 */
+	if (unlikely(!rseq_handle_cs(current, regs))) {
+		/*
+		 * Clear the errors just in case this might survive
+		 * magically, but leave the rest intact.
+		 */
+		current->rseq_event.error = 0;
+		force_sigsegv(sig);
+	}
+}
 
 void __rseq_debug_syscall_return(struct pt_regs *regs)
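[The abort decision rseq_handle_cs() applies at signal delivery can be
stated compactly in terms of the uapi struct rseq_cs descriptor: if the
interrupted instruction pointer lies inside the registered range, execution
must resume at abort_ip before the signal frame is built. The helper below
is an illustrative model of that check, not kernel code:]

#include <linux/rseq.h>
#include <stdbool.h>
#include <stdint.h>

static bool cs_needs_abort(const struct rseq_cs *cs, uint64_t ip,
			   uint64_t *new_ip)
{
	/* Unsigned wrap makes this one compare for ip in [start, start+len) */
	if (ip - cs->start_ip >= cs->post_commit_offset)
		return false;		/* ip outside the critical section */
	*new_ip = cs->abort_ip;		/* restart at the abort handler */
	return true;
}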
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161654.869197102@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 25/37] rseq: Rework the TIF_NOTIFY handler
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:09 +0200 (CEST)

Replace the whole logic with the new implementation, which is shared with
signal delivery and the upcoming exit fast path.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 kernel/rseq.c | 78 +++++++++++++++++++++++++----------------------------
 1 file changed, 34 insertions(+), 44 deletions(-)

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -82,12 +82,6 @@
 #define CREATE_TRACE_POINTS
 #include
 
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD	irq
-#else
-# define RSEQ_EVENT_GUARD	preempt
-#endif
-
 DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
 
 static inline void rseq_control_debug(bool on)
@@ -236,38 +230,15 @@ static bool rseq_handle_cs(struct task_s
 	return rseq_update_user_cs(t, regs, csaddr);
 }
 
-/*
- * This resume handler must always be executed between any of:
- * - preemption,
- * - signal delivery,
- * and return to user-space.
- *
- * This is how we can ensure that the entire rseq critical section
- * will issue the commit instruction only if executed atomically with
- * respect to other threads scheduled on the same CPU, and with respect
- * to signal handlers.
- */
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+static void rseq_slowpath_update_usr(struct pt_regs *regs)
 {
+	/* Preserve rseq state and user_irq state for exit to user */
+	const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
 	struct task_struct *t = current;
 	struct rseq_ids ids;
 	u32 node_id;
 	bool event;
 
-	/*
-	 * If invoked from hypervisors before entering the guest via
-	 * resume_user_mode_work(), then @regs is a NULL pointer.
-	 *
-	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
-	 * it before returning from the ioctl() to user space when
-	 * rseq_event.sched_switch is set.
-	 *
-	 * So it's safe to ignore here instead of pointlessly updating it
-	 * in the vcpu_run() loop.
-	 */
-	if (!regs)
-		return;
-
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
@@ -291,26 +262,45 @@ void __rseq_handle_notify_resume(struct
 	 * with the result handed in to allow the detection of
 	 * inconsistencies.
 	 */
-	scoped_guard(RSEQ_EVENT_GUARD) {
-		event = t->rseq_event.sched_switch;
-		t->rseq_event.sched_switch = false;
+	scoped_guard(irq) {
 		ids.cpu_id = task_cpu(t);
 		ids.mm_cid = task_mm_cid(t);
+		event = t->rseq_event.sched_switch;
+		t->rseq_event.all &= evt_mask.all;
 	}
 
-	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+	if (!event)
 		return;
 
-	if (!rseq_handle_cs(t, regs))
-		goto error;
-
 	node_id = cpu_to_node(ids.cpu_id);
-	if (!rseq_set_uids(t, &ids, node_id))
-		goto error;
-	return;
 
-error:
-	force_sig(SIGSEGV);
+	if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) {
+		/*
+		 * Clear the errors just in case this might survive magically, but
+		 * leave the rest intact.
+		 */
+		t->rseq_event.error = 0;
+		force_sig(SIGSEGV);
+	}
+}
+
+void __rseq_handle_notify_resume(struct pt_regs *regs)
+{
+	/*
+	 * If invoked from hypervisors before entering the guest via
+	 * resume_user_mode_work(), then @regs is a NULL pointer.
+	 *
+	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+	 * it before returning from the ioctl() to user space when
+	 * rseq_event.sched_switch is set.
+	 *
+	 * So it's safe to ignore here instead of pointlessly updating it
+	 * in the vcpu_run() loop.
+	 */
+	if (!regs)
+		return;
+
+	rseq_slowpath_update_usr(regs);
 }
 
 void __rseq_signal_deliver(int sig, struct pt_regs *regs)
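[The scoped_guard(irq) section above exists so the ID snapshot and the
event consumption cannot interleave with preemption or the membarrier IPI.
A userspace analogue of that snapshot-and-clear idea, with signal masking
as the stand-in guard and a signal handler as the stand-in event producer;
all names are invented:]

#include <signal.h>
#include <stdbool.h>

struct event_state {
	volatile bool sched_switch;
	volatile int cpu_id, mm_cid;
};

static bool snapshot_and_clear(struct event_state *ev, int *cpu, int *cid)
{
	sigset_t all, old;
	bool pending;

	sigfillset(&all);
	sigprocmask(SIG_BLOCK, &all, &old);	/* "interrupts off" */
	*cpu = ev->cpu_id;			/* IDs and flag read together */
	*cid = ev->mm_cid;
	pending = ev->sched_switch;
	ev->sched_switch = false;		/* consume the event */
	sigprocmask(SIG_SETMASK, &old, NULL);	/* "interrupts on" */
	return pending;
}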
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161654.935413328@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 26/37] rseq: Optimize event setting
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:12 +0200 (CEST)

After removing the various condition bits earlier it turns out that one
extra piece of information is needed to avoid setting event::sched_switch
and TIF_NOTIFY_RESUME unconditionally on every context switch.

The update of the RSEQ user space memory is only required when either the
task was interrupted in user space and schedules, or the CPU or MM CID
changes in schedule() independent of the entry mode. Right now only the
interrupt from user information is available.

Add an event flag which is set when the CPU or MM CID or both change.

Evaluate this event in the scheduler to decide whether the sched_switch
event and the TIF bit need to be set. It's an extra conditional in
context_switch(), but the downside of unconditionally handling RSEQ after a
context switch to user is way more significant. The utilized boolean logic
minimizes this to a single conditional branch.

Signed-off-by: Thomas Gleixner
---
 fs/exec.c                  |  2 -
 include/linux/rseq.h       | 81 ++++++++++++++++++++++++++++++++++++-----
 include/linux/rseq_types.h | 11 +++++-
 kernel/rseq.c              |  2 -
 kernel/sched/core.c        |  7 +++
 kernel/sched/sched.h       |  5 ++
 6 files changed, 95 insertions(+), 13 deletions(-)

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp
 		force_fatal_sig(SIGSEGV);
 
 	sched_mm_cid_after_execve(current);
-	rseq_sched_switch_event(current);
+	rseq_force_update();
 	current->in_execve = 0;
 
 	return retval;

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -9,7 +9,8 @@ void __rseq_handle_notify_resume(struct
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
-	if (current->rseq_event.has_rseq)
+	/* '&' is intentional to spare one conditional branch */
+	if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
 		__rseq_handle_notify_resume(regs);
 }
 
@@ -31,12 +32,75 @@ static inline void rseq_signal_deliver(s
 	}
 }
 
-/* Raised from context switch and exevce to force evaluation on exit to user */
-static inline void rseq_sched_switch_event(struct task_struct *t)
+static inline void rseq_raise_notify_resume(struct task_struct *t)
 {
-	if (t->rseq_event.has_rseq) {
-		t->rseq_event.sched_switch = true;
-		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+
+/* Invoked from context switch to force evaluation on exit to user */
+static __always_inline void rseq_sched_switch_event(struct task_struct *t)
+{
+	struct rseq_event *ev = &t->rseq_event;
+
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+		/*
+		 * Avoid a boat load of conditionals by using simple logic
+		 * to determine whether NOTIFY_RESUME needs to be raised.
+		 *
+		 * It's required when the CPU or MM CID has changed or
+		 * the entry was from user space.
+		 */
+		bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+
+		if (raise) {
+			ev->sched_switch = true;
+			rseq_raise_notify_resume(t);
+		}
+	} else {
+		if (ev->has_rseq) {
+			t->rseq_event.sched_switch = true;
+			rseq_raise_notify_resume(t);
+		}
+	}
+}
+
+/*
+ * Invoked from __set_task_cpu() when a task migrates to enforce an IDs
+ * update.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu)
+{
+	t->rseq_event.ids_changed = true;
+}
+
+/*
+ * Invoked from switch_mm_cid() in context switch when the task gets a MM
+ * CID assigned.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid)
+{
+	/*
+	 * Requires a comparison as the switch_mm_cid() code does not
+	 * provide a conditional for it readily. So avoid excessive updates
+	 * when nothing changes.
+	 */
+	if (t->rseq_ids.mm_cid != cid)
+		t->rseq_event.ids_changed = true;
+}
+
+/* Enforce a full update after RSEQ registration and when execve() failed */
+static inline void rseq_force_update(void)
+{
+	if (current->rseq_event.has_rseq) {
+		current->rseq_event.ids_changed = true;
+		current->rseq_event.sched_switch = true;
+		rseq_raise_notify_resume(current);
 	}
 }
 
@@ -53,7 +117,7 @@ static inline void rseq_sched_switch_eve
 static inline void rseq_virt_userspace_exit(void)
 {
 	if (current->rseq_event.sched_switch)
-		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+		rseq_raise_notify_resume(current);
 }
 
 /*
@@ -90,6 +154,9 @@ static inline void rseq_execve(struct ta
 static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
+static inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid) { }
+static inline void rseq_force_update(void) { }
 static inline void rseq_virt_userspace_exit(void) { }
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
 static inline void rseq_execve(struct task_struct *t) { }

--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -10,20 +10,27 @@ struct rseq;
  * struct rseq_event - Storage for rseq related event management
  * @all:		Compound to initialize and clear the data efficiently
  * @events:		Compund to access events with a single load/store
- * @sched_switch:	True if the task was scheduled out
+ * @sched_switch:	True if the task was scheduled and needs update on
+ *			exit to user
+ * @ids_changed:	Indicator that IDs need to be updated
 * @user_irq:		True on interrupt entry from user mode
 * @has_rseq:		True if the task has a rseq pointer installed
 * @error:		Compound error code for the slow path to analyze
 * @fatal:		User space data corrupted or invalid
+ *
+ * @sched_switch and @ids_changed must be adjacent and the combo must be
+ * 16bit aligned to allow a single store, when both are set at the same
+ * time in the scheduler.
 */
struct rseq_event {
	union {
		u64				all;
		struct {
			union {
-				u16		events;
+				u32		events;
				struct {
					u8	sched_switch;
+					u8	ids_changed;
					u8	user_irq;
				};
			};

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -459,7 +459,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	 * are updated before returning to user-space.
 	 */
 	current->rseq_event.has_rseq = true;
-	rseq_sched_switch_event(current);
+	rseq_force_update();
 
 	return 0;
 }

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5150,7 +5150,6 @@ prepare_task_switch(struct rq *rq, struc
 	kcov_prepare_switch(prev);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
-	rseq_sched_switch_event(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
@@ -5348,6 +5347,12 @@ context_switch(struct rq *rq, struct tas
 	/* switch_mm_cid() requires the memory barriers above. */
 	switch_mm_cid(rq, prev, next);
 
+	/*
+	 * Tell rseq that the task was scheduled in. Must be after
+	 * switch_mm_cid() to get the TIF flag set.
+	 */
+	rseq_sched_switch_event(next);
+
 	prepare_lock_switch(rq, next, rf);
 
 	/* Here we just switch the register state and the stack. */

--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2181,6 +2181,7 @@ static inline void __set_task_cpu(struct
 	smp_wmb();
 	WRITE_ONCE(task_thread_info(p)->cpu, cpu);
 	p->wake_cpu = cpu;
+	rseq_sched_set_task_cpu(p, cpu);
 #endif /* CONFIG_SMP */
 }
 
@@ -3778,8 +3779,10 @@ static inline void switch_mm_cid(struct
 		mm_cid_put_lazy(prev);
 		prev->mm_cid = -1;
 	}
-	if (next->mm_cid_active)
+	if (next->mm_cid_active) {
 		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
+		rseq_sched_set_task_mm_cid(next, next->mm_cid);
+	}
 }
 
 #else /* !CONFIG_SCHED_MM_CID: */
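[The 'single conditional branch' claim rests on this flag layout: the u8
flags alias a wider word, so several of them can be tested or cleared with
one plain access. A simplified, compile-checkable model of the trick, with
the field set reduced for illustration and not matching the kernel struct
exactly:]

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct event {
	union {
		uint64_t all;
		struct {
			union {
				uint32_t events;
				struct {
					uint8_t sched_switch;
					uint8_t ids_changed;
					uint8_t user_irq;
				};
			};
			uint8_t has_rseq;	/* stand-in for the state bits */
		};
	};
};

int main(void)
{
	struct event ev = { 0 };

	static_assert(sizeof(struct event) == 8, "fits one register");
	ev.has_rseq = 1;
	ev.sched_switch = 1;
	ev.ids_changed = 1;	/* adjacent flags: could be a single u16 store */
	ev.events = 0;		/* one store clears all three event flags */
	printf("all=%#llx (state survived)\n", (unsigned long long)ev.all);
	return 0;
}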
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161655.000005616@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 27/37] rseq: Implement fast path for exit to user
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:14 +0200 (CEST)

Implement the actual logic for handling RSEQ updates in a fast path after
handling the TIF work and at the point where the task is actually
returning to user space.

This is the right point to do that because at this point the CPU and the
MM CID are stable and can no longer change due to yet another reschedule.
That can still happen when the task handles it via TIF_NOTIFY_RESUME in
resume_user_mode_work(), which is invoked from the exit to user mode work
loop.

The function is invoked after the TIF work is handled and runs with
interrupts disabled, which means it cannot resolve page faults. It
therefore disables page faults, and in case the access to the user space
memory faults, it:

      - notes the fail in the event struct
      - raises TIF_NOTIFY_RESUME
      - returns false to the caller

The caller has to go back to the TIF work, which runs with interrupts
enabled and therefore can resolve the page faults. This happens mostly on
fork() when the memory is marked COW. That will be optimized by setting
the failure flag and raising TIF_NOTIFY_RESUME right on fork to avoid the
otherwise unavoidable round trip.

If the user memory inspection finds invalid data, the function returns
false as well and sets the fatal flag in the event struct along with
TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
and terminate the task with SIGSEGV as documented.

The initial decision to invoke any of this is based on two flags in the
event struct: @has_rseq and @sched_switch. The decision is in pseudo ASM:

	load	tsk::event::has_rseq and tsk::event::sched_switch
	jnz	inspect_user_space
	mov	$0, tsk::event::events
	...
	leave

So for the common case where the task was not scheduled out, this really
boils down to four instructions before going out, if the compiler is not
completely stupid (and yes, some of them are).

If the condition is true, then it checks whether CPU ID or MM CID have
changed. If so, then the CPU/MM IDs have to be updated and are thereby
cached for the next round. The update unconditionally retrieves the user
space critical section address to spare another user*begin/end() pair. If
that's not zero and tsk::event::user_irq is set, then the critical section
is analyzed and acted upon. If it is either zero or the entry came via
syscall, the critical section analysis is skipped.

If the comparison is false, then the critical section has to be analyzed,
because the event flag is then only true when the entry from user space
was by interrupt.

This is provided without the actual hookup to let reviewers focus on the
implementation details. The hookup happens in the next step.

Note: As with quite some other optimizations this depends on the generic
entry infrastructure and is not enabled to be sucked into random
architecture implementations.

Signed-off-by: Thomas Gleixner
---
 include/linux/rseq_entry.h | 137 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/rseq_types.h |   3
 kernel/rseq.c              |   2
 3 files changed, 139 insertions(+), 3 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -10,6 +10,7 @@ struct rseq_stats {
 	unsigned long	exit;
 	unsigned long	signal;
 	unsigned long	slowpath;
+	unsigned long	fastpath;
 	unsigned long	ids;
 	unsigned long	cs;
 	unsigned long	clear;
@@ -204,8 +205,8 @@ bool rseq_debug_update_user_cs(struct ta
 
 /*
  * On debug kernels validate that user space did not mess with it if
- * DEBUG_RSEQ is enabled, but don't on the first exit to user space. In
- * that case cpu_cid is ~0. See fork/execve.
+ * debugging is enabled, but don't do that on the first exit to user
+ * space. In that case cpu_cid is ~0. See fork/execve.
 */
bool rseq_debug_validate_uids(struct task_struct *t)
{
@@ -393,6 +394,131 @@ static rseq_inline bool rseq_update_usr(
 	return rseq_update_user_cs(t, regs, csaddr);
 }
 
+/*
+ * If you want to use this then convert your architecture to the generic
+ * entry code. I'm tired of building workarounds for people who can't be
+ * bothered to make the maintenance of generic infrastructure less
+ * burdensome. Just sucking everything into the architecture code and
+ * thereby making others chase the horrible hacks and keep them working is
+ * neither acceptable nor sustainable.
+ */
+#ifdef CONFIG_GENERIC_ENTRY
+
+/*
+ * This is inlined into the exit path because:
+ *
+ * 1) It's a one time comparison in the fast path when there is no event to
+ *    handle
+ *
+ * 2) The access to the user space rseq memory (TLS) is unlikely to fault
+ *    so the straight inline operation is:
+ *
+ *    - Four 32-bit stores only if CPU ID/ MM CID need to be updated
+ *    - One 64-bit load to retrieve the critical section address
+ *
+ * 3) In the unlikely case that the critical section address is != NULL:
+ *
+ *    - One 64-bit load to retrieve the start IP
+ *    - One 64-bit load to retrieve the offset for calculating the end
+ *    - One 64-bit load to retrieve the abort IP
+ *    - One store to clear the critical section address
+ *
+ * The non-debug case implements only the minimal required checking and
+ * protection against a rogue abort IP in kernel space, which would be
+ * exploitable at least on x86.
+ * Any fallout from invalid critical section
+ * descriptors is a user space problem. The debug case provides the full
+ * set of checks and terminates the task if a condition is not met.
+ *
+ * In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
+ * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
+ * slow path there will handle the fail.
+ */
+static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+	struct task_struct *t = current;
+
+	/*
+	 * If the task did not go through schedule or got the flag enforced
+	 * by the rseq syscall or execve, then nothing to do here.
+	 *
+	 * CPU ID and MM CID can only change when going through a context
+	 * switch.
+	 *
+	 * This can only be done when rseq_event::has_rseq is true.
+	 * rseq_sched_switch_event() sets rseq_event::sched unconditionally
+	 * true to avoid a load of rseq_event::has_rseq in the context
+	 * switch path.
+	 *
+	 * This check uses a '&' and not a '&&' to force the compiler to do
+	 * an actual AND operation instead of two separate conditionals.
+	 *
+	 * A sane compiler requires four instructions for the nothing to do
+	 * case including clearing the events, but your mileage might vary.
+	 */
+	if (likely(!(t->rseq_event.sched_switch & t->rseq_event.has_rseq)))
+		goto done;
+
+	rseq_stat_inc(rseq_stats.fastpath);
+
+	pagefault_disable();
+
+	if (likely(!t->rseq_event.ids_changed)) {
+		/*
+		 * If IDs have not changed rseq_event::user_irq must be true
+		 * See rseq_sched_switch_event().
+		 */
+		u64 csaddr;
+
+		if (unlikely(get_user_masked_u64(&csaddr, &t->rseq->rseq_cs)))
+			goto fail;
+
+		if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
+			if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
+				goto fail;
+		}
+	} else {
+		struct rseq_ids ids = {
+			.cpu_id	= task_cpu(t),
+			.mm_cid	= task_mm_cid(t),
+		};
+		u32 node_id = cpu_to_node(ids.cpu_id);
+
+		if (unlikely(!rseq_update_usr(t, regs, &ids, node_id)))
+			goto fail;
+	}
+
+	pagefault_enable();
+
+done:
+	/* Clear state so next entry starts from a clean slate */
+	t->rseq_event.events = 0;
+	return false;
+
+fail:
+	pagefault_enable();
+	/* Force it into the slow path. Don't clear the state! */
+	t->rseq_event.slowpath = true;
+	set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	return true;
+}
+
+static __always_inline unsigned long
+rseq_exit_to_user_mode_work(struct pt_regs *regs, unsigned long ti_work, const unsigned long mask)
+{
+	/*
+	 * Check if all work bits have been cleared before handling rseq.
+	 */
+	if ((ti_work & mask) != 0)
+		return ti_work;
+
+	if (likely(!__rseq_exit_to_user_mode_restart(regs)))
+		return ti_work;
+
+	return ti_work | _TIF_NOTIFY_RESUME;
+}
+
+#endif /* !CONFIG_GENERIC_ENTRY */
+
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	struct rseq_event *ev = &current->rseq_event;
@@ -417,8 +543,13 @@ static inline void rseq_debug_syscall_re
 	if (static_branch_unlikely(&rseq_debug_enabled))
 		__rseq_debug_syscall_return(regs);
 }
-
 #else /* CONFIG_RSEQ */
+static inline unsigned long rseq_exit_to_user_mode_work(struct pt_regs *regs,
+							 unsigned long ti_work,
+							 const unsigned long mask)
+{
+	return ti_work;
+}
 static inline void rseq_note_user_irq_entry(void) { }
 static inline void rseq_exit_to_user_mode(void) { }
 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }

--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -17,6 +17,8 @@ struct rseq;
 * @has_rseq:		True if the task has a rseq pointer installed
 * @error:		Compound error code for the slow path to analyze
 * @fatal:		User space data corrupted or invalid
+ * @slowpath:		Indicator that slow path processing via TIF_NOTIFY_RESUME
+ *			is required
 *
 * @sched_switch and @ids_changed must be adjacent and the combo must be
 * 16bit aligned to allow a single store, when both are set at the same
@@ -41,6 +43,7 @@ struct rseq_event {
			u16		error;
			struct {
				u8	fatal;
+				u8	slowpath;
			};
		};
	};

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -133,6 +133,7 @@ static int rseq_stats_show(struct seq_fi
 		stats.exit	+= data_race(per_cpu(rseq_stats.exit, cpu));
 		stats.signal	+= data_race(per_cpu(rseq_stats.signal, cpu));
 		stats.slowpath	+= data_race(per_cpu(rseq_stats.slowpath, cpu));
+		stats.fastpath	+= data_race(per_cpu(rseq_stats.fastpath, cpu));
 		stats.ids	+= data_race(per_cpu(rseq_stats.ids, cpu));
 		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
 		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
@@ -142,6 +143,7 @@ static int rseq_stats_show(struct seq_fi
 	seq_printf(m, "exit:   %16lu\n", stats.exit);
 	seq_printf(m, "signal: %16lu\n", stats.signal);
 	seq_printf(m, "slowp:  %16lu\n", stats.slowpath);
+	seq_printf(m, "fastp:  %16lu\n", stats.fastpath);
 	seq_printf(m, "ids:    %16lu\n", stats.ids);
 	seq_printf(m, "cs:     %16lu\n", stats.cs);
 	seq_printf(m, "clear:  %16lu\n", stats.clear);
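[The contract between the fast path and the TIF loop can be modeled in a
few lines: rseq work only runs once the work mask is empty, and a failure
re-arms one work bit so the slow path gets another pass with interrupts
enabled. A toy model under those assumptions, not the kernel loop itself:]

#include <stdbool.h>
#include <stdio.h>

#define WORK_RESCHED	0x1UL
#define WORK_NOTIFY	0x2UL

static unsigned long handle_work(unsigned long work)
{
	return 0;	/* pretend all pending items got processed */
}

static bool rseq_fastpath(bool *fail_once)
{
	if (*fail_once) {		/* e.g. page fault with IRQs off */
		*fail_once = false;
		return true;		/* needs the slow path */
	}
	return false;
}

int main(void)
{
	unsigned long work = WORK_RESCHED;
	bool fail_once = true;

	do {
		work = handle_work(work);
		if (!work && rseq_fastpath(&fail_once))
			work |= WORK_NOTIFY;	/* loop back into slow path */
	} while (work);
	puts("exit to user");
	return 0;
}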
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161655.063205235@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 28/37] rseq: Switch to fast path processing on exit to user
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:17 +0200 (CEST)

Now that all bits and pieces are in place, hook the RSEQ handling fast path
function into exit_to_user_mode_prepare() after the TIF work bits have been
handled. In case of fast path failure, TIF_NOTIFY_RESUME has been raised
and the caller needs to take another turn through the TIF handling slow
path.

This only works for architectures which use the generic entry code.
Architectures which still have their own incomplete hacks are not supported
and won't be.
This results in the following improvements:

  Kernel build               Before              After        Reduction

  exit to user:     80692981           80514451
  signal checks:       32581                121                   99%
  slowpath runs:     1201408  1.49%        198  0.00%            100%
  fastpath runs:                        675941  0.84%             N/A
  id updates:        1233989  1.53%      50541  0.06%             96%
  cs checks:         1125366  1.39%          0  0.00%            100%
  cs cleared:        1125366   100%          0                   100%
  cs fixup:                0     0%          0

  RSEQ selftests             Before              After        Reduction

  exit to user:    386281778          387373750
  signal checks:    35661203                  0                  100%
  slowpath runs:   140542396 36.38%         100  0.00%           100%
  fastpath runs:                         9509789  2.51%           N/A
  id updates:      176203599 45.62%     9087994  2.35%            95%
  cs checks:       175587856 45.46%     4728394  1.22%            98%
  cs cleared:      172359544 98.16%     1319307 27.90%            99%
  cs fixup:          3228312  1.84%     3409087 72.10%

The 'cs cleared' and 'cs fixup' percentages are not relative to the exit
to user invocations, they are relative to the actual 'cs check'
invocations.

While some of this could have been avoided in the original code, like the
obvious clearing of CS when it's already clear, the main problem of going
through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
notify handler is invoked more than once before going out to user
space. Doing this once when everything has stabilized is the only solution
to avoid this.

The initial attempt to completely decouple it from the TIF work turned out
to be suboptimal for workloads that do a lot of quick and short system
calls. Even if the fast path decision is only 4 instructions (including a
conditional branch), this adds up quickly and becomes measurable when the
rate for actually having to handle rseq is in the low single digit
percentage range of user/kernel transitions.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/irq-entry-common.h |  7 ++-----
 include/linux/resume_user_mode.h |  2 +-
 include/linux/rseq.h             | 24 ++++++++++++++++++------
 include/linux/rseq_entry.h       |  2 +-
 init/Kconfig                     |  2 +-
 kernel/entry/common.c            | 17 ++++++++++++++---
 kernel/rseq.c                    |  8 ++++++--
 7 files changed, 43 insertions(+), 19 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -197,11 +197,8 @@ static __always_inline void arch_exit_to
  */
 void arch_do_signal_or_restart(struct pt_regs *regs);
 
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- */
-unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-				     unsigned long ti_work);
+/* Handle pending TIF work */
+unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
 
 /**
  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work
	mem_cgroup_handle_over_high(GFP_KERNEL);
	blkcg_maybe_throttle_current();
 
-	rseq_handle_notify_resume(regs);
+	rseq_handle_slowpath(regs);
 }
 
 #endif /* LINUX_RESUME_USER_MODE_H */
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -5,13 +5,19 @@
 #ifdef CONFIG_RSEQ
 #include
 
-void __rseq_handle_notify_resume(struct pt_regs *regs);
+void __rseq_handle_slowpath(struct pt_regs *regs);
 
-static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+/* Invoked from resume_user_mode_work() */
+static inline void rseq_handle_slowpath(struct pt_regs *regs)
 {
-	/* '&' is intentional to spare one conditional branch */
-	if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
-		__rseq_handle_notify_resume(regs);
+	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
+		if (current->rseq_event.slowpath)
+			__rseq_handle_slowpath(regs);
+	} else {
+		/* '&' is intentional to spare one conditional branch */
+		if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
+			__rseq_handle_slowpath(regs);
+	}
 }
 
 void __rseq_signal_deliver(int sig, struct pt_regs *regs);
@@ -138,6 +144,12 @@ static inline void rseq_fork(struct task
		t->rseq_sig = current->rseq_sig;
		t->rseq_ids.cpu_cid = ~0ULL;
		t->rseq_event = current->rseq_event;
+		/*
+		 * If it has rseq, force it into the slow path right away
+		 * because it is guaranteed to fault.
+		 */
+		if (t->rseq_event.has_rseq)
+			t->rseq_event.slowpath = true;
	}
 }
 
@@ -151,7 +163,7 @@ static inline void rseq_execve(struct ta
 }
 
 #else /* CONFIG_RSEQ */
-static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
 static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -433,7 +433,7 @@ static rseq_inline bool rseq_update_usr(
  * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
  * slow path there will handle the fail.
  */
-static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
 {
	struct task_struct *t = current;
 
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1911,7 +1911,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
 config DEBUG_RSEQ
	default n
	bool "Enable debugging of rseq() system call" if EXPERT
-	depends on RSEQ && DEBUG_KERNEL
+	depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
	select RSEQ_DEBUG_DEFAULT_ENABLE
	help
	  Enable extra debugging checks for the rseq system call.
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -23,8 +23,7 @@ void __weak arch_do_signal_or_restart(st
	 * Before returning to user space ensure that all pending work
	 * items have been completed.
	 */
-	while (ti_work & EXIT_TO_USER_MODE_WORK) {
-
+	do {
		local_irq_enable_exit_to_user(ti_work);
 
		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
@@ -56,7 +55,19 @@ void __weak arch_do_signal_or_restart(st
			tick_nohz_user_enter_prepare();
 
		ti_work = read_thread_flags();
-	}
+
+		/*
+		 * This returns the unmodified ti_work, when ti_work is not
+		 * empty. In that case it waits for the next round to avoid
+		 * multiple updates in case of rescheduling.
+		 *
+		 * When it handles rseq it returns either with empty work
+		 * on success or with TIF_NOTIFY_RESUME set on failure to
+		 * kick the handling into the slow path.
+		 */
+		ti_work = rseq_exit_to_user_mode_work(regs, ti_work, EXIT_TO_USER_MODE_WORK);
+
+	} while (ti_work & EXIT_TO_USER_MODE_WORK);
 
	/* Return the latest work state for arch_exit_to_user_mode() */
	return ti_work;
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -234,7 +234,11 @@ static bool rseq_handle_cs(struct task_s
 
 static void rseq_slowpath_update_usr(struct pt_regs *regs)
 {
-	/* Preserve rseq state and user_irq state for exit to user */
+	/*
+	 * Preserve rseq state and user_irq state. The generic entry code
+	 * clears user_irq on the way out; the non-generic entry
+	 * architectures do not have user_irq.
+	 */
	const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
	struct task_struct *t = current;
	struct rseq_ids ids;
@@ -286,7 +290,7 @@ static void rseq_slowpath_update_usr(str
	}
 }
 
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+void __rseq_handle_slowpath(struct pt_regs *regs)
 {
	/*
	 * If invoked from hypervisors before entering the guest via
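The contract of the reworked loop is easy to model outside the kernel: the rseq step only runs once all other work bits are clear, and on failure it raises the notify bit so the loop takes exactly one more slow path round. A stand-alone sketch with local stand-in names (not kernel APIs):

#include <stdio.h>

#define WORK_RESCHED	0x1UL
#define WORK_NOTIFY	0x2UL
#define WORK_ALL	(WORK_RESCHED | WORK_NOTIFY)

static int rseq_fail_once = 1;

static unsigned long model_rseq_step(unsigned long work)
{
	if (work & WORK_ALL)		/* other work pending: wait for next round */
		return work;
	if (rseq_fail_once--) {		/* simulated user space fault */
		puts("rseq fast path failed, raising NOTIFY");
		return work | WORK_NOTIFY;
	}
	puts("rseq fast path succeeded");
	return work;
}

int main(void)
{
	unsigned long work = WORK_RESCHED;

	do {
		if (work & WORK_RESCHED)
			puts("schedule()");
		if (work & WORK_NOTIFY)
			puts("resume_user_mode_work() -> rseq slow path");
		work = 0;			/* all TIF bits handled */
		work = model_rseq_step(work);
	} while (work & WORK_ALL);
	puts("exit to user");
	return 0;
}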
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161655.127882103@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 29/37] entry: Split up exit_to_user_mode_prepare()
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:19 +0200 (CEST)

exit_to_user_mode_prepare() is used for both interrupts and syscalls, but
there is extra rseq work, which is only required in the interrupt exit
case.

Split up the function and provide wrappers for syscalls and interrupts,
which allows separating the rseq exit work in the next step.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/entry-common.h     |  2 +-
 include/linux/irq-entry-common.h | 42 ++++++++++++++++++++++++++++++++++-----
 2 files changed, 38 insertions(+), 6 deletions(-)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -156,7 +156,7 @@ static __always_inline void syscall_exit
	if (unlikely(work & SYSCALL_WORK_EXIT))
		syscall_exit_work(regs, work);
	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
+	syscall_exit_to_user_mode_prepare(regs);
 }
 
 /**
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -201,7 +201,7 @@ void arch_do_signal_or_restart(struct pt
 unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
 
 /**
- * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs: Pointer to pt_regs on entry stack
  *
  * 1) check that interrupts are disabled
@@ -209,8 +209,10 @@ unsigned long exit_to_user_mode_loop(str
  * 3) call exit_to_user_mode_loop() if any flags from
  *    EXIT_TO_USER_MODE_WORK are set
  * 4) check that interrupts are still disabled
+ *
+ * Don't invoke directly, use the syscall/irqentry_ prefixed variants below
  */
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
 {
	unsigned long ti_work;
 
@@ -224,15 +226,45 @@ static __always_inline void exit_to_user
	ti_work = exit_to_user_mode_loop(regs, ti_work);
 
	arch_exit_to_user_mode_prepare(regs, ti_work);
+}
 
-	rseq_exit_to_user_mode();
-
+static __always_inline void __exit_to_user_mode_validate(void)
+{
	/* Ensure that kernel state is sane for a return to userspace */
	kmap_assert_nomap();
	lockdep_assert_irqs_disabled();
	lockdep_sys_exit();
 }
 
+
+/**
+ * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs: Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	__exit_to_user_mode_prepare(regs);
+	rseq_exit_to_user_mode();
+	__exit_to_user_mode_validate();
+}
+
+/**
+ * irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs: Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	__exit_to_user_mode_prepare(regs);
+	rseq_exit_to_user_mode();
+	__exit_to_user_mode_validate();
+}
+
 /**
  * exit_to_user_mode - Fixup state when exiting to user mode
  *
@@ -297,7 +329,7 @@ static __always_inline void irqentry_ent
 static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
+	irqentry_exit_to_user_mode_prepare(regs);
	instrumentation_end();
	exit_to_user_mode();
 }
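At this point the two wrappers are intentionally identical: the shared body stays in one __always_inline helper and each call site gets its own thin wrapper, so the next patch can let the paths diverge without touching the shared part. A minimal stand-alone model of the pattern, with made-up function names:

#include <stdio.h>

static inline void common_prepare(void)  { puts("shared exit work"); }
static inline void common_validate(void) { puts("shared sanity checks"); }
static inline void rseq_hook(void)       { puts("rseq hook (diverges later)"); }

static inline void syscall_exit_prepare(void)
{
	common_prepare();
	rseq_hook();
	common_validate();
}

static inline void irqentry_exit_prepare(void)
{
	common_prepare();
	rseq_hook();
	common_validate();
}

int main(void)
{
	syscall_exit_prepare();		/* syscall return path */
	irqentry_exit_prepare();	/* interrupt return path */
	return 0;
}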
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161655.191313426@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 30/37] rseq: Split up rseq_exit_to_user_mode()
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:22 +0200 (CEST)

Separate the interrupt and syscall exit handling. Syscall exit does not
require clearing the user_irq bit as it cannot be set. On interrupt exit
it can be set when the interrupt did not result in a scheduling event and
therefore the return path did not invoke the TIF work handling, which
would have cleared it.

The debug check for the event state is also not really required even when
debug mode is enabled via the static key. Debug mode largely aids user
space by enabling a larger set of validation checks, which cause a
segfault when a malformed critical section is detected. In production mode
the critical section handling takes the content mostly as is and lets user
space keep the pieces when it screwed up.

For kernel changes in that area the state check is useful, but that can be
done when lockdep is enabled, which is anyway a required test scenario for
fundamental changes.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/irq-entry-common.h |  4 ++--
 include/linux/rseq_entry.h       | 21 +++++++++++++++++----
 2 files changed, 19 insertions(+), 6 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -247,7 +247,7 @@ static __always_inline void __exit_to_us
 static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
	__exit_to_user_mode_prepare(regs);
-	rseq_exit_to_user_mode();
+	rseq_syscall_exit_to_user_mode();
	__exit_to_user_mode_validate();
 }
 
@@ -261,7 +261,7 @@ static __always_inline void syscall_exit
 static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
	__exit_to_user_mode_prepare(regs);
-	rseq_exit_to_user_mode();
+	rseq_irqentry_exit_to_user_mode();
	__exit_to_user_mode_validate();
 }
 
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -519,19 +519,31 @@ rseq_exit_to_user_mode_work(struct pt_re
 
 #endif /* !CONFIG_GENERIC_ENTRY */
 
-static __always_inline void rseq_exit_to_user_mode(void)
+static __always_inline void rseq_syscall_exit_to_user_mode(void)
 {
	struct rseq_event *ev = &current->rseq_event;
 
	rseq_stat_inc(rseq_stats.exit);
 
-	if (static_branch_unlikely(&rseq_debug_enabled))
+	/* Needed to remove the store for the !lockdep case */
+	if (IS_ENABLED(CONFIG_LOCKDEP)) {
		WARN_ON_ONCE(ev->sched_switch);
+		ev->events = 0;
+	}
+}
+
+static __always_inline void rseq_irqentry_exit_to_user_mode(void)
+{
+	struct rseq_event *ev = &current->rseq_event;
+
+	rseq_stat_inc(rseq_stats.exit);
+
+	lockdep_assert_once(!ev->sched_switch);
 
	/*
	 * Ensure that event (especially user_irq) is cleared when the
	 * interrupt did not result in a schedule and therefore the
-	 * rseq processing did not clear it.
+	 * rseq processing could not clear it.
	 */
	ev->events = 0;
 }
@@ -551,7 +563,8 @@ static inline unsigned long rseq_exit_to
	return ti_work;
 }
 static inline void rseq_note_user_irq_entry(void) { }
-static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_syscall_exit_to_user_mode(void) { }
+static inline void rseq_irqentry_exit_to_user_mode(void) { }
 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
 #endif /* !CONFIG_RSEQ */
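The IS_ENABLED(CONFIG_LOCKDEP) guard in the syscall variant is what removes the store entirely in non-lockdep builds: the condition is a compile-time constant, so the compiler eliminates the whole block as dead code. A stand-alone model of the mechanism (local macro names, not the kernel's config system):

#include <stdio.h>

#define MY_IS_ENABLED(x)	(x)	/* kernel's IS_ENABLED() also reduces to 0/1 */
#define CONFIG_LOCKDEP_MODEL	0	/* flip to 1 to model a debug build */

struct ev_model { unsigned int events; };

static void syscall_exit_clear(struct ev_model *ev)
{
	if (MY_IS_ENABLED(CONFIG_LOCKDEP_MODEL)) {
		/* Dead code when the option is off: the store is compiled out */
		ev->events = 0;
	}
}

int main(void)
{
	struct ev_model ev = { .events = 42 };

	syscall_exit_clear(&ev);
	printf("events after syscall exit: %u\n", ev.events);
	return 0;
}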
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161655.256689417@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Arnd Bergmann, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 31/37] asm-generic: Provide generic TIF infrastructure
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:24 +0200 (CEST)

Common TIF bits do not have to be defined by every architecture. They can
be defined in a generic header.

That allows adding generic TIF bits without chasing a gazillion
architecture headers, which is again an unjustified burden on anyone who
works on generic infrastructure, as it always needs a boat load of work to
keep existing architecture code working when adding new stuff.

While it is not as horrible as the ignorance of the generic entry
infrastructure, it is a welcome mechanism to make architecture people
rethink their approach of just leaching generic improvements into
architecture code and thereby making it increasingly harder to maintain
and improve generic code. It's about time that this changes.

Provide the infrastructure and split the TIF space in half, 16 generic and
16 architecture specific bits. This could probably be extended by
TIF_SINGLESTEP and BLOCKSTEP, but those are only used in architecture
specific code. So leave them alone for now.

Signed-off-by: Thomas Gleixner
Cc: Arnd Bergmann
Acked-by: Arnd Bergmann
Reviewed-by: Mathieu Desnoyers
---
 arch/Kconfig                          |  4 ++
 include/asm-generic/thread_info_tif.h | 48 +++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1730,6 +1730,10 @@ config ARCH_VMLINUX_NEEDS_RELOCS
	  relocations preserved. This is used by some architectures to
	  construct bespoke relocation tables for KASLR.
 
+# Select if architecture uses the common generic TIF bits
+config HAVE_GENERIC_TIF_BITS
+	bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
--- /dev/null
+++ b/include/asm-generic/thread_info_tif.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_THREAD_INFO_TIF_H_
+#define _ASM_GENERIC_THREAD_INFO_TIF_H_
+
+#include <linux/bits.h>
+
+/* Bits 16-31 are reserved for architecture specific purposes */
+
+#define TIF_NOTIFY_RESUME	0	// callback before returning to user
+#define _TIF_NOTIFY_RESUME	BIT(TIF_NOTIFY_RESUME)
+
+#define TIF_SIGPENDING		1	// signal pending
+#define _TIF_SIGPENDING		BIT(TIF_SIGPENDING)
+
+#define TIF_NOTIFY_SIGNAL	2	// signal notifications exist
+#define _TIF_NOTIFY_SIGNAL	BIT(TIF_NOTIFY_SIGNAL)
+
+#define TIF_MEMDIE		3	// is terminating due to OOM killer
+#define _TIF_MEMDIE		BIT(TIF_MEMDIE)
+
+#define TIF_NEED_RESCHED	4	// rescheduling necessary
+#define _TIF_NEED_RESCHED	BIT(TIF_NEED_RESCHED)
+
+#ifdef HAVE_TIF_NEED_RESCHED_LAZY
+# define TIF_NEED_RESCHED_LAZY	5	// Lazy rescheduling needed
+# define _TIF_NEED_RESCHED_LAZY	BIT(TIF_NEED_RESCHED_LAZY)
+#endif
+
+#ifdef HAVE_TIF_POLLING_NRFLAG
+# define TIF_POLLING_NRFLAG	6	// idle is polling for TIF_NEED_RESCHED
+# define _TIF_POLLING_NRFLAG	BIT(TIF_POLLING_NRFLAG)
+#endif
+
+#define TIF_USER_RETURN_NOTIFY	7	// notify kernel of userspace return
+#define _TIF_USER_RETURN_NOTIFY	BIT(TIF_USER_RETURN_NOTIFY)
+
+#define TIF_UPROBE		8	// breakpointed or singlestepping
+#define _TIF_UPROBE		BIT(TIF_UPROBE)
+
+#define TIF_PATCH_PENDING	9	// pending live patching update
+#define _TIF_PATCH_PENDING	BIT(TIF_PATCH_PENDING)
+
+#ifdef HAVE_TIF_RESTORE_SIGMASK
+# define TIF_RESTORE_SIGMASK	10	// Restore signal mask in do_signal()
+# define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
+#endif
+
+#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
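The opt-in pattern the following patches repeat for each architecture is: select HAVE_GENERIC_TIF_BITS in Kconfig, announce the optional generic bits via HAVE_TIF_* defines, include the generic header, then place private bits at 16 and above. A hedged sketch for a made-up architecture ("myarch" and TIF_MYARCH_FOO are illustrative, not from the series):

/* arch/myarch/include/asm/thread_info.h (illustrative excerpt) */

/* Tell the generic TIF infrastructure which optional bits myarch supports */
#define HAVE_TIF_NEED_RESCHED_LAZY	/* this arch implements lazy resched */

#include <asm-generic/thread_info_tif.h>

/* Architecture private TIF space: bits 16-31 */
#define TIF_MYARCH_FOO		16	/* some arch specific state */
#define _TIF_MYARCH_FOO		BIT(TIF_MYARCH_FOO)

The corresponding Kconfig entry would add "select HAVE_GENERIC_TIF_BITS", exactly as the x86, s390, loongarch and riscv conversions below do.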
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161655.319791141@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, x86@kernel.org, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 32/37] x86: Use generic TIF bits
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:27 +0200 (CEST)

No point in defining the generic items locally. The upcoming RSEQ
optimizations are only available with this _and_ the generic entry
infrastructure, which x86 already uses, so no further action is required
here.
Signed-off-by: Thomas Gleixner
Cc: x86@kernel.org
Reviewed-by: Mathieu Desnoyers
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/thread_info.h | 74 +++++++++++++++----------------
 2 files changed, 31 insertions(+), 44 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -239,6 +239,7 @@ config X86
	select HAVE_EFFICIENT_UNALIGNED_ACCESS
	select HAVE_EISA			if X86_32
	select HAVE_EXIT_THREAD
+	select HAVE_GENERIC_TIF_BITS
	select HAVE_GUP_FAST
	select HAVE_FENTRY			if X86_64 || DYNAMIC_FTRACE
	select HAVE_FTRACE_GRAPH_FUNC		if HAVE_FUNCTION_GRAPH_TRACER
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -80,56 +80,42 @@ struct thread_info {
 #endif
 
 /*
- * thread information flags
- * - these are process state flags that various assembly files
- *   may need to access
+ * Tell the generic TIF infrastructure which bits x86 supports
  */
-#define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
-#define TIF_SIGPENDING		2	/* signal pending */
-#define TIF_NEED_RESCHED	3	/* rescheduling necessary */
-#define TIF_NEED_RESCHED_LAZY	4	/* Lazy rescheduling needed */
-#define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
-#define TIF_SSBD		6	/* Speculative store bypass disable */
-#define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
-#define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
-#define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
-#define TIF_UPROBE		12	/* breakpointed or singlestepping */
-#define TIF_PATCH_PENDING	13	/* pending live patching update */
-#define TIF_NEED_FPU_LOAD	14	/* load FPU on return to userspace */
-#define TIF_NOCPUID		15	/* CPUID is not accessible in userland */
-#define TIF_NOTSC		16	/* TSC is not accessible in userland */
-#define TIF_NOTIFY_SIGNAL	17	/* signal notifications exist */
-#define TIF_MEMDIE		20	/* is terminating due to OOM killer */
-#define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
+#define HAVE_TIF_NEED_RESCHED_LAZY
+#define HAVE_TIF_POLLING_NRFLAG
+#define HAVE_TIF_SINGLESTEP
+
+#include <asm-generic/thread_info_tif.h>
+
+/* Architecture specific TIF space starts at 16 */
+#define TIF_SSBD		16	/* Speculative store bypass disable */
+#define TIF_SPEC_IB		17	/* Indirect branch speculation mitigation */
+#define TIF_SPEC_L1D_FLUSH	18	/* Flush L1D on mm switches (processes) */
+#define TIF_NEED_FPU_LOAD	19	/* load FPU on return to userspace */
+#define TIF_NOCPUID		20	/* CPUID is not accessible in userland */
+#define TIF_NOTSC		21	/* TSC is not accessible in userland */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
 #define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
-#define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
+#define TIF_SINGLESTEP		25	/* reenable singlestep on user return*/
+#define TIF_BLOCKSTEP		26	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
-#define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
+#define TIF_ADDR32		28	/* 32-bit address space on 64 bits */
 
-#define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
-#define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
-#define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
-#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
-#define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
-#define _TIF_SSBD		(1 << TIF_SSBD)
-#define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
-#define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
-#define
 _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
-#define _TIF_UPROBE		(1 << TIF_UPROBE)
-#define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
-#define _TIF_NEED_FPU_LOAD	(1 << TIF_NEED_FPU_LOAD)
-#define _TIF_NOCPUID		(1 << TIF_NOCPUID)
-#define _TIF_NOTSC		(1 << TIF_NOTSC)
-#define _TIF_NOTIFY_SIGNAL	(1 << TIF_NOTIFY_SIGNAL)
-#define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
-#define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
-#define _TIF_SPEC_FORCE_UPDATE	(1 << TIF_SPEC_FORCE_UPDATE)
-#define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
-#define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
-#define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
-#define _TIF_ADDR32		(1 << TIF_ADDR32)
+#define _TIF_SSBD		BIT(TIF_SSBD)
+#define _TIF_SPEC_IB		BIT(TIF_SPEC_IB)
+#define _TIF_SPEC_L1D_FLUSH	BIT(TIF_SPEC_L1D_FLUSH)
+#define _TIF_NEED_FPU_LOAD	BIT(TIF_NEED_FPU_LOAD)
+#define _TIF_NOCPUID		BIT(TIF_NOCPUID)
+#define _TIF_NOTSC		BIT(TIF_NOTSC)
+#define _TIF_IO_BITMAP		BIT(TIF_IO_BITMAP)
+#define _TIF_SPEC_FORCE_UPDATE	BIT(TIF_SPEC_FORCE_UPDATE)
+#define _TIF_FORCED_TF		BIT(TIF_FORCED_TF)
+#define _TIF_BLOCKSTEP		BIT(TIF_BLOCKSTEP)
+#define _TIF_SINGLESTEP		BIT(TIF_SINGLESTEP)
+#define _TIF_LAZY_MMU_UPDATES	BIT(TIF_LAZY_MMU_UPDATES)
+#define _TIF_ADDR32		BIT(TIF_ADDR32)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
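The split at bit 16 is a convention rather than something the generic header enforces. A hedged sketch of compile-time checks an architecture could add to catch accidental overlap (the constants below are stand-ins; only the technique is the point):

#include <assert.h>	/* static_assert, C11 */

#define TIF_GENERIC_LAST	10	/* highest generic bit handed out at this patch */
#define TIF_ARCH_FIRST		16	/* first arch private bit per the convention */
#define TIF_EXAMPLE_ARCH_BIT	16	/* stand-in for e.g. TIF_SSBD */

static_assert(TIF_GENERIC_LAST < TIF_ARCH_FIRST,
	      "generic TIF bits spill into the architecture range");
static_assert(TIF_EXAMPLE_ARCH_BIT >= TIF_ARCH_FIRST,
	      "architecture TIF bit placed inside the generic range");

int main(void) { return 0; }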
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161655.383980378@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 33/37] s390: Use generic TIF bits
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:30 +0200 (CEST)

No point in defining the generic items locally. The upcoming RSEQ
optimizations are only available with this _and_ the generic entry
infrastructure, which s390 already uses, so no further action is required
here.

This leaves a comment about the AUDIT/TRACE/SECCOMP bits, which are
handled by SYSCALL_WORK in the generic code and therefore seem redundant,
but that's a problem for the s390 wizards to think about.
Signed-off-by: Thomas Gleixner
Cc: Heiko Carstens
Cc: Christian Borntraeger
Cc: Sven Schnelle
---
 arch/s390/Kconfig                   |  1 +
 arch/s390/include/asm/thread_info.h | 44 ++++++++++++++-----------------
 2 files changed, 19 insertions(+), 26 deletions(-)

--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -199,6 +199,7 @@ config S390
	select HAVE_DYNAMIC_FTRACE_WITH_REGS
	select HAVE_EBPF_JIT if HAVE_MARCH_Z196_FEATURES
	select HAVE_EFFICIENT_UNALIGNED_ACCESS
+	select HAVE_GENERIC_TIF_BITS
	select HAVE_GUP_FAST
	select HAVE_FENTRY
	select HAVE_FTRACE_GRAPH_FUNC
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -56,43 +56,35 @@ void arch_setup_new_exec(void);
 
 /*
  * thread information flags bit numbers
+ *
+ * Tell the generic TIF infrastructure which special bits s390 supports
  */
-#define TIF_NOTIFY_RESUME	0	/* callback before returning to user */
-#define TIF_SIGPENDING		1	/* signal pending */
-#define TIF_NEED_RESCHED	2	/* rescheduling necessary */
-#define TIF_NEED_RESCHED_LAZY	3	/* lazy rescheduling needed */
-#define TIF_UPROBE		4	/* breakpointed or single-stepping */
-#define TIF_PATCH_PENDING	5	/* pending live patching update */
-#define TIF_ASCE_PRIMARY	6	/* primary asce is kernel asce */
-#define TIF_NOTIFY_SIGNAL	7	/* signal notifications exist */
-#define TIF_GUARDED_STORAGE	8	/* load guarded storage control block */
-#define TIF_ISOLATE_BP_GUEST	9	/* Run KVM guests with isolated BP */
-#define TIF_PER_TRAP		10	/* Need to handle PER trap on exit to usermode */
-#define TIF_31BIT		16	/* 32bit process */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
-#define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal() */
-#define TIF_SINGLE_STEP		19	/* This task is single stepped */
-#define TIF_BLOCK_STEP		20	/* This task is block stepped */
-#define TIF_UPROBE_SINGLESTEP	21	/* This task is uprobe single stepped */
+#define HAVE_TIF_NEED_RESCHED_LAZY
+#define HAVE_TIF_RESTORE_SIGMASK
+
+#include <asm-generic/thread_info_tif.h>
+
+/* Architecture specific bits */
+#define TIF_ASCE_PRIMARY	16	/* primary asce is kernel asce */
+#define TIF_GUARDED_STORAGE	17	/* load guarded storage control block */
+#define TIF_ISOLATE_BP_GUEST	18	/* Run KVM guests with isolated BP */
+#define TIF_PER_TRAP		19	/* Need to handle PER trap on exit to usermode */
+#define TIF_31BIT		20	/* 32bit process */
+#define TIF_SINGLE_STEP		21	/* This task is single stepped */
+#define TIF_BLOCK_STEP		22	/* This task is block stepped */
+#define TIF_UPROBE_SINGLESTEP	23	/* This task is uprobe single stepped */
+
+/* These could move over to SYSCALL_WORK bits, no? */
 #define TIF_SYSCALL_TRACE	24	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	25	/* syscall auditing active */
 #define TIF_SECCOMP		26	/* secure computing */
 #define TIF_SYSCALL_TRACEPOINT	27	/* syscall tracepoint instrumentation */
 
-#define _TIF_NOTIFY_RESUME	BIT(TIF_NOTIFY_RESUME)
-#define _TIF_SIGPENDING		BIT(TIF_SIGPENDING)
-#define _TIF_NEED_RESCHED	BIT(TIF_NEED_RESCHED)
-#define _TIF_NEED_RESCHED_LAZY	BIT(TIF_NEED_RESCHED_LAZY)
-#define _TIF_UPROBE		BIT(TIF_UPROBE)
-#define _TIF_PATCH_PENDING	BIT(TIF_PATCH_PENDING)
 #define _TIF_ASCE_PRIMARY	BIT(TIF_ASCE_PRIMARY)
-#define _TIF_NOTIFY_SIGNAL	BIT(TIF_NOTIFY_SIGNAL)
 #define _TIF_GUARDED_STORAGE	BIT(TIF_GUARDED_STORAGE)
 #define _TIF_ISOLATE_BP_GUEST	BIT(TIF_ISOLATE_BP_GUEST)
 #define _TIF_PER_TRAP		BIT(TIF_PER_TRAP)
 #define _TIF_31BIT		BIT(TIF_31BIT)
-#define _TIF_MEMDIE		BIT(TIF_MEMDIE)
-#define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
 #define _TIF_SINGLE_STEP	BIT(TIF_SINGLE_STEP)
 #define _TIF_BLOCK_STEP		BIT(TIF_BLOCK_STEP)
 #define _TIF_UPROBE_SINGLESTEP	BIT(TIF_UPROBE_SINGLESTEP)
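The "could move over to SYSCALL_WORK" remark refers to the generic entry idea of keeping syscall-only state in a separate work word that is tested only on the syscall path instead of burning TIF bits. A plain C model of that point (local names, not the kernel's SYSCALL_WORK API):

#include <stdio.h>

#define SYSCALL_WORK_TRACE	0x1UL
#define SYSCALL_WORK_SECCOMP	0x2UL

struct task_model {
	unsigned long tif_work;		/* checked on every exit to user */
	unsigned long sys_work;		/* checked only on syscall entry/exit */
};

static void syscall_entry(struct task_model *t)
{
	if (t->sys_work & SYSCALL_WORK_SECCOMP)
		puts("seccomp filter evaluated");
	if (t->sys_work & SYSCALL_WORK_TRACE)
		puts("syscall tracer notified");
}

int main(void)
{
	struct task_model t = { .sys_work = SYSCALL_WORK_SECCOMP };

	syscall_entry(&t);	/* interrupt exits never look at sys_work */
	return 0;
}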
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161655.449079233@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Huacai Chen, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 34/37] loongarch: Use generic TIF bits
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:33 +0200 (CEST)

No point in defining the generic items locally. The upcoming RSEQ
optimizations are only available with this _and_ the generic entry
infrastructure, which loongarch already uses, so no further action is
required here.

Signed-off-by: Thomas Gleixner
Cc: Huacai Chen
---
 arch/loongarch/Kconfig                   |  1 +
 arch/loongarch/include/asm/thread_info.h | 76 +++++++++++++--------------
 2 files changed, 35 insertions(+), 42 deletions(-)

--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -140,6 +140,7 @@ config LOONGARCH
	select HAVE_EBPF_JIT
	select HAVE_EFFICIENT_UNALIGNED_ACCESS if !ARCH_STRICT_ALIGN
	select HAVE_EXIT_THREAD
+	select HAVE_GENERIC_TIF_BITS
	select HAVE_GUP_FAST
	select HAVE_FTRACE_GRAPH_FUNC
	select HAVE_FUNCTION_ARG_ACCESS_API
--- a/arch/loongarch/include/asm/thread_info.h
+++ b/arch/loongarch/include/asm/thread_info.h
@@ -65,50 +65,42 @@ register unsigned long current_stack_poi
  *   access
  * - pending work-to-be-done flags are in LSW
  * - other flags in MSW
+ *
+ * Tell the generic TIF infrastructure which special bits loongarch supports
  */
-#define TIF_NEED_RESCHED	0	/* rescheduling necessary */
-#define TIF_NEED_RESCHED_LAZY	1	/* lazy rescheduling necessary */
-#define TIF_SIGPENDING		2	/* signal pending */
-#define TIF_NOTIFY_RESUME	3	/* callback before returning to user */
-#define TIF_NOTIFY_SIGNAL	4	/* signal notifications exist */
-#define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
-#define TIF_NOHZ		6	/* in adaptive nohz mode */
-#define TIF_UPROBE		7	/* breakpointed or singlestepping */
-#define TIF_USEDFPU		8	/* FPU was used by this task this quantum (SMP) */
-#define TIF_USEDSIMD		9	/* SIMD has been used this quantum */
-#define TIF_MEMDIE		10	/* is terminating due to OOM killer */
-#define TIF_FIXADE		11	/* Fix address errors in software */
-#define TIF_LOGADE		12	/* Log address errors to syslog */
-#define TIF_32BIT_REGS		13	/* 32-bit general purpose registers */
-#define TIF_32BIT_ADDR		14	/* 32-bit address space */
-#define TIF_LOAD_WATCH		15	/* If set, load watch registers */
-#define TIF_SINGLESTEP		16	/* Single Step */
-#define TIF_LSX_CTX_LIVE	17	/* LSX context must be preserved */
-#define TIF_LASX_CTX_LIVE	18	/* LASX context must be preserved */
-#define TIF_USEDLBT		19	/* LBT was used by this task this quantum (SMP) */
-#define TIF_LBT_CTX_LIVE	20	/* LBT context must be preserved */
-#define TIF_PATCH_PENDING	21	/* pending live patching update */
+#define HAVE_TIF_NEED_RESCHED_LAZY
+#define HAVE_TIF_RESTORE_SIGMASK
 
-#define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
+#include <asm-generic/thread_info_tif.h>
+
+/* Architecture specific bits */
+#define TIF_NOHZ		16	/* in adaptive nohz mode */
+#define TIF_USEDFPU		17	/* FPU was used by this task this quantum (SMP) */
+#define TIF_USEDSIMD		18	/* SIMD has been used this quantum */
+#define TIF_FIXADE		19	/* Fix address errors in software */
+#define TIF_LOGADE		20	/* Log address errors to syslog */
+#define TIF_32BIT_REGS		21	/* 32-bit general purpose registers */
+#define TIF_32BIT_ADDR		22	/* 32-bit address space */
+#define TIF_LOAD_WATCH		23	/* If set, load watch registers */
+#define TIF_SINGLESTEP		24	/* Single Step */
+#define TIF_LSX_CTX_LIVE	25	/* LSX context must be preserved */
+#define TIF_LASX_CTX_LIVE	26	/* LASX context must be preserved */
+#define TIF_USEDLBT		27	/* LBT was used by this task this quantum (SMP) */
+#define TIF_LBT_CTX_LIVE	28	/* LBT context must be preserved */
+
+#define _TIF_NOHZ		BIT(TIF_NOHZ)
+#define _TIF_USEDFPU		BIT(TIF_USEDFPU)
+#define _TIF_USEDSIMD		BIT(TIF_USEDSIMD)
+#define _TIF_FIXADE		BIT(TIF_FIXADE)
+#define _TIF_LOGADE		BIT(TIF_LOGADE)
+#define _TIF_32BIT_REGS		BIT(TIF_32BIT_REGS)
+#define _TIF_32BIT_ADDR		BIT(TIF_32BIT_ADDR)
+#define _TIF_LOAD_WATCH		BIT(TIF_LOAD_WATCH)
+#define _TIF_SINGLESTEP		BIT(TIF_SINGLESTEP)
+#define _TIF_LSX_CTX_LIVE	BIT(TIF_LSX_CTX_LIVE)
+#define _TIF_LASX_CTX_LIVE	BIT(TIF_LASX_CTX_LIVE)
+#define _TIF_USEDLBT		BIT(TIF_USEDLBT)
+#define _TIF_LBT_CTX_LIVE	BIT(TIF_LBT_CTX_LIVE)
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_THREAD_INFO_H */
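A side note on the (1 << x) to BIT(x) conversions these diffs perform: BIT(x) expands to 1UL << x, so bits at position 31 and above do not run into signed int overflow. A small self-contained demonstration with a local BIT() matching the kernel definition:

#include <stdio.h>

#define BIT(nr)	(1UL << (nr))	/* same shape as the kernel's BIT() */

int main(void)
{
	/* (1 << 31) would be signed overflow territory on 32-bit int */
	unsigned long high = BIT(31);

	printf("BIT(31) = %#lx\n", high);
	return 0;
}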
From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161655.514963233@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Paul Walmsley, Palmer Dabbelt, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen
Subject: [patch V2 35/37] riscv: Use generic TIF bits
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:36 +0200 (CEST)

No point in defining the generic items locally. The upcoming RSEQ
optimizations are only available with this _and_ the generic entry
infrastructure, which RISC-V already uses, so no further action is
required here.
Signed-off-by: Thomas Gleixner
Cc: Paul Walmsley
Cc: Palmer Dabbelt
---
 arch/riscv/Kconfig                   |  1 +
 arch/riscv/include/asm/thread_info.h | 29 ++++++++++++-----------------
 2 files changed, 13 insertions(+), 17 deletions(-)

--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -161,6 +161,7 @@ config RISCV
	select HAVE_FUNCTION_GRAPH_FREGS
	select HAVE_FUNCTION_TRACER if !XIP_KERNEL && HAVE_DYNAMIC_FTRACE
	select HAVE_EBPF_JIT if MMU
+	select HAVE_GENERIC_TIF_BITS
	select HAVE_GUP_FAST if MMU
	select HAVE_FUNCTION_ARG_ACCESS_API
	select HAVE_FUNCTION_ERROR_INJECTION
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -107,23 +107,18 @@ int arch_dup_task_struct(struct task_str
  * - pending work-to-be-done flags are in lowest half-word
  * - other flags in upper half-word(s)
  */
-#define TIF_NEED_RESCHED	0	/* rescheduling necessary */
-#define TIF_NEED_RESCHED_LAZY	1	/* Lazy rescheduling needed */
-#define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
-#define TIF_SIGPENDING		3	/* signal pending */
-#define TIF_RESTORE_SIGMASK	4	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
-#define TIF_NOTIFY_SIGNAL	9	/* signal notifications exist */
-#define TIF_UPROBE		10	/* uprobe breakpoint or singlestep */
-#define TIF_32BIT		11	/* compat-mode 32bit process */
-#define TIF_RISCV_V_DEFER_RESTORE	12	/* restore Vector before returning to user */
 
-#define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
-#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
-#define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
-#define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
-#define _TIF_NOTIFY_SIGNAL	(1 << TIF_NOTIFY_SIGNAL)
-#define _TIF_UPROBE		(1 << TIF_UPROBE)
-#define _TIF_RISCV_V_DEFER_RESTORE	(1 << TIF_RISCV_V_DEFER_RESTORE)
+/*
+ * Tell the generic TIF infrastructure which bits riscv supports
+ */
+#define HAVE_TIF_NEED_RESCHED_LAZY
+#define HAVE_TIF_RESTORE_SIGMASK
+
+#include <asm-generic/thread_info_tif.h>
+
+#define TIF_32BIT		16	/* compat-mode 32bit process */
+#define TIF_RISCV_V_DEFER_RESTORE	17	/* restore Vector before returning to user */
+
+#define _TIF_RISCV_V_DEFER_RESTORE	BIT(TIF_RISCV_V_DEFER_RESTORE)
 
 #endif /* _ASM_RISCV_THREAD_INFO_H */
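The HAVE_TIF_* opt-ins used in these conversions are plain preprocessor switches: a bit simply does not exist unless the architecture announces it. A minimal stand-alone model of the mechanism, independent of the kernel headers (the #else fallback to 0UL is a common idiom shown here for illustration, not something the generic header does):

#include <stdio.h>

#define BIT(nr)	(1UL << (nr))

/* #define HAVE_TIF_POLLING_NRFLAG */	/* uncomment to opt in */

#ifdef HAVE_TIF_POLLING_NRFLAG
# define TIF_POLLING_NRFLAG	6
# define _TIF_POLLING_NRFLAG	BIT(TIF_POLLING_NRFLAG)
#else
# define _TIF_POLLING_NRFLAG	0UL	/* bit absent: masks still compose */
#endif

int main(void)
{
	printf("_TIF_POLLING_NRFLAG = %#lx\n", _TIF_POLLING_NRFLAG);
	return 0;
}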
Message-ID: <20250823161655.586695263@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney",
 Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui,
 x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
 Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:39 +0200 (CEST)

TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal,
especially as the RSEQ fast path depends on it without really handling
it.

Define a separate TIF_RSEQ in the generic TIF space and enable the full
separation of fast and slow path for architectures which utilize it.

That avoids the hassle with invocations of resume_user_mode_work() from
hypervisors, which clear TIF_NOTIFY_RESUME. It makes the re-evaluation
required at the end of vcpu_run() a NOOP on architectures which utilize
the generic TIF space and have a separate TIF_RSEQ.

The hypervisor TIF handling does not include the separate TIF_RSEQ as
there is no point in doing so. The guest neither knows nor cares about
the VMM host application's RSEQ state. That state only becomes relevant
when the ioctl() returns to user space.

The fast path implementation still utilizes TIF_NOTIFY_RESUME for
failure handling, but that only happens within exit_to_user_mode_loop(),
so arguably the hypervisor ioctl() code is long done at that point.

This allows further optimizations for blocking-syscall-heavy workloads
in a subsequent step.
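To make the dual configuration concrete, here is a small userspace
model of the exit gate this patch introduces. Only the names
CHECK_TIF_RSEQ and _TIF_RSEQ and the comparison itself come from the
patch below; the bit values are hypothetical stand-ins:

	#include <stdio.h>

	/* Hypothetical stand-in bit values -- illustration only */
	#define _TIF_SIGPENDING		(1UL << 3)
	#define _TIF_RSEQ		(1UL << 11)

	/* With a separate TIF_RSEQ (generic TIF bits) */
	#define CHECK_TIF_RSEQ		_TIF_RSEQ

	/*
	 * Models the gate in rseq_exit_to_user_mode_work(): the rseq user
	 * space access runs only when TIF_RSEQ is the sole remaining work
	 * bit. On legacy architectures CHECK_TIF_RSEQ is 0UL, so the gate
	 * degenerates to "all work bits clear".
	 */
	static int rseq_may_run(unsigned long ti_work, unsigned long mask)
	{
		return (ti_work & mask) == CHECK_TIF_RSEQ;
	}

	int main(void)
	{
		unsigned long mask = _TIF_SIGPENDING | _TIF_RSEQ;

		/* 1: only TIF_RSEQ pending -> handle rseq, clear the bit */
		printf("%d\n", rseq_may_run(_TIF_RSEQ, mask));
		/* 0: other work pending -> another round through the loop */
		printf("%d\n", rseq_may_run(_TIF_RSEQ | _TIF_SIGPENDING, mask));
		return 0;
	}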
Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/asm-generic/thread_info_tif.h |    3 +++
 include/linux/irq-entry-common.h      |    2 +-
 include/linux/rseq.h                  |   13 ++++++++++---
 include/linux/rseq_entry.h            |   23 +++++++++++++++++++----
 include/linux/thread_info.h           |    5 +++++
 5 files changed, 38 insertions(+), 8 deletions(-)

--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -45,4 +45,7 @@
 # define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
 #endif
 
+#define TIF_RSEQ		11	// Run RSEQ fast path
+#define _TIF_RSEQ		BIT(TIF_RSEQ)
+
 #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -30,7 +30,7 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY |			\
-	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |			\
+	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ |		\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -40,7 +40,7 @@ static inline void rseq_signal_deliver(s
 
 static inline void rseq_raise_notify_resume(struct task_struct *t)
 {
-	set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	set_tsk_thread_flag(t, TIF_RSEQ);
 }
 
 /* Invoked from context switch to force evaluation on exit to user */
@@ -122,7 +122,7 @@ static inline void rseq_force_update(voi
  */
 static inline void rseq_virt_userspace_exit(void)
 {
-	if (current->rseq_event.sched_switch)
+	if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) && current->rseq_event.sched_switch)
 		rseq_raise_notify_resume(current);
 }
 
@@ -147,9 +147,16 @@ static inline void rseq_fork(struct task
 	/*
 	 * If it has rseq, force it into the slow path right away
 	 * because it is guaranteed to fault.
+	 *
+	 * Setting TIF_NOTIFY_RESUME is redundant but harmless for
+	 * architectures which do not have a separate TIF_RSEQ, but
+	 * for those which do, it's required to enforce the slow path
+	 * as the scheduler sets only TIF_RSEQ.
 	 */
-	if (t->rseq_event.has_rseq)
+	if (t->rseq_event.has_rseq) {
 		t->rseq_event.slowpath = true;
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	}
 	}
 }
 
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -502,18 +502,33 @@ static __always_inline bool __rseq_exit_
 	return true;
 }
 
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+# define CHECK_TIF_RSEQ	_TIF_RSEQ
+static __always_inline void clear_tif_rseq(void)
+{
+	clear_thread_flag(TIF_RSEQ);
+}
+#else
+# define CHECK_TIF_RSEQ	0UL
+static inline void clear_tif_rseq(void) { }
+#endif
+
 static __always_inline unsigned long
 rseq_exit_to_user_mode_work(struct pt_regs *regs, unsigned long ti_work, const unsigned long mask)
 {
 	/*
 	 * Check if all work bits have been cleared before handling rseq.
+	 *
+	 * In case of a separate TIF_RSEQ this checks for all other bits to
+	 * be cleared and TIF_RSEQ to be set.
 	 */
-	if ((ti_work & mask) != 0)
-		return ti_work;
-
-	if (likely(!__rseq_exit_to_user_mode_restart(regs)))
+	if ((ti_work & mask) != CHECK_TIF_RSEQ)
 		return ti_work;
 
+	if (likely(!__rseq_exit_to_user_mode_restart(regs))) {
+		clear_tif_rseq();
+		return ti_work & ~CHECK_TIF_RSEQ;
+	}
 	return ti_work | _TIF_NOTIFY_RESUME;
 }
 
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -67,6 +67,11 @@ enum syscall_work_bit {
 #define _TIF_NEED_RESCHED_LAZY	_TIF_NEED_RESCHED
 #endif
 
+#ifndef TIF_RSEQ
+# define TIF_RSEQ		TIF_NOTIFY_RESUME
+# define _TIF_RSEQ		_TIF_NOTIFY_RESUME
+#endif
+
 #ifdef __KERNEL__
 
 #ifndef arch_set_restart_data

From nobody Fri Oct 3 21:53:25 2025
Message-ID: <20250823161655.651830871@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, "Paul E. McKenney",
 Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui,
 x86@kernel.org, Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
 Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt
Subject: [patch V2 37/37] entry/rseq: Optimize for TIF_RSEQ on exit
References: <20250823161326.635281786@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:42 +0200 (CEST)

Further analysis of the exit path with the separate TIF_RSEQ showed
that, depending on the workload, a significant share of the invocations
of resume_user_mode_work() end up with no other bit set than TIF_RSEQ.

On architectures with a separate TIF_RSEQ this case can be detected and
checked right at the beginning of the function, before entering the
loop. The quick check is lightweight, so it does not impose a massive
penalty on non-RSEQ use cases. It merely checks whether the work is
empty except for TIF_RSEQ, and if so jumps straight into the rseq
handling fast path.

TIF_RSEQ is truly the only TIF bit which can be optimized that way,
because its handling runs only after all other work has been completed.
The optimization spares a full round trip through the other
conditionals and an interrupt enable/disable pair. The generated code
looks reasonable enough to justify this, and the resulting numbers do
so as well.

The main beneficiaries are blocking-syscall-heavy workloads, where the
tasks often end up being scheduled on a different CPU or getting a
different MM CID, but have no other work to handle on return. A futex
benchmark showed up to 90% shortcut utilization and a measurable perf
improvement of ~1%. Non-scheduling workloads see neither an improvement
nor a degradation. A full kernel build shows about 15% shortcut
utilization, but no measurable side effects in either direction.
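The shape of the shortcut is easiest to see in isolation. The following
is a condensed, userspace-compilable model of the control flow the
patch below adds to exit_to_user_mode_loop(); the function names mirror
the patch, while the bit values and the work stub are hypothetical:

	#include <stdbool.h>
	#include <stdio.h>

	/* Hypothetical bit values -- illustration only */
	#define _TIF_SIGPENDING		(1UL << 3)
	#define _TIF_RSEQ		(1UL << 11)
	#define EXIT_TO_USER_MODE_WORK	(_TIF_SIGPENDING | _TIF_RSEQ)

	/* Stand-in for the real rseq fast path handling */
	static unsigned long handle_rseq(unsigned long ti_work)
	{
		return ti_work & ~_TIF_RSEQ;
	}

	/* Mirrors rseq_exit_to_user_mode_early(): true when TIF_RSEQ is
	 * the only pending work bit, so the work loop can be skipped */
	static bool rseq_exit_to_user_mode_early(unsigned long ti_work, unsigned long mask)
	{
		return (ti_work & mask) == _TIF_RSEQ;
	}

	static unsigned long exit_to_user_mode_loop(unsigned long ti_work)
	{
		/* Optimize for TIF_RSEQ being the only bit set */
		if (rseq_exit_to_user_mode_early(ti_work, EXIT_TO_USER_MODE_WORK))
			goto do_rseq;

		do {
			/* Stand-in for signal/resched/uprobe handling */
			ti_work &= ~_TIF_SIGPENDING;
	do_rseq:
			ti_work = handle_rseq(ti_work);
		} while (ti_work & EXIT_TO_USER_MODE_WORK);

		return ti_work;
	}

	int main(void)
	{
		/* Shortcut taken: only TIF_RSEQ pending */
		printf("%#lx\n", exit_to_user_mode_loop(_TIF_RSEQ));
		/* Full loop: signal work pending as well */
		printf("%#lx\n", exit_to_user_mode_loop(_TIF_RSEQ | _TIF_SIGPENDING));
		return 0;
	}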
Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/rseq_entry.h |   14 ++++++++++++++
 kernel/entry/common.c      |   13 +++++++++++--
 kernel/rseq.c              |    2 ++
 3 files changed, 27 insertions(+), 2 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -11,6 +11,7 @@ struct rseq_stats {
 	unsigned long	signal;
 	unsigned long	slowpath;
 	unsigned long	fastpath;
+	unsigned long	quicktif;
 	unsigned long	ids;
 	unsigned long	cs;
 	unsigned long	clear;
@@ -532,6 +533,14 @@ rseq_exit_to_user_mode_work(struct pt_re
 	return ti_work | _TIF_NOTIFY_RESUME;
 }
 
+static __always_inline bool
+rseq_exit_to_user_mode_early(unsigned long ti_work, const unsigned long mask)
+{
+	if (IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS))
+		return (ti_work & mask) == CHECK_TIF_RSEQ;
+	return false;
+}
+
 #endif /* !CONFIG_GENERIC_ENTRY */
 
 static __always_inline void rseq_syscall_exit_to_user_mode(void)
@@ -577,6 +586,11 @@ static inline unsigned long rseq_exit_to
 {
 	return ti_work;
 }
+
+static inline bool rseq_exit_to_user_mode_early(unsigned long ti_work, const unsigned long mask)
+{
+	return false;
+}
 static inline void rseq_note_user_irq_entry(void) { }
 static inline void rseq_syscall_exit_to_user_mode(void) { }
 static inline void rseq_irqentry_exit_to_user_mode(void) { }
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -22,7 +22,14 @@ void __weak arch_do_signal_or_restart(st
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
+	 *
+	 * Optimize for TIF_RSEQ being the only bit set.
 	 */
+	if (rseq_exit_to_user_mode_early(ti_work, EXIT_TO_USER_MODE_WORK)) {
+		rseq_stat_inc(rseq_stats.quicktif);
+		goto do_rseq;
+	}
+
 	do {
 		local_irq_enable_exit_to_user(ti_work);
 
@@ -56,10 +63,12 @@ void __weak arch_do_signal_or_restart(st
 
 		ti_work = read_thread_flags();
 
+	do_rseq:
 		/*
 		 * This returns the unmodified ti_work, when ti_work is not
-		 * empty. In that case it waits for the next round to avoid
-		 * multiple updates in case of rescheduling.
+		 * empty (except for TIF_RSEQ). In that case it waits for
+		 * the next round to avoid multiple updates in case of
+		 * rescheduling.
 		 *
 		 * When it handles rseq it returns either with empty work
 		 * on success or with TIF_NOTIFY_RESUME set on failure to
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -134,6 +134,7 @@ static int rseq_stats_show(struct seq_fi
 	stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
 	stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
 	stats.fastpath += data_race(per_cpu(rseq_stats.fastpath, cpu));
+	stats.quicktif += data_race(per_cpu(rseq_stats.quicktif, cpu));
 	stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
 	stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
 	stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
@@ -144,6 +145,7 @@ static int rseq_stats_show(struct seq_fi
 	seq_printf(m, "signal: %16lu\n", stats.signal);
 	seq_printf(m, "slowp: %16lu\n", stats.slowpath);
 	seq_printf(m, "fastp: %16lu\n", stats.fastpath);
+	seq_printf(m, "quickt: %16lu\n", stats.quicktif);
 	seq_printf(m, "ids: %16lu\n", stats.ids);
 	seq_printf(m, "cs: %16lu\n", stats.cs);
 	seq_printf(m, "clear: %16lu\n", stats.clear);
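With the counter wired up, the stats file rendered by rseq_stats_show()
gains a "quickt" line. On a scheduling-heavy workload the output might
look like the following; the counts are invented for illustration, only
the line format follows the seq_printf() calls above:

	signal:              890
	slowp:              1021
	fastp:            845002
	quickt:           760331
	ids:              846913
	cs:                 1204
	clear:               977

Here quickt/fastp would correspond to the ~90% shortcut utilization
quoted for the futex benchmark in the changelog.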