[v2] rseq: Optimize exit to user space

[patch V2 28/37] rseq: Switch to fast path processing on exit to user

Posted by Thomas Gleixner 1 month, 1 week ago

Now that all bits and pieces are in place, hook the RSEQ handling fast path
function into exit_to_user_mode_prepare() after the TIF work bits have been
handled. In case of fast path failure, TIF_NOTIFY_RESUME has been raised
and the caller needs to take another turn through the TIF handling slow
path.

This only works for architectures which use the generic entry code.
Architectures which still have their own incomplete hacks are not supported
and won't be.

This results in the following improvements:

  Kernel build        Before                After                 Reduction

  exit to user:        80692981              80514451
  signal checks:          32581                   121             99%
  slowpath runs:        1201408    1.49%          198    0.00%    100%
  fastpath runs:                               675941    0.84%    N/A
  id updates:           1233989    1.53%        50541    0.06%    96%
  cs checks:            1125366    1.39%            0    0.00%    100%
    cs cleared:         1125366     100%            0             100%
    cs fixup:                 0       0%            0

  RSEQ selftests      Before                After                 Reduction

  exit to user:       386281778             387373750
  signal checks:       35661203                     0             100%
  slowpath runs:      140542396   36.38%          100    0.00%    100%
  fastpath runs:                              9509789    2.51%    N/A
  id updates:         176203599   45.62%      9087994    2.35%    95%
  cs checks:          175587856   45.46%      4728394    1.22%    98%
    cs cleared:       172359544   98.16%      1319307   27.90%    99%
    cs fixup:           3228312    1.84%      3409087   72.10%

The 'cs cleared' and 'cs fixup' percentages are not relative to the exit
to user invocations, they are relative to the actual 'cs check'
invocations.

While some of this could have been avoided in the original code, like the
obvious clearing of CS when it's already clear, the main problem of going
through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
notify handler is invoked more than once before going out to user
space. Doing this once when everything has stabilized is the only solution
to avoid this.

The initial attempt to completely decouple it from the TIF work turned out
to be suboptimal for workloads which do a lot of quick and short system
calls. Even if the fast path decision is only 4 instructions (including a
conditional branch), this adds up quickly and becomes measurable when the
rate for actually having to handle rseq is in the low single-digit
percentage range of user/kernel transitions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irq-entry-common.h |    7 ++-----
 include/linux/resume_user_mode.h |    2 +-
 include/linux/rseq.h             |   24 ++++++++++++++++++------
 include/linux/rseq_entry.h       |    2 +-
 init/Kconfig                     |    2 +-
 kernel/entry/common.c            |   17 ++++++++++++++---
 kernel/rseq.c                    |    8 ++++++--
 7 files changed, 43 insertions(+), 19 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -197,11 +197,8 @@ static __always_inline void arch_exit_to
  */
 void arch_do_signal_or_restart(struct pt_regs *regs);
 
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- */
-unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-				     unsigned long ti_work);
+/* Handle pending TIF work */
+unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
 
 /**
  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 
-	rseq_handle_notify_resume(regs);
+	rseq_handle_slowpath(regs);
 }
 
 #endif /* LINUX_RESUME_USER_MODE_H */
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -5,13 +5,19 @@
 #ifdef CONFIG_RSEQ
 #include <linux/sched.h>
 
-void __rseq_handle_notify_resume(struct pt_regs *regs);
+void __rseq_handle_slowpath(struct pt_regs *regs);
 
-static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+/* Invoked from resume_user_mode_work() */
+static inline void rseq_handle_slowpath(struct pt_regs *regs)
 {
-	/* '&' is intentional to spare one conditional branch */
-	if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
-		__rseq_handle_notify_resume(regs);
+	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
+		if (current->rseq_event.slowpath)
+			__rseq_handle_slowpath(regs);
+	} else {
+		/* '&' is intentional to spare one conditional branch */
+		if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
+			__rseq_handle_slowpath(regs);
+	}
 }
 
 void __rseq_signal_deliver(int sig, struct pt_regs *regs);
@@ -138,6 +144,12 @@ static inline void rseq_fork(struct task
 		t->rseq_sig = current->rseq_sig;
 		t->rseq_ids.cpu_cid = ~0ULL;
 		t->rseq_event = current->rseq_event;
+		/*
+		 * If it has rseq, force it into the slow path right away
+		 * because it is guaranteed to fault.
+		 */
+		if (t->rseq_event.has_rseq)
+			t->rseq_event.slowpath = true;
 	}
 }
 
@@ -151,7 +163,7 @@ static inline void rseq_execve(struct ta
 }
 
 #else /* CONFIG_RSEQ */
-static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
 static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -433,7 +433,7 @@ static rseq_inline bool rseq_update_usr(
  * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
  * slow path there will handle the fail.
  */
-static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
 {
 	struct task_struct *t = current;
 
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1911,7 +1911,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
 config DEBUG_RSEQ
 	default n
 	bool "Enable debugging of rseq() system call" if EXPERT
-	depends on RSEQ && DEBUG_KERNEL
+	depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
 	select RSEQ_DEBUG_DEFAULT_ENABLE
 	help
 	  Enable extra debugging checks for the rseq system call.
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -23,8 +23,7 @@ void __weak arch_do_signal_or_restart(st
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
-	while (ti_work & EXIT_TO_USER_MODE_WORK) {
-
+	do {
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
@@ -56,7 +55,19 @@ void __weak arch_do_signal_or_restart(st
 		tick_nohz_user_enter_prepare();
 
 		ti_work = read_thread_flags();
-	}
+
+		/*
+		 * This returns the unmodified ti_work, when ti_work is not
+		 * empty. In that case it waits for the next round to avoid
+		 * multiple updates in case of rescheduling.
+		 *
+		 * When it handles rseq it returns either with empty work
+		 * on success or with TIF_NOTIFY_RESUME set on failure to
+		 * kick the handling into the slow path.
+		 */
+		ti_work = rseq_exit_to_user_mode_work(regs, ti_work, EXIT_TO_USER_MODE_WORK);
+
+	} while (ti_work & EXIT_TO_USER_MODE_WORK);
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
 	return ti_work;
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -234,7 +234,11 @@ static bool rseq_handle_cs(struct task_s
 
 static void rseq_slowpath_update_usr(struct pt_regs *regs)
 {
-	/* Preserve rseq state and user_irq state for exit to user */
+	/*
+	 * Preserve rseq state and user_irq state. The generic entry code
+	 * clears user_irq on the way out, the non-generic entry
+	 * architectures are not having user_irq.
+	 */
 	const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
 	struct task_struct *t = current;
 	struct rseq_ids ids;
@@ -286,7 +290,7 @@ static void rseq_slowpath_update_usr(str
 	}
 }
 
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+void __rseq_handle_slowpath(struct pt_regs *regs)
 {
 	/*
 	 * If invoked from hypervisors before entering the guest via

Re: [patch V2 28/37] rseq: Switch to fast path processing on exit to user

Posted by Mathieu Desnoyers 1 month, 1 week ago

On 2025-08-23 12:40, Thomas Gleixner wrote:
> Now that all bits and pieces are in place, hook the RSEQ handling fast path
> function into exit_to_user_mode_prepare() after the TIF work bits have been
> handled. In case of fast path failure, TIF_NOTIFY_RESUME has been raised
> and the caller needs to take another turn through the TIF handling slow
> path.
> 
> This only works for architectures which use the generic entry code.
> Architectures which still have their own incomplete hacks are not supported
> and won't be.
> 
> This results in the following improvements:
> 
>   Kernel build        Before                After                 Reduction
>
>   exit to user:        80692981              80514451
>   signal checks:          32581                   121             99%
>   slowpath runs:        1201408    1.49%          198    0.00%    100%
>   fastpath runs:                               675941    0.84%    N/A
>   id updates:           1233989    1.53%        50541    0.06%    96%
>   cs checks:            1125366    1.39%            0    0.00%    100%
>     cs cleared:         1125366     100%            0             100%
>     cs fixup:                 0       0%            0
>
>   RSEQ selftests      Before                After                 Reduction
>
>   exit to user:       386281778             387373750
>   signal checks:       35661203                     0             100%
>   slowpath runs:      140542396   36.38%          100    0.00%    100%
>   fastpath runs:                              9509789    2.51%    N/A
>   id updates:         176203599   45.62%      9087994    2.35%    95%
>   cs checks:          175587856   45.46%      4728394    1.22%    98%
>     cs cleared:       172359544   98.16%      1319307   27.90%    99%
>     cs fixup:           3228312    1.84%      3409087   72.10%
> 
> The 'cs cleared' and 'cs fixup' percentages are not relative to the exit
> to user invocations, they are relative to the actual 'cs check'
> invocations.
> 
> While some of this could have been avoided in the original code, like the
> obvious clearing of CS when it's already clear, the main problem of going
> through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
> notify handler is invoked more than once before going out to user
> space. Doing this once when everything has stabilized is the only solution
> to avoid this.
> 
> The initial attempt to completely decouple it from the TIF work turned out
> to be suboptimal for workloads, which do a lot of quick and short system
> calls. Even if the fast path decision is only 4 instructions (including a
> conditional branch), this adds up quickly and becomes measurable when the
> rate for actually having to handle rseq is in the low single digit
> percentage range of user/kernel transitions.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [patch V2 28/37] rseq: Switch to fast path processing on exit to user

Posted by Mathieu Desnoyers 1 month, 1 week ago

On 2025-08-26 11:40, Mathieu Desnoyers wrote:
> On 2025-08-23 12:40, Thomas Gleixner wrote:
>> Now that all bits and pieces are in place, hook the RSEQ handling fast path
>> function into exit_to_user_mode_prepare() after the TIF work bits have been
>> handled. In case of fast path failure, TIF_NOTIFY_RESUME has been raised
>> and the caller needs to take another turn through the TIF handling slow
>> path.
>>
>> This only works for architectures which use the generic entry code.
>> Architectures which still have their own incomplete hacks are not supported
>> and won't be.
>>
>> This results in the following improvements:
>>
>>   Kernel build        Before                After                 Reduction
>>
>>   exit to user:        80692981              80514451
>>   signal checks:          32581                   121             99%
>>   slowpath runs:        1201408    1.49%          198    0.00%    100%
>>   fastpath runs:                               675941    0.84%    N/A
>>   id updates:           1233989    1.53%        50541    0.06%    96%
>>   cs checks:            1125366    1.39%            0    0.00%    100%
>>     cs cleared:         1125366     100%            0             100%
>>     cs fixup:                 0       0%            0
>>
>>   RSEQ selftests      Before                After                 Reduction
>>
>>   exit to user:       386281778             387373750
>>   signal checks:       35661203                     0             100%
>>   slowpath runs:      140542396   36.38%          100    0.00%    100%
>>   fastpath runs:                              9509789    2.51%    N/A
>>   id updates:         176203599   45.62%      9087994    2.35%    95%
>>   cs checks:          175587856   45.46%      4728394    1.22%    98%
>>     cs cleared:       172359544   98.16%      1319307   27.90%    99%
>>     cs fixup:           3228312    1.84%      3409087   72.10%

By the way, you should really not be using the entire rseq selftests
as a representative workload for profiling the kernel rseq implementation.

Those selftests include "loop injection", "yield injection", "kill
injection" and "sleep injection" within the relevant userspace code
paths, which really increase the likelihood of hitting stuff like
"cs fixup" compared to anything that comes close to a realistic
use-case. This is really useful for testing correctness, but not
for profiling. For instance, the "loop injection" introduces busy
loops within rseq critical sections to significantly increase the
likelihood of hitting a cs fixup.

Those specific selftests are really just "stress-tests" that don't
represent any relevant workload.

The rseq selftests that are more relevant for the type of profiling
you are trying to do here are the "param_test_benchmark" ones. Those
entirely compile out the injection code and focus on the performance
of the rseq fast path under heavy use. This is already more representative
of a semi-realistic "super-heavy" rseq workload (you could see it
as an upper bound for worst-case rseq use).

I suspect that using this for profiling, you will find out that
optimizing the "cs fixup" code path is not relevant.

The following patch adds a script which runs the "benchmark" tests, which
are more relevant for profiling:

diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 0d0a5fae5954..30339183f8a2 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -21,7 +21,7 @@ TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test p
  
  TEST_GEN_PROGS_EXTENDED = librseq.so
  
-TEST_PROGS = run_param_test.sh run_syscall_errors_test.sh
+TEST_PROGS = run_param_test.sh run_param_test_benchmark.sh run_syscall_errors_test.sh
  
  TEST_FILES := settings
  
diff --git a/tools/testing/selftests/rseq/run_param_test_benchmark.sh b/tools/testing/selftests/rseq/run_param_test_benchmark.sh
new file mode 100755
index 000000000000..17b3dfcfcdd4
--- /dev/null
+++ b/tools/testing/selftests/rseq/run_param_test_benchmark.sh
@@ -0,0 +1,49 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+ or MIT
+
+NR_CPUS=`grep '^processor' /proc/cpuinfo | wc -l`
+
+EXTRA_ARGS=${@}
+
+OLDIFS="$IFS"
+IFS=$'\n'
+TEST_LIST=(
+	"-T s"
+	"-T l"
+	"-T b"
+	"-T b -M"
+	"-T m"
+	"-T m -M"
+	"-T i"
+	"-T r"
+)
+
+TEST_NAME=(
+	"spinlock"
+	"list"
+	"buffer"
+	"buffer with barrier"
+	"memcpy"
+	"memcpy with barrier"
+	"increment"
+	"membarrier"
+)
+IFS="$OLDIFS"
+
+REPS=10000000
+NR_THREADS=$((6*${NR_CPUS}))
+
+function do_tests()
+{
+	local i=0
+	while [ "$i" -lt "${#TEST_LIST[@]}" ]; do
+		echo "Running benchmark test ${TEST_NAME[$i]}"
+		./param_test_benchmark ${TEST_LIST[$i]} -r ${REPS} -t ${NR_THREADS} ${@} ${EXTRA_ARGS} || exit 1
+
+		echo "Running mm_cid benchmark test ${TEST_NAME[$i]}"
+		./param_test_mm_cid_benchmark ${TEST_LIST[$i]} -r ${REPS} -t ${NR_THREADS} ${@} ${EXTRA_ARGS} || exit 1
+		let "i++"
+	done
+}
+
+do_tests

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [patch V2 28/37] rseq: Switch to fast path processing on exit to user

Posted by Thomas Gleixner 1 month ago

On Wed, Aug 27 2025 at 09:45, Mathieu Desnoyers wrote:
> On 2025-08-26 11:40, Mathieu Desnoyers wrote:
>>>   RSEQ selftests      Before                After                 Reduction
>>>
>>>   exit to user:       386281778             387373750
>>>   signal checks:       35661203                     0             100%
>>>   slowpath runs:      140542396   36.38%          100    0.00%    100%
>>>   fastpath runs:                              9509789    2.51%    N/A
>>>   id updates:         176203599   45.62%      9087994    2.35%    95%
>>>   cs checks:          175587856   45.46%      4728394    1.22%    98%
>>>     cs cleared:       172359544   98.16%      1319307   27.90%    99%
>>>     cs fixup:           3228312    1.84%      3409087   72.10%
>
> By the way, you should really not be using the entire rseq selftests
> as a representative workload for profiling the kernel rseq implementation.
>
> Those selftests include "loop injection", "yield injection", "kill
> injection" and "sleep injection" within the relevant userspace code
> paths, which really increase the likelihood of hitting stuff like
> "cs fixup" compared to anything that comes close to a realistic
> use-case. This is really useful for testing correctness, but not
> for profiling. For instance, the "loop injection" introduces busy
> loops within rseq critical sections to significantly increase the
> likelihood of hitting a cs fixup.
>
> Those specific selftests are really just "stress-tests" that don't
> represent any relevant workload.

True, they still tell how much useless work the kernel was doing, no?

Re: [patch V2 28/37] rseq: Switch to fast path processing on exit to user

Posted by Mathieu Desnoyers 4 weeks, 1 day ago

On 2025-09-02 14:36, Thomas Gleixner wrote:
> On Wed, Aug 27 2025 at 09:45, Mathieu Desnoyers wrote:
>> On 2025-08-26 11:40, Mathieu Desnoyers wrote:
>>>>     RSEQ selftests      Before          After              Reduction
>>>>
>>>>     exit to user:       386281778          387373750
>>>>     signal checks:       35661203                  0           100%
>>>>     slowpath runs:      140542396 36.38%            100  0.00%    100%
>>>>     fastpath runs:                         9509789  2.51%     N/A
>>>>     id updates:         176203599 45.62%        9087994  2.35%     95%
>>>>     cs checks:          175587856 45.46%        4728394  1.22%     98%
>>>>       cs cleared:       172359544   98.16%    1319307   27.90%   99%
>>>>       cs fixup:           3228312    1.84%    3409087   72.10%
>>
>> By the way, you should really not be using the entire rseq selftests
>> as a representative workload for profiling the kernel rseq implementation.
>>
>> Those selftests include "loop injection", "yield injection", "kill
>> injection" and "sleep injection" within the relevant userspace code
>> paths, which really increase the likelihood of hitting stuff like
>> "cs fixup" compared to anything that comes close to a realistic
>> use-case. This is really useful for testing correctness, but not
>> for profiling. For instance, the "loop injection" introduces busy
>> loops within rseq critical sections to significantly increase the
>> likelihood of hitting a cs fixup.
>>
>> Those specific selftests are really just "stress-tests" that don't
>> represent any relevant workload.
> 
> True, they still tell how much useless work the kernel was doing, no?

Somewhat, but they misrepresent what should be considered as fast vs
slow paths, and thus what are relevant optimization targets.

Let me try to explain my thinking further through a comparison with
a periodic task scenario.

Let's suppose you have a periodic task that happens once per day in
normal workloads, and you alter its period in a stress-test to make it
run every 10ms to make sure you hit race conditions quickly for testing
purposes. Of course this periodic task will show up in the profiles as
a fast-path, but that's just because it's been made to run very
frequently by the stress-test setup.

Running busy loops within rseq critical sections is similar: they were
made to trigger aborts on purpose, so the aborts happen much more often
than they would in any workload that is not trying to trigger this on
purpose.

So yes, the work that you see there under stress test is indeed work
that the kernel is doing in those situations, but it over-represents
the frequency of rseq aborts because those are precisely what the
stress-tests are aiming to trigger.
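
To make the over-representation concrete, here is a minimal sketch of a toy
model. It is not kernel code and not part of the selftests: count_aborts(),
its parameters and the tick-based notion of time are all made up for
illustration. The task is assumed to be preempted at every multiple of
`slice` ticks, and a critical section counts as aborted when a preemption
point falls strictly inside it.

```c
#include <stdint.h>

/*
 * Toy model (hypothetical, for illustration only): critical sections of
 * cs_len ticks are separated by gap ticks of other work, and preemption
 * happens at every tick that is a multiple of slice. Returns how many of
 * the sections have a preemption point strictly inside them.
 */
static uint64_t count_aborts(uint64_t sections, uint64_t cs_len,
                             uint64_t gap, uint64_t slice)
{
    uint64_t aborts = 0;
    uint64_t start = 1;    /* begin just after a preemption point */

    for (uint64_t i = 0; i < sections; i++) {
        uint64_t end = start + cs_len;
        /* First preemption point strictly after the section start. */
        uint64_t next = (start / slice + 1) * slice;

        if (next < end)
            aborts++;
        start = end + gap;
    }
    return aborts;
}
```

With these made-up numbers, count_aborts(1000, 2, 97, 1000) models short
critical sections and yields 1 abort in 1000 sections, while
count_aborts(1000, 502, 97, 1000) models the same code with a 500-tick busy
loop injected and yields 501. The abort path is the same code in both cases;
only how often it is reached changes.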

This is why I discourage using the loop/yield/kill/sleep injection
parts of the selftests for profiling purposes, and rather recommend
using the "benchmark" selftests which are much closer to real-life
workloads.

Of course if you are interested in optimizing the rseq ip fixup code
path, then using the stress-tests *is* relevant, because it allows
hitting that code often enough to make it significant in profiles.
But that does not mean that the rseq ip fixup scenario happens often
enough in real-life workloads to justify optimizing it.

All that being said, I'm perfectly fine with your improvements, but
I just want to clarify what should be considered as relevant metrics
that justify future optimization efforts and orient future optimization
vs code complexity trade-offs.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [patch V2 28/37] rseq: Switch to fast path processing on exit to user

Posted by Thomas Gleixner 4 weeks, 1 day ago

On Thu, Sep 04 2025 at 13:54, Mathieu Desnoyers wrote:
> On 2025-09-02 14:36, Thomas Gleixner wrote:
>
> All that being said, I'm perfectly fine with your improvements, but
> I just want to clarify what should be considered as relevant metrics
> that justify future optimization efforts and orient future optimization
> vs code complexity trade-offs.

I understand that.

Though my main objective was to optimize for the 'nothing to see here'
case, which is hit both in a kernel compile and also in the stress test
suite as the numbers show.

I definitely was not optimizing for the actual handling of critical
sections in the first place. That this turned out to be slightly more
efficient is mostly a byproduct of the main goal as I just integrated
stuff more tightly.

So the actual benchmark code which will only rarely hit that path is not
that interesting. I ran the benchmark script out of curiosity
nevertheless. Here you go:

Before:
      27.883787661 seconds time elapsed
    2983.093796000 seconds user
       4.227902000 seconds sys

After:
      27.908213568 seconds time elapsed
    2994.785114000 seconds user
       2.555690000 seconds sys

The times have quite some variance across multiple runs on both kernels,
but the trend of spending significantly fewer kernel cycles is very
consistent.
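
For reference, the relative changes for the single pair of runs quoted above
can be worked out with a small helper. This is only a sketch: pct_change() is
a made-up name, and given the run-to-run variance the results describe these
two runs, not a stable measurement.

```c
/*
 * Relative change from before to after, in percent.
 * A negative result means a reduction.
 */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Timings from the two benchmark runs quoted above, in seconds. */
static const double elapsed_before = 27.883787661, elapsed_after = 27.908213568;
static const double user_before = 2983.093796, user_after = 2994.785114;
static const double sys_before = 4.227902, sys_after = 2.555690;
```

pct_change(sys_before, sys_after) comes out at roughly -39.6%, while the
elapsed and user times change by less than half a percent each, which is
within the variance mentioned above.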

Thanks,

        tglx