[patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()

Thomas Gleixner posted 37 patches 1 month, 1 week ago
There is a newer version of this series
Hypervisors invoke resume_user_mode_work() before entering the guest, which
clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
space context available to them, so the rseq notify handler skips
inspecting the critical section, but updates the CPU/MM CID values
unconditionally so that a potentially pending rseq event is not lost on the
way to user space.

This is a pointless exercise as the task might be rescheduled before
actually returning to user space and it creates unnecessary work in the
vcpu_run() loops.

It's way more efficient to ignore that invocation based on @regs == NULL
and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
vcpu_run() loop before returning from the ioctl().

This ensures that a pending RSEQ update is not lost and the IDs are updated
before returning to user space.

Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
a NOOP.
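The scheme can be modeled in plain userspace C. The struct and functions
below are an illustrative model only, not the kernel types: a scheduler
event raises both the pending bit and the work flag, entry into guest mode
clears the work flag without consuming the event (since @regs is NULL), and
the new exit hook re-raises the work flag so the event survives until the
real return to user space.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the relevant task state, not the real kernel types. */
struct task {
	bool tif_notify_resume;		/* models TIF_NOTIFY_RESUME */
	bool rseq_event_pending;	/* models task_struct::rseq_event_pending */
};

/* Scheduler records an rseq event and raises the notify-resume work flag. */
static void sched_switch(struct task *t)
{
	t->rseq_event_pending = true;
	t->tif_notify_resume = true;
}

/*
 * Entry into guest mode: resume_user_mode_work() runs with @regs == NULL
 * and clears TIF_NOTIFY_RESUME. With this patch the rseq handler ignores
 * the invocation, so the event stays pending.
 */
static void resume_user_mode_work_null_regs(struct task *t)
{
	t->tif_notify_resume = false;
	/* rseq handler: if (!regs) return; */
}

/* After the vcpu_run() loop: re-raise the work flag if an event is pending. */
static void rseq_virt_userspace_exit(struct task *t)
{
	if (t->rseq_event_pending)
		t->tif_notify_resume = true;
}
```

With this model, a preemption inside the vcpu loop followed by the guest
entry leaves the event pending but the flag clear; the exit hook restores
the flag before the ioctl() returns.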

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
---
 drivers/hv/mshv_root_main.c |    2 +
 include/linux/rseq.h        |   17 +++++++++
 kernel/rseq.c               |   76 +++++++++++++++++++++++---------------------
 virt/kvm/kvm_main.c         |    3 +
 4 files changed, 62 insertions(+), 36 deletions(-)

--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -585,6 +585,8 @@ static long mshv_run_vp_with_root_schedu
 		}
 	} while (!vp->run.flags.intercept_suspend);
 
+	rseq_virt_userspace_exit();
+
 	return ret;
 }
 
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to
 }
 
 /*
+ * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
+ * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
+ * that case just to do it eventually again before returning to user space,
+ * the entry resume_user_mode_work() invocation is ignored as the register
+ * argument is NULL.
+ *
+ * After returning from guest mode, they have to invoke this function to
+ * re-raise TIF_NOTIFY_RESUME if necessary.
+ */
+static inline void rseq_virt_userspace_exit(void)
+{
+	if (current->rseq_event_pending)
+		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+}
+
+/*
  * If parent process has a registered restartable sequences area, the
  * child inherits. Unregister rseq for a clone with CLONE_VM set.
  */
@@ -68,6 +84,7 @@ static inline void rseq_execve(struct ta
 static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_virt_userspace_exit(void) { }
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
 static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct
 {
 	struct task_struct *t = current;
 	int ret, sig;
+	bool event;
+
+	/*
+	 * If invoked from hypervisors before entering the guest via
+	 * resume_user_mode_work(), then @regs is a NULL pointer.
+	 *
+	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+	 * it before returning from the ioctl() to user space when
+	 * rseq_event.sched_switch is set.
+	 *
+	 * So it's safe to ignore here instead of pointlessly updating it
+	 * in the vcpu_run() loop.
+	 */
+	if (!regs)
+		return;
 
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
 	/*
-	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
-	 * pointer, so fixup cannot be done. If the syscall which led to
-	 * this invocation was invoked inside a critical section, then it
-	 * will either end up in this code again or a possible violation of
-	 * a syscall inside a critical region can only be detected by the
-	 * debug code in rseq_syscall() in a debug enabled kernel.
+	 * Read and clear the event pending bit first. If the task
+	 * was not preempted or migrated or a signal is on the way,
+	 * there is no point in doing any of the heavy lifting here
+	 * on production kernels. In that case TIF_NOTIFY_RESUME
+	 * was raised by some other functionality.
+	 *
+	 * This is correct because the read/clear operation is
+	 * guarded against scheduler preemption, which makes it CPU
+	 * local atomic. If the task is preempted right after
+	 * re-enabling preemption then TIF_NOTIFY_RESUME is set
+	 * again and this function is invoked another time _before_
+	 * the task is able to return to user mode.
+	 *
+	 * On a debug kernel, invoke the fixup code unconditionally
+	 * with the result handed in to allow the detection of
+	 * inconsistencies.
 	 */
-	if (regs) {
-		/*
-		 * Read and clear the event pending bit first. If the task
-		 * was not preempted or migrated or a signal is on the way,
-		 * there is no point in doing any of the heavy lifting here
-		 * on production kernels. In that case TIF_NOTIFY_RESUME
-		 * was raised by some other functionality.
-		 *
-		 * This is correct because the read/clear operation is
-		 * guarded against scheduler preemption, which makes it CPU
-		 * local atomic. If the task is preempted right after
-		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
-		 * again and this function is invoked another time _before_
-		 * the task is able to return to user mode.
-		 *
-		 * On a debug kernel, invoke the fixup code unconditionally
-		 * with the result handed in to allow the detection of
-		 * inconsistencies.
-		 */
-		bool event;
-
-		scoped_guard(RSEQ_EVENT_GUARD) {
-			event = t->rseq_event_pending;
-			t->rseq_event_pending = false;
-		}
+	scoped_guard(RSEQ_EVENT_GUARD) {
+		event = t->rseq_event_pending;
+		t->rseq_event_pending = false;
+	}
 
-		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
-			ret = rseq_ip_fixup(regs, event);
-			if (unlikely(ret < 0))
-				goto error;
-		}
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+		ret = rseq_ip_fixup(regs, event);
+		if (unlikely(ret < 0))
+			goto error;
 	}
+
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
 	return;
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include <linux/lockdep.h>
 #include <linux/kthread.h>
 #include <linux/suspend.h>
+#include <linux/rseq.h>
 
 #include <asm/processor.h>
 #include <asm/ioctl.h>
@@ -4466,6 +4467,8 @@ static long kvm_vcpu_ioctl(struct file *
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		vcpu->wants_to_run = false;
 
+		rseq_virt_userspace_exit();
+
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;
 	}
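The reworked __rseq_handle_notify_resume() flow above can be sketched as a
small userspace model (types and names are invented stand-ins, and the
RSEQ_EVENT_GUARD preemption guard is only noted in a comment): a NULL regs
pointer bails out early and leaves the event pending, otherwise the pending
bit is consumed and the fixup runs only when an event was recorded, or
unconditionally on a debug kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented model of the task state, not the kernel's task_struct. */
struct task {
	bool rseq_event_pending;
	int fixups_run;		/* counts rseq_ip_fixup() invocations */
};

/* @regs stands in for struct pt_regs *; @debug_rseq for CONFIG_DEBUG_RSEQ. */
static void handle_notify_resume(struct task *t, void *regs, bool debug_rseq)
{
	bool event;

	/* Hypervisor entry path: no user context, leave the event pending. */
	if (!regs)
		return;

	/*
	 * In the kernel this read/clear runs inside RSEQ_EVENT_GUARD, i.e.
	 * guarded against scheduler preemption, which makes it a CPU-local
	 * atomic operation. The model elides that.
	 */
	event = t->rseq_event_pending;
	t->rseq_event_pending = false;

	/* Skip the heavy lifting when nothing happened, except on debug kernels. */
	if (debug_rseq || event)
		t->fixups_run++;
}
```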
Re: [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()
Posted by Mathieu Desnoyers 1 month, 1 week ago
On 2025-08-23 12:39, Thomas Gleixner wrote:
> Hypervisors invoke resume_user_mode_work() before entering the guest, which
> clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
> space context available to them, so the rseq notify handler skips
> inspecting the critical section, but updates the CPU/MM CID values
> unconditionally so that the eventual pending rseq event is not lost on the
> way to user space.
> 
> This is a pointless exercise as the task might be rescheduled before
> actually returning to user space and it creates unnecessary work in the
> vcpu_run() loops.

One question here: AFAIU, this removes the updates to the cpu_id_start,
cpu_id, mm_cid, and node_id fields on exit to virt usermode. This means
that while the virt guest is running in usermode, the host hypervisor
process has stale rseq fields, until it eventually returns to the
hypervisor's host userspace (from ioctl).

Considering the rseq uapi documentation, this should not matter.
Each of those fields has this statement:

"This field should only be read by the thread which registered this data
structure."

I can however think of use-cases for reading the rseq fields from other
hypervisor threads to figure out information about thread placement.
Doing so would however go against the documented uapi.

I'd rather ask whether anyone is misusing this uapi in that way before
going ahead with the change, just to prevent surprises.

I'm OK with the re-trigger of rseq, as it does indeed appear to fix
an issue, but I'm concerned about the ABI impact of skipping the
rseq_update_cpu_node_id() on return to virt userspace.

Thoughts ?

Thanks,

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()
Posted by Sean Christopherson 1 month, 1 week ago
On Mon, Aug 25, 2025, Mathieu Desnoyers wrote:
> One question here: AFAIU, this removes the updates to the cpu_id_start,
> cpu_id, mm_cid, and node_id fields on exit to virt usermode. This means
> that while the virt guest is running in usermode, the host hypervisor
> process has stale rseq fields, until it eventually returns to the
> hypervisor's host userspace (from ioctl).
> 
> Considering the rseq uapi documentation, this should not matter.
> Each of those fields have this statement:
> 
> "This field should only be read by the thread which registered this data
> structure."
> 
> I can however think of use-cases for reading the rseq fields from other
> hypervisor threads to figure out information about thread placement.
> Doing so would however go against the documented uapi.
> 
> I'd rather ask whether anyone is misusing this uapi in that way before
> going ahead with the change, just to prevent surprises.
> 
> I'm OK with the re-trigger of rseq, as it does indeed appear to fix
> an issue, but I'm concerned about the ABI impact of skipping the
> rseq_update_cpu_node_id() on return to virt userspace.
> 
> Thoughts ?

I know the idea of exposing rseq to paravirtualized guests has been floated (more
than once), but I don't _think_ anyone has actually shipped anything of that 
nature.

> > @@ -49,6 +49,7 @@
> >   #include <linux/lockdep.h>
> >   #include <linux/kthread.h>
> >   #include <linux/suspend.h>
> > +#include <linux/rseq.h>
> >   #include <asm/processor.h>
> >   #include <asm/ioctl.h>
> > @@ -4466,6 +4467,8 @@ static long kvm_vcpu_ioctl(struct file *
> >   		r = kvm_arch_vcpu_ioctl_run(vcpu);
> >   		vcpu->wants_to_run = false;
> > +		rseq_virt_userspace_exit();

I don't love bleeding even more entry/rseq details into KVM.  Rather than optimize
KVM and then add TIF_RSEQ, what if we do the opposite?  I.e. add TIF_RSEQ to
XFER_TO_GUEST_MODE_WORK as part of "rseq: Switch to TIF_RSEQ if supported", and
then drop TIF_RSEQ from XFER_TO_GUEST_MODE_WORK in a new patch?

That should make it easier to revert the KVM/virt change if it turns out PV setups
are playing games with rseq, and it would give the stragglers (arm64 in particular)
some motivation to implement TIF_RSEQ and/or switch to generic TIF bits.
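The proposal amounts to moving a bit in and out of the xfer-to-guest work
mask; it can be sketched as mask arithmetic (all bit names and values here
are hypothetical stand-ins, the real definitions are per-architecture):

```c
#include <assert.h>

/* Hypothetical TIF bits, not the real per-arch values. */
#define TIF_NEED_RESCHED	(1u << 0)
#define TIF_NOTIFY_RESUME	(1u << 1)
#define TIF_RSEQ		(1u << 2)	/* proposed dedicated rseq bit */

/* Step 1: TIF_RSEQ is part of the guest-entry work mask. */
#define XFER_TO_GUEST_MODE_WORK_WITH_RSEQ \
	(TIF_NEED_RESCHED | TIF_NOTIFY_RESUME | TIF_RSEQ)

/* Step 2: a follow-up patch drops it, so a pending rseq event no longer
 * interrupts guest entry and is handled on the real return to user space. */
#define XFER_TO_GUEST_MODE_WORK_WITHOUT_RSEQ \
	(TIF_NEED_RESCHED | TIF_NOTIFY_RESUME)

/* Work is pending for guest entry iff a masked bit is set. */
static unsigned int pending_xfer_work(unsigned int ti_flags, unsigned int mask)
{
	return ti_flags & mask;
}
```

Reverting the virt-specific behavior would then be a one-line change to the
mask rather than touching the KVM ioctl path.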
Re: [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()
Posted by Thomas Gleixner 1 month ago
On Mon, Aug 25 2025 at 13:24, Sean Christopherson wrote:
> On Mon, Aug 25, 2025, Mathieu Desnoyers wrote:
>> > @@ -4466,6 +4467,8 @@ static long kvm_vcpu_ioctl(struct file *
>> >   		r = kvm_arch_vcpu_ioctl_run(vcpu);
>> >   		vcpu->wants_to_run = false;
>> > +		rseq_virt_userspace_exit();
>
> I don't love bleeding even more entry/rseq details into KVM.

Neither do I.

> Rather than optimize KVM and then add TIF_RSEQ, what if we do the
> opposite?

I'm not optimizing KVM. I'm simplifying the RSEQ parts to ignore
TIF_NOTIFY_RESUME when invoked with @regs == NULL.

> I.e. add TIF_RSEQ to XFER_TO_GUEST_MODE_WORK as part of "rseq: Switch
> to TIF_RSEQ if supported", and then drop TIF_RSEQ from
> XFER_TO_GUEST_MODE_WORK in a new patch?

The problem is that I have to keep all the architectures which

    - do not use the generic entry code
    - therefore can't be switched trivially over to the TIF_RSEQ scheme
    - have RSEQ support enabled

alive and working.

> That should make it easier to revert the KVM/virt change if it turns
> out PV setups are playing games with rseq,

I can't find a hint of such an insanity in the kernel, so *shrug*.

If there is out of tree code which plays games with the vCPU's user
space thread::TLS::rseq, then it rightfully breaks. The update, which
happens today, is just coincidence and a kernel internal implementation
detail.

> and it would give the stragglers (arm64 in particular) some
> motivation to implement TIF_RSEQ and/or switch to generic TIF bits.

There is enough motivation in this series to do so :)

Thanks,

        tglx