Leave KVM's user-return notifier registered in the unlikely case that the
notifier is registered when disabling virtualization via IPI callback in
response to reboot/shutdown. On reboot/shutdown, keeping the notifier
registered is ok as far as MSR state is concerned (arguably better than
restoring MSRs at an unknown point in time), as the callback will run
cleanly and restore host MSRs if the CPU manages to return to userspace
before the system goes down.
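For reference, the notifier callback boils down to the below (copied from
x86.c, trimmed slightly for brevity): it unregisters itself and then writes
the host's value back into every MSR that KVM clobbered.

	static void kvm_on_user_return(struct user_return_notifier *urn)
	{
		unsigned slot;
		struct kvm_user_return_msrs *msrs =
			container_of(urn, struct kvm_user_return_msrs, urn);
		struct kvm_user_return_msr_values *values;
		unsigned long flags;

		/* Unregister with IRQs disabled to avoid racing with the IPI. */
		local_irq_save(flags);
		if (msrs->registered) {
			msrs->registered = false;
			user_return_notifier_unregister(urn);
		}
		local_irq_restore(flags);

		/* Restore the host's value for every MSR KVM has clobbered. */
		for (slot = 0; slot < kvm_nr_uret_msrs; ++slot) {
			values = &msrs->values[slot];
			if (values->host != values->curr) {
				wrmsrl(kvm_uret_msrs_list[slot], values->host);
				values->curr = values->host;
			}
		}
	}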
The only wrinkle is that if kvm.ko module unload manages to race with
reboot/shutdown, then leaving the notifier registered could lead to
use-after-free due to calling into unloaded kvm.ko module code. But such
a race is only possible on a --forced reboot/shutdown, because otherwise
userspace tasks would be frozen before kvm_shutdown() is called, i.e. on a
"normal" reboot/shutdown, it should be impossible for the CPU to return to
userspace after kvm_shutdown().
Furthermore, on a --forced reboot/shutdown, unregistering the user-return
hook from IRQ context doesn't fully guard against use-after-free, because
KVM could immediately re-register the hook, e.g. if the IRQ arrives before
kvm_user_return_register_notifier() is called.
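For reference, registration (in the pre-existing code) is keyed off a per-CPU
flag that the callback clears when it unregisters itself, and so if the IPI
fires just before the helper below runs, KVM will blithely register the hook
all over again:

	static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
	{
		/* The IPI unregistering the hook also clears "registered"... */
		if (!msrs->registered) {
			msrs->urn.on_user_return = kvm_on_user_return;
			user_return_notifier_register(&msrs->urn);
			/* ...and so the hook simply gets re-registered here. */
			msrs->registered = true;
		}
	}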
Rather than trying to guard against the IPI in the "normal" user-return
code, which is difficult and noisy, simply leave the user-return notifier
registered on a reboot, and bump the kvm.ko module refcount to defend
against a use-after-free due to kvm.ko unload racing against reboot.
Alternatively, KVM could allow unloading kvm.ko and try to drop the notifiers
during kvm_x86_exit(), but that's also a can of worms as registration is
per-CPU, and so KVM would need to blast an IPI, and doing so while a
reboot/shutdown is in-progress is far riskier than preventing userspace from
unloading KVM.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bb7a7515f280..c927326344b1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13086,7 +13086,21 @@ int kvm_arch_enable_virtualization_cpu(void)
 void kvm_arch_disable_virtualization_cpu(void)
 {
 	kvm_x86_call(disable_virtualization_cpu)();
-	drop_user_return_notifiers();
+
+	/*
+	 * Leave the user-return notifiers as-is when disabling virtualization
+	 * for reboot, i.e. when disabling via IPI function call, and instead
+	 * pin kvm.ko (if it's a module) to defend against use-after-free (in
+	 * the *very* unlikely scenario module unload is racing with reboot).
+	 * On a forced reboot, tasks aren't frozen before shutdown, and so KVM
+	 * could be actively modifying user-return MSR state when the IPI to
+	 * disable virtualization arrives.  Handle the extreme edge case here
+	 * instead of trying to account for it in the normal flows.
+	 */
+	if (in_task() || WARN_ON_ONCE(!kvm_rebooting))
+		drop_user_return_notifiers();
+	else
+		__module_get(THIS_MODULE);
 }
 
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
--
2.51.1.930.gacf6e81ea2-goog
On Thu, Oct 30, 2025 at 12:15:27PM -0700, Sean Christopherson wrote:
>Leave KVM's user-return notifier registered in the unlikely case that the
>notifier is registered when disabling virtualization via IPI callback in
>response to reboot/shutdown. On reboot/shutdown, keeping the notifier
>registered is ok as far as MSR state is concerned (arguably better than
>restoring MSRs at an unknown point in time), as the callback will run
>cleanly and restore host MSRs if the CPU manages to return to userspace
>before the system goes down.
>
>The only wrinkle is that if kvm.ko module unload manages to race with
>reboot/shutdown, then leaving the notifier registered could lead to
>use-after-free due to calling into unloaded kvm.ko module code. But such
>a race is only possible on a --forced reboot/shutdown, because otherwise
>userspace tasks would be frozen before kvm_shutdown() is called, i.e. on a
>"normal" reboot/shutdown, it should be impossible for the CPU to return to
>userspace after kvm_shutdown().
>
>Furthermore, on a --forced reboot/shutdown, unregistering the user-return
>hook from IRQ context doesn't fully guard against use-after-free, because
>KVM could immediately re-register the hook, e.g. if the IRQ arrives before
>kvm_user_return_register_notifier() is called.
>
>Rather than trying to guard against the IPI in the "normal" user-return
>code, which is difficult and noisy, simply leave the user-return notifier
>registered on a reboot, and bump the kvm.ko module refcount to defend
>against a use-after-free due to kvm.ko unload racing against reboot.
>
>Alternatively, KVM could allow unloading kvm.ko and try to drop the notifiers
>during kvm_x86_exit(), but that's also a can of worms as registration is
>per-CPU, and so KVM would need to blast an IPI, and doing so while a
>reboot/shutdown is in-progress is far riskier than preventing userspace from
>unloading KVM.
>
>Signed-off-by: Sean Christopherson <seanjc@google.com>
>---
> arch/x86/kvm/x86.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index bb7a7515f280..c927326344b1 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -13086,7 +13086,21 @@ int kvm_arch_enable_virtualization_cpu(void)
> void kvm_arch_disable_virtualization_cpu(void)
> {
> 	kvm_x86_call(disable_virtualization_cpu)();
>-	drop_user_return_notifiers();
>+
>+	/*
>+	 * Leave the user-return notifiers as-is when disabling virtualization
>+	 * for reboot, i.e. when disabling via IPI function call, and instead
>+	 * pin kvm.ko (if it's a module) to defend against use-after-free (in
>+	 * the *very* unlikely scenario module unload is racing with reboot).
>+	 * On a forced reboot, tasks aren't frozen before shutdown, and so KVM
>+	 * could be actively modifying user-return MSR state when the IPI to
>+	 * disable virtualization arrives.  Handle the extreme edge case here
>+	 * instead of trying to account for it in the normal flows.
>+	 */
>+	if (in_task() || WARN_ON_ONCE(!kvm_rebooting))
>+		drop_user_return_notifiers();
>+	else
>+		__module_get(THIS_MODULE);
This doesn't pin kvm-{intel,amd}.ko, right? If so, there is still a potential
use-after-free if the CPU returns to userspace after the per-CPU
user_return_msrs is freed on kvm-{intel,amd}.ko unload.
I think we need to either move __module_get() into
kvm_x86_call(disable_virtualization_cpu)() or allocate/free the per-CPU
user_return_msrs when loading/unloading kvm.ko. e.g.,
From 0269f0ee839528e8a9616738d615a096901d6185 Mon Sep 17 00:00:00 2001
From: Chao Gao <chao.gao@intel.com>
Date: Fri, 7 Nov 2025 00:10:28 -0800
Subject: [PATCH] KVM: x86: Allocate/free user_return_msrs at kvm.ko
(un)loading time
Move the user_return_msrs allocation/free from vendor module (kvm-intel.ko
and kvm-amd.ko) (un)load time to kvm.ko (un)load time, to make it less risky
to access user_return_msrs from kvm.ko. Tying the lifetime of
user_return_msrs to the vendor modules makes every access to
user_return_msrs prone to use-after-free, as the vendor modules may be
unloaded at any time.

kvm_nr_uret_msrs is still reset to 0 when a vendor module is loaded, to
clear out the user-return MSR list configured by the previous vendor module.
Signed-off-by: Chao Gao <chao.gao@intel.com>
---
arch/x86/kvm/x86.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bb7a7515f280..ab411bd09567 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -575,18 +575,17 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
 		vcpu->arch.apf.gfns[i] = ~0;
 }
 
-static int kvm_init_user_return_msrs(void)
+static int __init kvm_init_user_return_msrs(void)
 {
 	user_return_msrs = alloc_percpu(struct kvm_user_return_msrs);
 	if (!user_return_msrs) {
 		pr_err("failed to allocate percpu user_return_msrs\n");
 		return -ENOMEM;
 	}
-	kvm_nr_uret_msrs = 0;
 	return 0;
 }
 
-static void kvm_free_user_return_msrs(void)
+static void __exit kvm_free_user_return_msrs(void)
 {
 	int cpu;
 
@@ -10044,13 +10043,11 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 		return -ENOMEM;
 	}
 
-	r = kvm_init_user_return_msrs();
-	if (r)
-		goto out_free_x86_emulator_cache;
+	kvm_nr_uret_msrs = 0;
 
 	r = kvm_mmu_vendor_module_init();
 	if (r)
-		goto out_free_percpu;
+		goto out_free_x86_emulator_cache;
 
 	kvm_caps.supported_vm_types = BIT(KVM_X86_DEFAULT_VM);
 	kvm_caps.supported_mce_cap = MCG_CTL_P | MCG_SER_P;
@@ -10148,8 +10145,6 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 	kvm_x86_call(hardware_unsetup)();
 out_mmu_exit:
 	kvm_mmu_vendor_module_exit();
-out_free_percpu:
-	kvm_free_user_return_msrs();
 out_free_x86_emulator_cache:
 	kmem_cache_destroy(x86_emulator_cache);
 	return r;
@@ -10178,7 +10173,6 @@ void kvm_x86_vendor_exit(void)
 #endif
 	kvm_x86_call(hardware_unsetup)();
 	kvm_mmu_vendor_module_exit();
-	kvm_free_user_return_msrs();
 	kmem_cache_destroy(x86_emulator_cache);
 #ifdef CONFIG_KVM_XEN
 	static_key_deferred_flush(&kvm_xen_enabled);
@@ -14361,8 +14355,14 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
 
 static int __init kvm_x86_init(void)
 {
+	int r;
+
 	kvm_init_xstate_sizes();
 
+	r = kvm_init_user_return_msrs();
+	if (r)
+		return r;
+
 	kvm_mmu_x86_module_init();
 	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
 	return 0;
@@ -14371,6 +14371,7 @@ module_init(kvm_x86_init);
 
 static void __exit kvm_x86_exit(void)
 {
+	kvm_free_user_return_msrs();
 	WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu));
 }
 module_exit(kvm_x86_exit);
--
2.47.3
On Fri, Nov 07, 2025, Chao Gao wrote:
> On Thu, Oct 30, 2025 at 12:15:27PM -0700, Sean Christopherson wrote:
> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >index bb7a7515f280..c927326344b1 100644
> >--- a/arch/x86/kvm/x86.c
> >+++ b/arch/x86/kvm/x86.c
> >@@ -13086,7 +13086,21 @@ int kvm_arch_enable_virtualization_cpu(void)
> > void kvm_arch_disable_virtualization_cpu(void)
> > {
> > 	kvm_x86_call(disable_virtualization_cpu)();
> >-	drop_user_return_notifiers();
> >+
> >+	/*
> >+	 * Leave the user-return notifiers as-is when disabling virtualization
> >+	 * for reboot, i.e. when disabling via IPI function call, and instead
> >+	 * pin kvm.ko (if it's a module) to defend against use-after-free (in
> >+	 * the *very* unlikely scenario module unload is racing with reboot).
> >+	 * On a forced reboot, tasks aren't frozen before shutdown, and so KVM
> >+	 * could be actively modifying user-return MSR state when the IPI to
> >+	 * disable virtualization arrives.  Handle the extreme edge case here
> >+	 * instead of trying to account for it in the normal flows.
> >+	 */
> >+	if (in_task() || WARN_ON_ONCE(!kvm_rebooting))
> >+		drop_user_return_notifiers();
> >+	else
> >+		__module_get(THIS_MODULE);
>
> This doesn't pin kvm-{intel,amd}.ko, right? If so, there is still a potential
> use-after-free if the CPU returns to userspace after the per-CPU
> user_return_msrs is freed on kvm-{intel,amd}.ko unload.
>
> I think we need to either move __module_get() into
> kvm_x86_call(disable_virtualization_cpu)() or allocate/free the per-CPU
> user_return_msrs when loading/unloading kvm.ko. e.g.,
Gah, you're right. I considered the complications with vendor modules, but missed
the kvm_x86_vendor_exit() angle.
> From 0269f0ee839528e8a9616738d615a096901d6185 Mon Sep 17 00:00:00 2001
> From: Chao Gao <chao.gao@intel.com>
> Date: Fri, 7 Nov 2025 00:10:28 -0800
> Subject: [PATCH] KVM: x86: Allocate/free user_return_msrs at kvm.ko
> (un)loading time
>
> Move the user_return_msrs allocation/free from vendor module (kvm-intel.ko
> and kvm-amd.ko) (un)load time to kvm.ko (un)load time, to make it less risky
> to access user_return_msrs from kvm.ko. Tying the lifetime of
> user_return_msrs to the vendor modules makes every access to
> user_return_msrs prone to use-after-free, as the vendor modules may be
> unloaded at any time.
>
> kvm_nr_uret_msrs is still reset to 0 when a vendor module is loaded, to
> clear out the user-return MSR list configured by the previous vendor module.
Hmm, the other idea would be to stash the owner in kvm_x86_ops, and then do:

	__module_get(kvm_x86_ops.owner);

LOL, but that's even more flawed from a certain perspective, because
kvm_x86_ops.owner could be completely stale, especially if this races with
kvm_x86_vendor_exit().
> +static void __exit kvm_free_user_return_msrs(void)
>  {
>  	int cpu;
> 
> @@ -10044,13 +10043,11 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>  		return -ENOMEM;
>  	}
> 
> -	r = kvm_init_user_return_msrs();
> -	if (r)
> -		goto out_free_x86_emulator_cache;
> +	kvm_nr_uret_msrs = 0;
For maximum paranoia, we should zero at exit() and WARN at init().
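E.g. something like this (completely untested):

	/* in kvm_x86_vendor_init(), replacing the "kvm_nr_uret_msrs = 0;" reset */
	WARN_ON_ONCE(kvm_nr_uret_msrs);

	/* in kvm_x86_vendor_exit() */
	kvm_nr_uret_msrs = 0;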
>  	r = kvm_mmu_vendor_module_init();
>  	if (r)
> -		goto out_free_percpu;
> +		goto out_free_x86_emulator_cache;
> 
>  	kvm_caps.supported_vm_types = BIT(KVM_X86_DEFAULT_VM);
>  	kvm_caps.supported_mce_cap = MCG_CTL_P | MCG_SER_P;
> @@ -10148,8 +10145,6 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>  	kvm_x86_call(hardware_unsetup)();
>  out_mmu_exit:
>  	kvm_mmu_vendor_module_exit();
> -out_free_percpu:
> -	kvm_free_user_return_msrs();
>  out_free_x86_emulator_cache:
>  	kmem_cache_destroy(x86_emulator_cache);
>  	return r;
> @@ -10178,7 +10173,6 @@ void kvm_x86_vendor_exit(void)
>  #endif
>  	kvm_x86_call(hardware_unsetup)();
>  	kvm_mmu_vendor_module_exit();
> -	kvm_free_user_return_msrs();
>  	kmem_cache_destroy(x86_emulator_cache);
>  #ifdef CONFIG_KVM_XEN
>  	static_key_deferred_flush(&kvm_xen_enabled);
> @@ -14361,8 +14355,14 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
> 
>  static int __init kvm_x86_init(void)
>  {
> +	int r;
> +
>  	kvm_init_xstate_sizes();
> 
> +	r = kvm_init_user_return_msrs();
> +	if (r)
Rather than dynamically allocate the array of structures, we can "statically"
allocate it when the module is loaded.
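E.g. (untested), with the per-CPU accessors tweaked to take the address of
the variable instead of chasing the pointer:

	static DEFINE_PER_CPU(struct kvm_user_return_msrs, user_return_msrs);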
I'll post this as a proper patch (with my massaging) once I've tested.
Thanks much!
(and I forgot to hit "send", so this is going to show up after the patch, sorry)