[PATCH] KVM: x86: switch hugepage recovery thread to vhost_task
Posted by Paolo Bonzini 2 weeks, 1 day ago
kvm_vm_create_worker_thread() is meant to be used for kthreads that
can consume significant amounts of CPU time on behalf of a VM or in
response to how the VM behaves (for example how it accesses its memory).
Therefore it wants to charge the CPU time consumed by that work to
the VM's container.

However, because of these threads, cgroups which have kvm instances inside
never complete freezing.  This can be trivially reproduced:

  root@test ~# mkdir /sys/fs/cgroup/test
  root@test ~# echo $fish_pid > /sys/fs/cgroup/test/cgroup.procs
  root@test ~# qemu-system-x86_64 --nographic -enable-kvm

and in another terminal:

  root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
  root@test ~# cat /sys/fs/cgroup/test/cgroup.events
  populated 1
  frozen 0

The cgroup freezing happens in the signal delivery path, but the
kvm_vm_worker_thread() kthread never calls into the signal delivery path
even though it joins non-root cgroups, so it never gets frozen. Because the cgroup
freezer determines whether a given cgroup is frozen by comparing the number
of frozen threads to the total number of threads in the cgroup, the cgroup
never becomes frozen and users waiting for the state transition may hang
indefinitely.

Since the worker kthread is tied to a user process, it's better if
it behaves similarly to user tasks as much as possible, including
being able to send SIGSTOP and SIGCONT.  In fact, vhost_task is all
that kvm_vm_create_worker_thread() wanted to be and more: not only it
inherits the userspace process's cgroups, it has other niceties like
being parented properly in the process tree.  Use it instead of the
homegrown alternative.
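
For reference, the conversion boils down to the following pattern (just a
sketch; the real code is in the mmu.c hunks below):

  /* Worker: return true to be invoked again, false if there is nothing to do. */
  static bool kvm_nx_huge_page_recovery_worker(void *data);
  /* Invoked if the owning process is killed. */
  static void kvm_nx_huge_page_recovery_worker_kill(void *data);

  /* VM init */
  kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
          kvm_nx_huge_page_recovery_worker,
          kvm_nx_huge_page_recovery_worker_kill,
          kvm, "kvm-nx-lpage-recovery");
  if (!kvm->arch.nx_huge_page_recovery_thread)
          return -ENOMEM;
  vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);

  /* parameter update */
  vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);

  /* VM teardown */
  vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);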

(Commit message based on emails from Tejun).

Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   4 +-
 arch/x86/kvm/Kconfig            |   1 +
 arch/x86/kvm/mmu/mmu.c          |  67 +++++++++++----------
 include/linux/kvm_host.h        |   6 --
 virt/kvm/kvm_main.c             | 103 --------------------------------
 5 files changed, 39 insertions(+), 142 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..d6657cc0fe6b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -26,6 +26,7 @@
 #include <linux/irqbypass.h>
 #include <linux/hyperv.h>
 #include <linux/kfifo.h>
+#include <linux/sched/vhost_task.h>
 
 #include <asm/apic.h>
 #include <asm/pvclock-abi.h>
@@ -1443,7 +1444,8 @@ struct kvm_arch {
 	bool sgx_provisioning_allowed;
 
 	struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter;
-	struct task_struct *nx_huge_page_recovery_thread;
+	struct vhost_task *nx_huge_page_recovery_thread;
+	u64 nx_huge_page_next;
 
 #ifdef CONFIG_X86_64
 	/* The number of TDP MMU pages across all roots. */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index f09f13c01c6b..b387d61af44f 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -29,6 +29,7 @@ config KVM_X86
 	select HAVE_KVM_IRQ_BYPASS
 	select HAVE_KVM_IRQ_ROUTING
 	select HAVE_KVM_READONLY_MEM
+	select VHOST_TASK
 	select KVM_ASYNC_PF
 	select USER_RETURN_NOTIFIER
 	select KVM_MMIO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8e853a5fc867..d5af4f8c5a6a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7281,7 +7281,7 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 			kvm_mmu_zap_all_fast(kvm);
 			mutex_unlock(&kvm->slots_lock);
 
-			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
+			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
 		}
 		mutex_unlock(&kvm_lock);
 	}
@@ -7427,7 +7427,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 		mutex_lock(&kvm_lock);
 
 		list_for_each_entry(kvm, &vm_list, vm_list)
-			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
+			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
 
 		mutex_unlock(&kvm_lock);
 	}
@@ -7530,62 +7530,65 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
-static long get_nx_huge_page_recovery_timeout(u64 start_time)
+#define NX_HUGE_PAGE_DISABLED (-1)
+
+static u64 get_nx_huge_page_recovery_next(void)
 {
 	bool enabled;
 	uint period;
 
 	enabled = calc_nx_huge_pages_recovery_period(&period);
 
-	return enabled ? start_time + msecs_to_jiffies(period) - get_jiffies_64()
-		       : MAX_SCHEDULE_TIMEOUT;
+	return enabled ? get_jiffies_64() + msecs_to_jiffies(period)
+		: NX_HUGE_PAGE_DISABLED;
 }
 
-static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
+static void kvm_nx_huge_page_recovery_worker_kill(void *data)
 {
-	u64 start_time;
+}
+
+static bool kvm_nx_huge_page_recovery_worker(void *data)
+{
+	struct kvm *kvm = data;
 	long remaining_time;
 
-	while (true) {
-		start_time = get_jiffies_64();
-		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
+	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
+		return false;
 
-		set_current_state(TASK_INTERRUPTIBLE);
-		while (!kthread_should_stop() && remaining_time > 0) {
-			schedule_timeout(remaining_time);
-			remaining_time = get_nx_huge_page_recovery_timeout(start_time);
-			set_current_state(TASK_INTERRUPTIBLE);
-		}
-
-		set_current_state(TASK_RUNNING);
-
-		if (kthread_should_stop())
-			return 0;
-
-		kvm_recover_nx_huge_pages(kvm);
+	remaining_time = kvm->arch.nx_huge_page_next - get_jiffies_64();
+	if (remaining_time > 0) {
+		schedule_timeout(remaining_time);
+		/* check for signals and come back */
+		return true;
 	}
+
+	__set_current_state(TASK_RUNNING);
+	kvm_recover_nx_huge_pages(kvm);
+	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	return true;
 }
 
 int kvm_mmu_post_init_vm(struct kvm *kvm)
 {
-	int err;
-
 	if (nx_hugepage_mitigation_hard_disabled)
 		return 0;
 
-	err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
-					  "kvm-nx-lpage-recovery",
-					  &kvm->arch.nx_huge_page_recovery_thread);
-	if (!err)
-		kthread_unpark(kvm->arch.nx_huge_page_recovery_thread);
+	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
+		kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
+		kvm, "kvm-nx-lpage-recovery");
+	
+	if (!kvm->arch.nx_huge_page_recovery_thread)
+		return -ENOMEM;
 
-	return err;
+	vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
+	return 0;
 }
 
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 {
 	if (kvm->arch.nx_huge_page_recovery_thread)
-		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
+		vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 45be36e5285f..85fe9d0ebb91 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2382,12 +2382,6 @@ static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE */
 
-typedef int (*kvm_vm_thread_fn_t)(struct kvm *kvm, uintptr_t data);
-
-int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
-				uintptr_t data, const char *name,
-				struct task_struct **thread_ptr);
-
 #ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
 static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6ca7a1045bbb..279e03029ce1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6561,106 +6561,3 @@ void kvm_exit(void)
 	kvm_irqfd_exit();
 }
 EXPORT_SYMBOL_GPL(kvm_exit);
-
-struct kvm_vm_worker_thread_context {
-	struct kvm *kvm;
-	struct task_struct *parent;
-	struct completion init_done;
-	kvm_vm_thread_fn_t thread_fn;
-	uintptr_t data;
-	int err;
-};
-
-static int kvm_vm_worker_thread(void *context)
-{
-	/*
-	 * The init_context is allocated on the stack of the parent thread, so
-	 * we have to locally copy anything that is needed beyond initialization
-	 */
-	struct kvm_vm_worker_thread_context *init_context = context;
-	struct task_struct *parent;
-	struct kvm *kvm = init_context->kvm;
-	kvm_vm_thread_fn_t thread_fn = init_context->thread_fn;
-	uintptr_t data = init_context->data;
-	int err;
-
-	err = kthread_park(current);
-	/* kthread_park(current) is never supposed to return an error */
-	WARN_ON(err != 0);
-	if (err)
-		goto init_complete;
-
-	err = cgroup_attach_task_all(init_context->parent, current);
-	if (err) {
-		kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
-			__func__, err);
-		goto init_complete;
-	}
-
-	set_user_nice(current, task_nice(init_context->parent));
-
-init_complete:
-	init_context->err = err;
-	complete(&init_context->init_done);
-	init_context = NULL;
-
-	if (err)
-		goto out;
-
-	/* Wait to be woken up by the spawner before proceeding. */
-	kthread_parkme();
-
-	if (!kthread_should_stop())
-		err = thread_fn(kvm, data);
-
-out:
-	/*
-	 * Move kthread back to its original cgroup to prevent it lingering in
-	 * the cgroup of the VM process, after the latter finishes its
-	 * execution.
-	 *
-	 * kthread_stop() waits on the 'exited' completion condition which is
-	 * set in exit_mm(), via mm_release(), in do_exit(). However, the
-	 * kthread is removed from the cgroup in the cgroup_exit() which is
-	 * called after the exit_mm(). This causes the kthread_stop() to return
-	 * before the kthread actually quits the cgroup.
-	 */
-	rcu_read_lock();
-	parent = rcu_dereference(current->real_parent);
-	get_task_struct(parent);
-	rcu_read_unlock();
-	cgroup_attach_task_all(parent, current);
-	put_task_struct(parent);
-
-	return err;
-}
-
-int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
-				uintptr_t data, const char *name,
-				struct task_struct **thread_ptr)
-{
-	struct kvm_vm_worker_thread_context init_context = {};
-	struct task_struct *thread;
-
-	*thread_ptr = NULL;
-	init_context.kvm = kvm;
-	init_context.parent = current;
-	init_context.thread_fn = thread_fn;
-	init_context.data = data;
-	init_completion(&init_context.init_done);
-
-	thread = kthread_run(kvm_vm_worker_thread, &init_context,
-			     "%s-%d", name, task_pid_nr(current));
-	if (IS_ERR(thread))
-		return PTR_ERR(thread);
-
-	/* kthread_run is never supposed to return NULL */
-	WARN_ON(thread == NULL);
-
-	wait_for_completion(&init_context.init_done);
-
-	if (!init_context.err)
-		*thread_ptr = thread;
-
-	return init_context.err;
-}
-- 
2.43.5
Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task
Posted by Michal Koutný 1 week, 1 day ago
On Fri, Nov 08, 2024 at 08:07:37AM GMT, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Since the worker kthread is tied to a user process, it's better if
> it behaves similarly to user tasks as much as possible, including
> being able to send SIGSTOP and SIGCONT.

Do you mean s/send/receive/?

Consequently, it's OK if a (possibly unprivileged) user stops this
thread forever (they only harm themselves, not the rest of the system),
correct?


> In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted
> to be and more: not only it inherits the userspace process's cgroups,
> it has other niceties like being parented properly in the process
> tree.  Use it instead of the homegrown alternative.

It is nice indeed.
I think the bugs we saw are not serious enough to warrant
Fixes: c57c80467f90e ("kvm: Add helper function for creating VM worker threads")
.
(But I'm posting it here so that I can find the reference later.)

Thanks,
Michal
Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task
Posted by Paolo Bonzini 5 days, 8 hours ago
On Fri, Nov 15, 2024 at 5:59 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Nov 08, 2024 at 08:07:37AM GMT, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Since the worker kthread is tied to a user process, it's better if
> > it behaves similarly to user tasks as much as possible, including
> > being able to send SIGSTOP and SIGCONT.
>
> Do you mean s/send/receive/?

I mean being able to send them to the threads and have them take effect.

> Consequently, it's OK if a (possibly unprivileged) user stops this
> thread forever (they only harm themselves, not the rest of the system),
> correct?

Yes, they will run with fewer huge pages and worse TLB performance.

Paolo

> > In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted
> > to be and more: not only it inherits the userspace process's cgroups,
> > it has other niceties like being parented properly in the process
> > tree.  Use it instead of the homegrown alternative.
>
> It is nice indeed.
> I think the bugs we saw are not serious enough to warrant
> Fixes: c57c80467f90e ("kvm: Add helper function for creating VM worker threads")
> .
> (But I'm posting it here so that I can find the reference later.)
Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task
Posted by Sean Christopherson 1 week, 2 days ago
On Fri, Nov 08, 2024, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8e853a5fc867..d5af4f8c5a6a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7281,7 +7281,7 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
>  			kvm_mmu_zap_all_fast(kvm);
>  			mutex_unlock(&kvm->slots_lock);
>  
> -			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
> +			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
>  		}
>  		mutex_unlock(&kvm_lock);
>  	}
> @@ -7427,7 +7427,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
>  		mutex_lock(&kvm_lock);
>  
>  		list_for_each_entry(kvm, &vm_list, vm_list)
> -			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
> +			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
>  
>  		mutex_unlock(&kvm_lock);
>  	}
> @@ -7530,62 +7530,65 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
>  	srcu_read_unlock(&kvm->srcu, rcu_idx);
>  }
>  
> -static long get_nx_huge_page_recovery_timeout(u64 start_time)
> +#define NX_HUGE_PAGE_DISABLED (-1)

I don't see any point in using -1.  That's more legal (though still impossible
and absurd) than a deadline of '0'.  And it's somewhat confusing because KVM
uses -1 for the default nx_huge_pages value to indicate "enable the NX huge page
mitigation if the CPU is vulnerable to L1TF", not "disable the mitigation".

> +static u64 get_nx_huge_page_recovery_next(void)
>  {
>  	bool enabled;
>  	uint period;
>  
>  	enabled = calc_nx_huge_pages_recovery_period(&period);
>  
> -	return enabled ? start_time + msecs_to_jiffies(period) - get_jiffies_64()
> -		       : MAX_SCHEDULE_TIMEOUT;
> +	return enabled ? get_jiffies_64() + msecs_to_jiffies(period)
> +		: NX_HUGE_PAGE_DISABLED;

Please align the '?' and ':' to show that they are related paths of the ternary
operator.  Moot point if we go without a literal '0'.

>  }
>  
> -static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
> +static void kvm_nx_huge_page_recovery_worker_kill(void *data)
>  {
> -	u64 start_time;
> +}
> +
> +static bool kvm_nx_huge_page_recovery_worker(void *data)
> +{
> +	struct kvm *kvm = data;
>  	long remaining_time;
>  
> -	while (true) {
> -		start_time = get_jiffies_64();
> -		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
> +	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
> +		return false;

The "next" concept is broken.  Once KVM sees NX_HUGE_PAGE_DISABLED for a given VM,
KVM will never re-evaluate nx_huge_page_next.  Similarly, if the recovery period
and/or ratio changes, KVM won't recompute the "next" time until the current timeout
has expired.

I fiddled around with various ideas, but I don't see a better solution than something
along the lines of KVM's request system, e.g. set a bool to indicate the params
changed, and sprinkle smp_{r,w}mb() barriers to ensure the vhost task sees the
new params.

FWIW, I also found "next" to be confusing.  How about "deadline"? KVM uses that
terminology for the APIC timer, i.e. it's familiar, intuitive, and accurate(ish).

Something like this as fixup?  (comments would be nice)

---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu/mmu.c          | 34 +++++++++++++++++++++------------
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 72f3bcfc54d7..e9fb8b9a9c2b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1444,7 +1444,8 @@ struct kvm_arch {
 
 	struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter;
 	struct vhost_task *nx_huge_page_recovery_thread;
-	u64 nx_huge_page_next;
+	u64 nx_huge_page_deadline;
+	bool nx_huge_page_params_changed;
 
 #ifdef CONFIG_X86_64
 	/* The number of TDP MMU pages across all roots. */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d0c2d9d2588f..acfa14d4248b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7102,6 +7102,13 @@ static void mmu_destroy_caches(void)
 	kmem_cache_destroy(mmu_page_header_cache);
 }
 
+static void mmu_wake_nx_huge_page_task(struct kvm *kvm)
+{
+	smp_wmb();
+	WRITE_ONCE(kvm->arch.nx_huge_page_params_changed, true);
+	vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
+}
+
 static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
 {
 	if (nx_hugepage_mitigation_hard_disabled)
@@ -7162,7 +7169,7 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 			kvm_mmu_zap_all_fast(kvm);
 			mutex_unlock(&kvm->slots_lock);
 
-			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
+			mmu_wake_nx_huge_page_task(kvm);
 		}
 		mutex_unlock(&kvm_lock);
 	}
@@ -7291,7 +7298,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 		mutex_lock(&kvm_lock);
 
 		list_for_each_entry(kvm, &vm_list, vm_list)
-			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
+			mmu_wake_nx_huge_page_task(kvm);
 
 		mutex_unlock(&kvm_lock);
 	}
@@ -7394,17 +7401,14 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
-#define NX_HUGE_PAGE_DISABLED (-1)
-
-static u64 get_nx_huge_page_recovery_next(void)
+static u64 get_nx_huge_page_recovery_deadline(void)
 {
 	bool enabled;
 	uint period;
 
 	enabled = calc_nx_huge_pages_recovery_period(&period);
 
-	return enabled ? get_jiffies_64() + msecs_to_jiffies(period)
-		: NX_HUGE_PAGE_DISABLED;
+	return enabled ? get_jiffies_64() + msecs_to_jiffies(period) : 0;
 }
 
 static void kvm_nx_huge_page_recovery_worker_kill(void *data)
@@ -7416,10 +7420,16 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
 	struct kvm *kvm = data;
 	long remaining_time;
 
-	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
+	if (READ_ONCE(kvm->arch.nx_huge_page_params_changed)) {
+		smp_rmb();
+		WRITE_ONCE(kvm->arch.nx_huge_page_params_changed, false);
+		kvm->arch.nx_huge_page_deadline = get_nx_huge_page_recovery_deadline();
+	}
+
+	if (!kvm->arch.nx_huge_page_deadline)
 		return false;
 
-	remaining_time = kvm->arch.nx_huge_page_next - get_jiffies_64();
+	remaining_time = kvm->arch.nx_huge_page_deadline - get_jiffies_64();
 	if (remaining_time > 0) {
 		schedule_timeout(remaining_time);
 		/* check for signals and come back */
@@ -7428,7 +7438,7 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
 
 	__set_current_state(TASK_RUNNING);
 	kvm_recover_nx_huge_pages(kvm);
-	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	kvm->arch.nx_huge_page_deadline = get_nx_huge_page_recovery_deadline();
 	return true;
 }
 
@@ -7437,11 +7447,11 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
 	if (nx_hugepage_mitigation_hard_disabled)
 		return 0;
 
-	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	WRITE_ONCE(kvm->arch.nx_huge_page_params_changed, true);
 	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
 		kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
 		kvm, "kvm-nx-lpage-recovery");
-	
+
 	if (!kvm->arch.nx_huge_page_recovery_thread)
 		return -ENOMEM;
 

base-commit: 922a5630cd31e4414f964aa64f45a5884f40188c
--
Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task
Posted by Paolo Bonzini 1 week, 2 days ago
On 11/14/24 00:56, Sean Christopherson wrote:
>> +static bool kvm_nx_huge_page_recovery_worker(void *data)
>> +{
>> +	struct kvm *kvm = data;
>>   	long remaining_time;
>>   
>> -	while (true) {
>> -		start_time = get_jiffies_64();
>> -		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
>> +	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
>> +		return false;
> 
> The "next" concept is broken.  Once KVM sees NX_HUGE_PAGE_DISABLED for a given VM,
> KVM will never re-evaluate nx_huge_page_next.  Similarly, if the recovery period
> and/or ratio changes, KVM won't recompute the "next" time until the current timeout
> has expired.
> 
> I fiddled around with various ideas, but I don't see a better solution than something
> along the lines of KVM's request system, e.g. set a bool to indicate the params
> changed, and sprinkle smp_{r,w}mb() barriers to ensure the vhost task sees the
> new params.

"next" is broken, but there is a much better way to fix it.  You just
track the *last* time that the recovery ran.  This is also better
behaved when you flip recovery back and forth to disabled and back
to enabled: if your recovery period is 1 minute, it will run the
next recovery after 1 minute independent of how many times you flipped
the parameter.

This also fits much better with the "restart function after
schedule()" model that vhost_task.c requires, and it gets rid of
get_nx_huge_page_recovery_whatever() completely.  It's also silly how
close the code is to the broken v1, which is why I'm sending it below
instead of resending a large patch.
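
For anyone who hasn't looked at vhost_task.c, the model it imposes on the
worker function is roughly the following (a simplified sketch from memory,
not the actual code; the helper names are invented):

  /* Rough sketch of the loop vhost_task.c runs around the worker function. */
  for (;;) {
          /*
           * Pending signals are handled here, which is what makes SIGSTOP
           * and the cgroup freezer work on this task.
           */
          if (signal_pending(current) && vhost_task_got_fatal_signal())  /* hypothetical */
                  break;

          set_current_state(TASK_INTERRUPTIBLE);
          if (vhost_task_should_stop())                                  /* hypothetical */
                  break;

          /*
           * The worker is entered with TASK_INTERRUPTIBLE already set; it may
           * sleep via schedule_timeout() but must return rather than loop.
           * true = call me again, false = sleep until vhost_task_wake().
           */
          if (!fn(data))
                  schedule();
  }

This is also why the KVM worker below can call schedule_timeout() directly
and explicitly sets TASK_RUNNING before doing the actual recovery pass.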

Paolo

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 72f3bcfc54d7..e159e44a6a1b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1444,7 +1444,7 @@ struct kvm_arch {
  
  	struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter;
  	struct vhost_task *nx_huge_page_recovery_thread;
-	u64 nx_huge_page_next;
+	u64 nx_huge_page_last;
  
  #ifdef CONFIG_X86_64
  	/* The number of TDP MMU pages across all roots. */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d0c2d9d2588f..22e7ad235123 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7394,19 +7394,6 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
  	srcu_read_unlock(&kvm->srcu, rcu_idx);
  }
  
-#define NX_HUGE_PAGE_DISABLED (-1)
-
-static u64 get_nx_huge_page_recovery_next(void)
-{
-	bool enabled;
-	uint period;
-
-	enabled = calc_nx_huge_pages_recovery_period(&period);
-
-	return enabled ? get_jiffies_64() + msecs_to_jiffies(period)
-		: NX_HUGE_PAGE_DISABLED;
-}
-
  static void kvm_nx_huge_page_recovery_worker_kill(void *data)
  {
  }
@@ -7414,12 +7401,16 @@ static void kvm_nx_huge_page_recovery_worker_kill(void *data)
  static bool kvm_nx_huge_page_recovery_worker(void *data)
  {
  	struct kvm *kvm = data;
+	bool enabled;
+	uint period;
  	long remaining_time;
  
-	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
+	enabled = calc_nx_huge_pages_recovery_period(&period);
+	if (!enabled)
  		return false;
  
-	remaining_time = kvm->arch.nx_huge_page_next - get_jiffies_64();
+	remaining_time = kvm->arch.nx_huge_page_last + msecs_to_jiffies(period)
+		- get_jiffies_64();
  	if (remaining_time > 0) {
  		schedule_timeout(remaining_time);
  		/* check for signals and come back */
@@ -7428,7 +7419,7 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
  
  	__set_current_state(TASK_RUNNING);
  	kvm_recover_nx_huge_pages(kvm);
-	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	kvm->arch.nx_huge_page_last = get_jiffies_64();
  	return true;
  }
  
@@ -7437,11 +7428,11 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
  	if (nx_hugepage_mitigation_hard_disabled)
  		return 0;
  
-	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	kvm->arch.nx_huge_page_last = get_jiffies_64();
  	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
  		kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
  		kvm, "kvm-nx-lpage-recovery");
-	
+
  	if (!kvm->arch.nx_huge_page_recovery_thread)
  		return -ENOMEM;
Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task
Posted by Sean Christopherson 1 week, 2 days ago
On Thu, Nov 14, 2024, Paolo Bonzini wrote:
> On 11/14/24 00:56, Sean Christopherson wrote:
> > > +static bool kvm_nx_huge_page_recovery_worker(void *data)
> > > +{
> > > +	struct kvm *kvm = data;
> > >   	long remaining_time;
> > > -	while (true) {
> > > -		start_time = get_jiffies_64();
> > > -		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
> > > +	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
> > > +		return false;
> > 
> > The "next" concept is broken.  Once KVM sees NX_HUGE_PAGE_DISABLED for a given VM,
> > KVM will never re-evaluate nx_huge_page_next.  Similarly, if the recovery period
> > and/or ratio changes, KVM won't recompute the "next" time until the current timeout
> > has expired.
> > 
> > I fiddled around with various ideas, but I don't see a better solution than something
> > along the lines of KVM's request system, e.g. set a bool to indicate the params
> > changed, and sprinkle smp_{r,w}mb() barriers to ensure the vhost task sees the
> > new params.
> 
> "next" is broken, but there is a much better way to fix it.  You just
> track the *last* time that the recovery ran.  This is also better
> behaved when you flip recovery back and forth to disabled and back
> to enabled: if your recovery period is 1 minute, it will run the
> next recovery after 1 minute independent of how many times you flipped
> the parameter.

Heh, my brain was trying to get there last night, but I couldn't quite piece
things together.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task
Posted by Tejun Heo 2 weeks, 1 day ago
On Fri, Nov 08, 2024 at 08:07:37AM -0500, Paolo Bonzini wrote:
...
> Since the worker kthread is tied to a user process, it's better if
> it behaves similarly to user tasks as much as possible, including
> being able to send SIGSTOP and SIGCONT.  In fact, vhost_task is all
> that kvm_vm_create_worker_thread() wanted to be and more: not only it
> inherits the userspace process's cgroups, it has other niceties like
> being parented properly in the process tree.  Use it instead of the
> homegrown alternative.

Didn't know about vhost_task. That looks perfect. From cgroup POV:

  Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun
Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task
Posted by Luca Boccassi 2 weeks ago
On Fri, 8 Nov 2024 at 16:53, Tejun Heo <tj@kernel.org> wrote:
>
> On Fri, Nov 08, 2024 at 08:07:37AM -0500, Paolo Bonzini wrote:
> ...
> > Since the worker kthread is tied to a user process, it's better if
> > it behaves similarly to user tasks as much as possible, including
> > being able to send SIGSTOP and SIGCONT.  In fact, vhost_task is all
> > that kvm_vm_create_worker_thread() wanted to be and more: not only it
> > inherits the userspace process's cgroups, it has other niceties like
> > being parented properly in the process tree.  Use it instead of the
> > homegrown alternative.
>
> Didn't know about vhost_task. That looks perfect. From cgroup POV:
>
>   Acked-by: Tejun Heo <tj@kernel.org>
>
> Thanks.

Thanks, I tested this on my machine by applying it to kernel 6.11.5 and can
confirm the issues are gone: freezing the cgroup works, and everything
else does too.

Could you please CC stable so that it can get backported?

Tested-by: Luca Boccassi <bluca@debian.org>