KVM: x86: switch hugepage recovery thread to vhost_task

[PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Paolo Bonzini 1 year, 3 months ago

kvm_vm_create_worker_thread() is meant to be used for kthreads that
can consume significant amounts of CPU time on behalf of a VM or in
response to how the VM behaves (for example how it accesses its memory).
Therefore it wants to charge the CPU time consumed by that work to
the VM's container.

However, because of these threads, cgroups which have kvm instances inside
never complete freezing.  This can be trivially reproduced:

  root@test ~# mkdir /sys/fs/cgroup/test
  root@test ~# echo $fish_pid > /sys/fs/cgroup/test/cgroup.procs
  root@test ~# qemu-system-x86_64 --nographic -enable-kvm

and in another terminal:

  root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
  root@test ~# cat /sys/fs/cgroup/test/cgroup.events
  populated 1
  frozen 0

The cgroup freezing happens in the signal delivery path but
kvm_vm_worker_thread() thread never call into the signal delivery path while
joining non-root cgroups, so they never get frozen. Because the cgroup
freezer determines whether a given cgroup is frozen by comparing the number
of frozen threads to the total number of threads in the cgroup, the cgroup
never becomes frozen and users waiting for the state transition may hang
indefinitely.

Since the worker kthread is tied to a user process, it's better if
it behaves similarly to user tasks as much as possible, including
being able to send SIGSTOP and SIGCONT.  In fact, vhost_task is all
that kvm_vm_create_worker_thread() wanted to be and more: not only it
inherits the userspace process's cgroups, it has other niceties like
being parented properly in the process tree.  Use it instead of the
homegrown alternative.

(Commit message based on emails from Tejun).

Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   4 +-
 arch/x86/kvm/Kconfig            |   1 +
 arch/x86/kvm/mmu/mmu.c          |  67 +++++++++++----------
 include/linux/kvm_host.h        |   6 --
 virt/kvm/kvm_main.c             | 103 --------------------------------
 5 files changed, 39 insertions(+), 142 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..d6657cc0fe6b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -26,6 +26,7 @@
 #include <linux/irqbypass.h>
 #include <linux/hyperv.h>
 #include <linux/kfifo.h>
+#include <linux/sched/vhost_task.h>
 
 #include <asm/apic.h>
 #include <asm/pvclock-abi.h>
@@ -1443,7 +1444,8 @@ struct kvm_arch {
 	bool sgx_provisioning_allowed;
 
 	struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter;
-	struct task_struct *nx_huge_page_recovery_thread;
+	struct vhost_task *nx_huge_page_recovery_thread;
+	u64 nx_huge_page_next;
 
 #ifdef CONFIG_X86_64
 	/* The number of TDP MMU pages across all roots. */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index f09f13c01c6b..b387d61af44f 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -29,6 +29,7 @@ config KVM_X86
 	select HAVE_KVM_IRQ_BYPASS
 	select HAVE_KVM_IRQ_ROUTING
 	select HAVE_KVM_READONLY_MEM
+	select VHOST_TASK
 	select KVM_ASYNC_PF
 	select USER_RETURN_NOTIFIER
 	select KVM_MMIO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8e853a5fc867..d5af4f8c5a6a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7281,7 +7281,7 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 			kvm_mmu_zap_all_fast(kvm);
 			mutex_unlock(&kvm->slots_lock);
 
-			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
+			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
 		}
 		mutex_unlock(&kvm_lock);
 	}
@@ -7427,7 +7427,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 		mutex_lock(&kvm_lock);
 
 		list_for_each_entry(kvm, &vm_list, vm_list)
-			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
+			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
 
 		mutex_unlock(&kvm_lock);
 	}
@@ -7530,62 +7530,65 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
-static long get_nx_huge_page_recovery_timeout(u64 start_time)
+#define NX_HUGE_PAGE_DISABLED (-1)
+
+static u64 get_nx_huge_page_recovery_next(void)
 {
 	bool enabled;
 	uint period;
 
 	enabled = calc_nx_huge_pages_recovery_period(&period);
 
-	return enabled ? start_time + msecs_to_jiffies(period) - get_jiffies_64()
-		       : MAX_SCHEDULE_TIMEOUT;
+	return enabled ? get_jiffies_64() + msecs_to_jiffies(period)
+		: NX_HUGE_PAGE_DISABLED;
 }
 
-static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
+static void kvm_nx_huge_page_recovery_worker_kill(void *data)
 {
-	u64 start_time;
+}
+
+static bool kvm_nx_huge_page_recovery_worker(void *data)
+{
+	struct kvm *kvm = data;
 	long remaining_time;
 
-	while (true) {
-		start_time = get_jiffies_64();
-		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
+	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
+		return false;
 
-		set_current_state(TASK_INTERRUPTIBLE);
-		while (!kthread_should_stop() && remaining_time > 0) {
-			schedule_timeout(remaining_time);
-			remaining_time = get_nx_huge_page_recovery_timeout(start_time);
-			set_current_state(TASK_INTERRUPTIBLE);
-		}
-
-		set_current_state(TASK_RUNNING);
-
-		if (kthread_should_stop())
-			return 0;
-
-		kvm_recover_nx_huge_pages(kvm);
+	remaining_time = kvm->arch.nx_huge_page_next - get_jiffies_64();
+	if (remaining_time > 0) {
+		schedule_timeout(remaining_time);
+		/* check for signals and come back */
+		return true;
 	}
+
+	__set_current_state(TASK_RUNNING);
+	kvm_recover_nx_huge_pages(kvm);
+	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	return true;
 }
 
 int kvm_mmu_post_init_vm(struct kvm *kvm)
 {
-	int err;
-
 	if (nx_hugepage_mitigation_hard_disabled)
 		return 0;
 
-	err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
-					  "kvm-nx-lpage-recovery",
-					  &kvm->arch.nx_huge_page_recovery_thread);
-	if (!err)
-		kthread_unpark(kvm->arch.nx_huge_page_recovery_thread);
+	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
+		kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
+		kvm, "kvm-nx-lpage-recovery");
+	
+	if (!kvm->arch.nx_huge_page_recovery_thread)
+		return -ENOMEM;
 
-	return err;
+	vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
+	return 0;
 }
 
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 {
 	if (kvm->arch.nx_huge_page_recovery_thread)
-		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
+		vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 45be36e5285f..85fe9d0ebb91 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2382,12 +2382,6 @@ static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE */
 
-typedef int (*kvm_vm_thread_fn_t)(struct kvm *kvm, uintptr_t data);
-
-int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
-				uintptr_t data, const char *name,
-				struct task_struct **thread_ptr);
-
 #ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
 static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6ca7a1045bbb..279e03029ce1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6561,106 +6561,3 @@ void kvm_exit(void)
 	kvm_irqfd_exit();
 }
 EXPORT_SYMBOL_GPL(kvm_exit);
-
-struct kvm_vm_worker_thread_context {
-	struct kvm *kvm;
-	struct task_struct *parent;
-	struct completion init_done;
-	kvm_vm_thread_fn_t thread_fn;
-	uintptr_t data;
-	int err;
-};
-
-static int kvm_vm_worker_thread(void *context)
-{
-	/*
-	 * The init_context is allocated on the stack of the parent thread, so
-	 * we have to locally copy anything that is needed beyond initialization
-	 */
-	struct kvm_vm_worker_thread_context *init_context = context;
-	struct task_struct *parent;
-	struct kvm *kvm = init_context->kvm;
-	kvm_vm_thread_fn_t thread_fn = init_context->thread_fn;
-	uintptr_t data = init_context->data;
-	int err;
-
-	err = kthread_park(current);
-	/* kthread_park(current) is never supposed to return an error */
-	WARN_ON(err != 0);
-	if (err)
-		goto init_complete;
-
-	err = cgroup_attach_task_all(init_context->parent, current);
-	if (err) {
-		kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
-			__func__, err);
-		goto init_complete;
-	}
-
-	set_user_nice(current, task_nice(init_context->parent));
-
-init_complete:
-	init_context->err = err;
-	complete(&init_context->init_done);
-	init_context = NULL;
-
-	if (err)
-		goto out;
-
-	/* Wait to be woken up by the spawner before proceeding. */
-	kthread_parkme();
-
-	if (!kthread_should_stop())
-		err = thread_fn(kvm, data);
-
-out:
-	/*
-	 * Move kthread back to its original cgroup to prevent it lingering in
-	 * the cgroup of the VM process, after the latter finishes its
-	 * execution.
-	 *
-	 * kthread_stop() waits on the 'exited' completion condition which is
-	 * set in exit_mm(), via mm_release(), in do_exit(). However, the
-	 * kthread is removed from the cgroup in the cgroup_exit() which is
-	 * called after the exit_mm(). This causes the kthread_stop() to return
-	 * before the kthread actually quits the cgroup.
-	 */
-	rcu_read_lock();
-	parent = rcu_dereference(current->real_parent);
-	get_task_struct(parent);
-	rcu_read_unlock();
-	cgroup_attach_task_all(parent, current);
-	put_task_struct(parent);
-
-	return err;
-}
-
-int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
-				uintptr_t data, const char *name,
-				struct task_struct **thread_ptr)
-{
-	struct kvm_vm_worker_thread_context init_context = {};
-	struct task_struct *thread;
-
-	*thread_ptr = NULL;
-	init_context.kvm = kvm;
-	init_context.parent = current;
-	init_context.thread_fn = thread_fn;
-	init_context.data = data;
-	init_completion(&init_context.init_done);
-
-	thread = kthread_run(kvm_vm_worker_thread, &init_context,
-			     "%s-%d", name, task_pid_nr(current));
-	if (IS_ERR(thread))
-		return PTR_ERR(thread);
-
-	/* kthread_run is never supposed to return NULL */
-	WARN_ON(thread == NULL);
-
-	wait_for_completion(&init_context.init_done);
-
-	if (!init_context.err)
-		*thread_ptr = thread;
-
-	return init_context.err;
-}
-- 
2.43.5

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Keith Busch 1 year, 1 month ago

On Fri, Nov 08, 2024 at 08:07:37AM -0500, Paolo Bonzini wrote:
> Since the worker kthread is tied to a user process, it's better if
> it behaves similarly to user tasks as much as possible, including
> being able to send SIGSTOP and SIGCONT.  In fact, vhost_task is all
> that kvm_vm_create_worker_thread() wanted to be and more: not only it
> inherits the userspace process's cgroups, it has other niceties like
> being parented properly in the process tree.  Use it instead of the
> homegrown alternative.

This appears to be causing a user space regression. The library
"minijail" is used by virtual machine manager "crossvm". crossvm uses
minijail to fork processes, but the library requires the process be
single threaded. Prior to this patch, the process was single threaded,
but this change creates a relationship from the kvm thread to the user
process that fails minijail's test.

For reference, here's the affected library function reporting this
behavior change:

  https://github.com/google/minijail/blob/main/rust/minijail/src/lib.rs#L987

Reverting the patch makes it work again.

Fwiw, I think the single threaded check may be misguided, but just doing
my due diligence to report the user space interaction.

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Paolo Bonzini 1 year, 1 month ago

On 12/19/24 18:32, Keith Busch wrote:
> This appears to be causing a user space regression. The library
> "minijail" is used by virtual machine manager "crossvm". crossvm uses
> minijail to fork processes, but the library requires the process be
> single threaded. Prior to this patch, the process was single threaded,
> but this change creates a relationship from the kvm thread to the user
> process that fails minijail's test.

Thanks for the report.

The minijail code has a flag that's documented like this:

     /// Disables the check that prevents forking in a multithreaded environment.
     /// This is only safe if the child process calls exec immediately after
     /// forking. The state of locks, and whether or not they will unlock
     /// is undefined. Additionally, objects allocated on other threads that
     /// expect to be dropped when those threads cease execution will not be
     /// dropped.
     /// Thus, nothing should be called that relies on shared synchronization
     /// primitives referenced outside of the current thread. The safest
     /// way to use this is to immediately exec in the child.

Is crosvm trying to do anything but exec?  If not, it should probably use the
flag.

> Fwiw, I think the single threaded check may be misguided, but just doing
> my due diligence to report the user space interaction.

I don't think the check itself is misguided, but if it can be improved to
only look at usermode threads somehow, that would be better.  In general
Linux is moving towards properly tracking the parent-child relationship
of kernel processes (for vhost and io_uring, and now for KVM).

Paolo

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Keith Busch 1 year, 1 month ago

On Thu, Dec 19, 2024 at 06:42:09PM +0100, Paolo Bonzini wrote:
> On 12/19/24 18:32, Keith Busch wrote:
> > This appears to be causing a user space regression. The library
> > "minijail" is used by virtual machine manager "crossvm". crossvm uses
> > minijail to fork processes, but the library requires the process be
> > single threaded. Prior to this patch, the process was single threaded,
> > but this change creates a relationship from the kvm thread to the user
> > process that fails minijail's test.
> 
> Thanks for the report.
> 
> The minijail code has a flag that's documented like this:
> 
>     /// Disables the check that prevents forking in a multithreaded environment.
>     /// This is only safe if the child process calls exec immediately after
>     /// forking. The state of locks, and whether or not they will unlock
>     /// is undefined. Additionally, objects allocated on other threads that
>     /// expect to be dropped when those threads cease execution will not be
>     /// dropped.
>     /// Thus, nothing should be called that relies on shared synchronization
>     /// primitives referenced outside of the current thread. The safest
>     /// way to use this is to immediately exec in the child.
> 
> Is crosvm trying to do anything but exec?  If not, it should probably use the
> flag.

Good point, and I'm not sure right now. I don't think I know any crosvm
developer experts but I'm working on that to get a better explanation of
what's happening,

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Paolo Bonzini 1 year, 1 month ago

On Thu, Dec 19, 2024 at 7:09 PM Keith Busch <kbusch@kernel.org> wrote:
> > Is crosvm trying to do anything but exec?  If not, it should probably use the
> > flag.
>
> Good point, and I'm not sure right now. I don't think I know any crosvm
> developer experts but I'm working on that to get a better explanation of
> what's happening,

Ok, I found the code and it doesn't exec (e.g.
https://github.com/google/crosvm/blob/b339d3d7/src/crosvm/sys/linux/jail_warden.rs#L122),
so that's not an option. Well, if I understand correctly from a
cursory look at the code, crosvm is creating a jailed child process
early, and then spawns further jails through it; so it's just this
first process that has to cheat.

One possibility on the KVM side is to delay creating the vhost_task
until the first KVM_RUN. I don't like it but...

I think we should nevertheless add something to the status file in
procfs, that makes it easy to detect kernel tasks (PF_KTHREAD |
PF_IO_WORKER | PF_USER_WORKER).

Paolo

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Keith Busch 1 year ago

On Thu, Dec 19, 2024 at 09:30:16PM +0100, Paolo Bonzini wrote:
> On Thu, Dec 19, 2024 at 7:09 PM Keith Busch <kbusch@kernel.org> wrote:
> > > Is crosvm trying to do anything but exec?  If not, it should probably use the
> > > flag.
> >
> > Good point, and I'm not sure right now. I don't think I know any crosvm
> > developer experts but I'm working on that to get a better explanation of
> > what's happening,
> 
> Ok, I found the code and it doesn't exec (e.g.
> https://github.com/google/crosvm/blob/b339d3d7/src/crosvm/sys/linux/jail_warden.rs#L122),
> so that's not an option. Well, if I understand correctly from a
> cursory look at the code, crosvm is creating a jailed child process
> early, and then spawns further jails through it; so it's just this
> first process that has to cheat.
> 
> One possibility on the KVM side is to delay creating the vhost_task
> until the first KVM_RUN. I don't like it but...

This option is actually kind of appealing in that we don't need to
change any application side to filter out kernel tasks, as well as not
having a new kernel dependency to even report these types of tasks as
kernel threads.

I gave it a quick try. I'm not very familiar with the code here, so not
sure if this is thread safe or not, but it did successfully get crosvm
booting again.

---
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2401606db2604..422b6b06de4fe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7415,6 +7415,8 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
 {
 	if (nx_hugepage_mitigation_hard_disabled)
 		return 0;
+	if (kvm->arch.nx_huge_page_recovery_thread)
+		return 0;
 
 	kvm->arch.nx_huge_page_last = get_jiffies_64();
 	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c79a8cc57ba42..263363c46626b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11463,6 +11463,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	struct kvm_run *kvm_run = vcpu->run;
 	int r;
 
+	r = kvm_mmu_post_init_vm(vcpu->kvm);
+	if (r)
+		return r;
+
 	vcpu_load(vcpu);
 	kvm_sigset_activate(vcpu);
 	kvm_run->flags = 0;
@@ -12740,11 +12744,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	return ret;
 }
 
-int kvm_arch_post_init_vm(struct kvm *kvm)
-{
-	return kvm_mmu_post_init_vm(kvm);
-}
-
 static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 {
 	vcpu_load(vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 401439bb21e3e..a219bd2d8aec8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1596,7 +1596,6 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
 bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
 bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 bool kvm_arch_vcpu_preempted_in_kernel(struct kvm_vcpu *vcpu);
-int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 void kvm_arch_create_vm_debugfs(struct kvm *kvm);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de2c11dae2316..adacc6eaa7d9d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1065,15 +1065,6 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
 	return ret;
 }
 
-/*
- * Called after the VM is otherwise initialized, but just before adding it to
- * the vm_list.
- */
-int __weak kvm_arch_post_init_vm(struct kvm *kvm)
-{
-	return 0;
-}
-
 /*
  * Called just after removing the VM from the vm_list, but before doing any
  * other destruction.
@@ -1194,10 +1185,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	if (r)
 		goto out_err_no_debugfs;
 
-	r = kvm_arch_post_init_vm(kvm);
-	if (r)
-		goto out_err;
-
 	mutex_lock(&kvm_lock);
 	list_add(&kvm->vm_list, &vm_list);
 	mutex_unlock(&kvm_lock);
@@ -1207,8 +1194,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 
 	return kvm;
 
-out_err:
-	kvm_destroy_vm_debugfs(kvm);
 out_err_no_debugfs:
 	kvm_coalesced_mmio_free(kvm);
 out_no_coalesced_mmio:
--

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Paolo Bonzini 1 year ago

On 1/13/25 16:35, Keith Busch wrote:
>> Ok, I found the code and it doesn't exec (e.g.
>> https://github.com/google/crosvm/blob/b339d3d7/src/crosvm/sys/linux/jail_warden.rs#L122),
>> so that's not an option. Well, if I understand correctly from a
>> cursory look at the code, crosvm is creating a jailed child process
>> early, and then spawns further jails through it; so it's just this
>> first process that has to cheat.
>>
>> One possibility on the KVM side is to delay creating the vhost_task
>> until the first KVM_RUN. I don't like it but...
> 
> This option is actually kind of appealing in that we don't need to
> change any application side to filter out kernel tasks, as well as not
> having a new kernel dependency to even report these types of tasks as
> kernel threads.
> 
> I gave it a quick try. I'm not very familiar with the code here, so not
> sure if this is thread safe or not, but it did successfully get crosvm
> booting again.

That looks good to me too.  Would you like to send it with a commit 
message and SoB?

Thanks,

Paolo

> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2401606db2604..422b6b06de4fe 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7415,6 +7415,8 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
>   {
>   	if (nx_hugepage_mitigation_hard_disabled)
>   		return 0;
> +	if (kvm->arch.nx_huge_page_recovery_thread)
> +		return 0;
>   
>   	kvm->arch.nx_huge_page_last = get_jiffies_64();
>   	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c79a8cc57ba42..263363c46626b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11463,6 +11463,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>   	struct kvm_run *kvm_run = vcpu->run;
>   	int r;
>   
> +	r = kvm_mmu_post_init_vm(vcpu->kvm);
> +	if (r)
> +		return r;
> +
>   	vcpu_load(vcpu);
>   	kvm_sigset_activate(vcpu);
>   	kvm_run->flags = 0;
> @@ -12740,11 +12744,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   	return ret;
>   }
>   
> -int kvm_arch_post_init_vm(struct kvm *kvm)
> -{
> -	return kvm_mmu_post_init_vm(kvm);
> -}
> -
>   static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
>   {
>   	vcpu_load(vcpu);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 401439bb21e3e..a219bd2d8aec8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1596,7 +1596,6 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
>   bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
>   bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>   bool kvm_arch_vcpu_preempted_in_kernel(struct kvm_vcpu *vcpu);
> -int kvm_arch_post_init_vm(struct kvm *kvm);
>   void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>   void kvm_arch_create_vm_debugfs(struct kvm *kvm);
>   
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index de2c11dae2316..adacc6eaa7d9d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1065,15 +1065,6 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
>   	return ret;
>   }
>   
> -/*
> - * Called after the VM is otherwise initialized, but just before adding it to
> - * the vm_list.
> - */
> -int __weak kvm_arch_post_init_vm(struct kvm *kvm)
> -{
> -	return 0;
> -}
> -
>   /*
>    * Called just after removing the VM from the vm_list, but before doing any
>    * other destruction.
> @@ -1194,10 +1185,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>   	if (r)
>   		goto out_err_no_debugfs;
>   
> -	r = kvm_arch_post_init_vm(kvm);
> -	if (r)
> -		goto out_err;
> -
>   	mutex_lock(&kvm_lock);
>   	list_add(&kvm->vm_list, &vm_list);
>   	mutex_unlock(&kvm_lock);
> @@ -1207,8 +1194,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>   
>   	return kvm;
>   
> -out_err:
> -	kvm_destroy_vm_debugfs(kvm);
>   out_err_no_debugfs:
>   	kvm_coalesced_mmio_free(kvm);
>   out_no_coalesced_mmio:
> --
>

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Sean Christopherson 1 year ago

On Tue, Jan 14, 2025, Paolo Bonzini wrote:
> On 1/13/25 16:35, Keith Busch wrote:
> > > Ok, I found the code and it doesn't exec (e.g.
> > > https://github.com/google/crosvm/blob/b339d3d7/src/crosvm/sys/linux/jail_warden.rs#L122),
> > > so that's not an option. Well, if I understand correctly from a
> > > cursory look at the code, crosvm is creating a jailed child process
> > > early, and then spawns further jails through it; so it's just this
> > > first process that has to cheat.
> > > 
> > > One possibility on the KVM side is to delay creating the vhost_task
> > > until the first KVM_RUN. I don't like it but...
> > 
> > This option is actually kind of appealing in that we don't need to
> > change any application side to filter out kernel tasks, as well as not
> > having a new kernel dependency to even report these types of tasks as
> > kernel threads.
> > 
> > I gave it a quick try. I'm not very familiar with the code here, so not
> > sure if this is thread safe or not,

It's not.

> > but it did successfully get crosvm booting again.
> 
> That looks good to me too.  Would you like to send it with a commit message
> and SoB?

> > ---
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2401606db2604..422b6b06de4fe 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -7415,6 +7415,8 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
> >   {
> >   	if (nx_hugepage_mitigation_hard_disabled)
> >   		return 0;
> > +	if (kvm->arch.nx_huge_page_recovery_thread)
> > +		return 0;

...

> >   	kvm->arch.nx_huge_page_last = get_jiffies_64();
> >   	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index c79a8cc57ba42..263363c46626b 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11463,6 +11463,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> >   	struct kvm_run *kvm_run = vcpu->run;
> >   	int r;
> > +	r = kvm_mmu_post_init_vm(vcpu->kvm);
> > +	if (r)
> > +		return r;

The only lock held at this point is vcpu->mutex, the obvious choices for guarding
the per-VM task creation are kvm->lock or kvm->mmu_lock, but we definitely don't
want to blindly take either lock in KVM_RUN.

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Keith Busch 1 year ago

On Tue, Jan 14, 2025 at 07:06:20PM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 2401606db2604..422b6b06de4fe 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -7415,6 +7415,8 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
> > >   {
> > >   	if (nx_hugepage_mitigation_hard_disabled)
> > >   		return 0;
> > > +	if (kvm->arch.nx_huge_page_recovery_thread)
> > > +		return 0;
> 
> ...
> 
> > >   	kvm->arch.nx_huge_page_last = get_jiffies_64();
> > >   	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index c79a8cc57ba42..263363c46626b 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -11463,6 +11463,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > >   	struct kvm_run *kvm_run = vcpu->run;
> > >   	int r;
> > > +	r = kvm_mmu_post_init_vm(vcpu->kvm);
> > > +	if (r)
> > > +		return r;
> 
> The only lock held at this point is vcpu->mutex, the obvious choices for guarding
> the per-VM task creation are kvm->lock or kvm->mmu_lock, but we definitely don't
> want to blindly take either lock in KVM_RUN.

Thanks for the feedback. Would this otherwise be okay if I use a
different mechanism to ensure the vhost task creation happens only once
per kvm instance? Or are you suggesting the task creation needs to be
somewhere other than KVM_RUN?

My other initial concern was that this makes kvm_destroy_vm less
symmetrical to kvm_create_vm, but that part looks okay: the vcpu that's
being run holds a reference preventing kvm_destroy_vm from being called.

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Paolo Bonzini 1 year ago

On Wed, Jan 15, 2025 at 5:51 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Tue, Jan 14, 2025 at 07:06:20PM -0800, Sean Christopherson wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 2401606db2604..422b6b06de4fe 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -7415,6 +7415,8 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
> > > >   {
> > > >           if (nx_hugepage_mitigation_hard_disabled)
> > > >                   return 0;
> > > > + if (kvm->arch.nx_huge_page_recovery_thread)
> > > > +         return 0;
> >
> > ...
> >
> > > >           kvm->arch.nx_huge_page_last = get_jiffies_64();
> > > >           kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index c79a8cc57ba42..263363c46626b 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -11463,6 +11463,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > > >           struct kvm_run *kvm_run = vcpu->run;
> > > >           int r;
> > > > + r = kvm_mmu_post_init_vm(vcpu->kvm);
> > > > + if (r)
> > > > +         return r;
> >
> > The only lock held at this point is vcpu->mutex, the obvious choices for guarding
> > the per-VM task creation are kvm->lock or kvm->mmu_lock, but we definitely don't
> > want to blindly take either lock in KVM_RUN.
>
> Thanks for the feedback. Would this otherwise be okay if I use a
> different mechanism to ensure the vhost task creation happens only once
> per kvm instance? Or are you suggesting the task creation needs to be
> somewhere other than KVM_RUN?

You can implement something like pthread_once():

#define ONCE_NOT_STARTED 0
#define ONCE_RUNNING     1
#define ONCE_COMPLETED   2

struct once {
        atomic_t state;
        struct mutex lock;
}

static inline void __once_init(struct once *once,
    const char *name, struct lock_class_key *key)
{
        atomic_set(&once->state, ONCE_NOT_STARTED);
        __mutex_init(&once->lock, name, key);
}

#define once_init(once)                                               \
do {                                                                  \
        static struct lock_class_key __key;                           \
                                                                      \
       __once_init((once), #once, &__key);                            \
} while (0)

static inline void call_once(struct once *once, void (*cb)(struct once *))
{
        /* Pairs with atomic_set_release() below.  */
        if (atomic_read_acquire(&once->state) == ONCE_COMPLETED)
                return;

        guard(mutex)(&once->lock);
        WARN_ON(atomic_read(&once->state) == ONCE_RUNNING);
        if (atomic_read(&once->state) != ONCE_NOT_STARTED)
                return;

        atomic_set(&once->state, ONCE_RUNNING);
        cb(once);
        atomic_set_release(&once->state, ONCE_COMPLETED);
}

Where to put it I don't know.  It doesn't belong in
include/linux/once.h.  I'm okay with arch/x86/kvm/call_once.h and just
pull it with #include "call_once.h".

Paolo

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Keith Busch 1 year ago

On Wed, Jan 15, 2025 at 06:10:05PM +0100, Paolo Bonzini wrote:
> You can implement something like pthread_once():

...

> Where to put it I don't know.  It doesn't belong in
> include/linux/once.h.  I'm okay with arch/x86/kvm/call_once.h and just
> pull it with #include "call_once.h".

Thanks for the suggestion, I can work with that. As to where to put it,
I think the new 'struct once' needs to be a member of struct kvm_arch,
so I've put it in arch/x86/include/asm/.

Here's the result with that folded in. If this is okay, I'll send a v2,
and can split out the call_once as a prep patch with your attribution if
you like.

---
diff --git a/arch/x86/include/asm/kvm_call_once.h b/arch/x86/include/asm/kvm_call_once.h
new file mode 100644
index 0000000000000..451cc87084aa7
--- /dev/null
+++ b/arch/x86/include/asm/kvm_call_once.h
@@ -0,0 +1,44 @@
+#ifndef _LINUX_CALL_ONCE_H
+#define _LINUX_CALL_ONCE_H
+
+#include <linux/types.h>
+
+#define ONCE_NOT_STARTED 0
+#define ONCE_RUNNING     1
+#define ONCE_COMPLETED   2
+
+struct once {
+        atomic_t state;
+        struct mutex lock;
+};
+
+static inline void __once_init(struct once *once, const char *name,
+			       struct lock_class_key *key)
+{
+        atomic_set(&once->state, ONCE_NOT_STARTED);
+        __mutex_init(&once->lock, name, key);
+}
+
+#define once_init(once)							\
+do {									\
+	static struct lock_class_key __key;				\
+	__once_init((once), #once, &__key);				\
+} while (0)
+
+static inline void call_once(struct once *once, void (*cb)(struct once *))
+{
+        /* Pairs with atomic_set_release() below.  */
+        if (atomic_read_acquire(&once->state) == ONCE_COMPLETED)
+                return;
+
+        guard(mutex)(&once->lock);
+        WARN_ON(atomic_read(&once->state) == ONCE_RUNNING);
+        if (atomic_read(&once->state) != ONCE_NOT_STARTED)
+                return;
+
+        atomic_set(&once->state, ONCE_RUNNING);
+        cb(once);
+        atomic_set_release(&once->state, ONCE_COMPLETED);
+}
+
+#endif /* _LINUX_CALL_ONCE_H */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e159e44a6a1b6..56c79958dc9cb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -37,6 +37,7 @@
 #include <asm/kvm_page_track.h>
 #include <asm/kvm_vcpu_regs.h>
 #include <asm/hyperv-tlfs.h>
+#include <asm/kvm_call_once.h>
 #include <asm/reboot.h>
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
@@ -1445,6 +1446,7 @@ struct kvm_arch {
 	struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter;
 	struct vhost_task *nx_huge_page_recovery_thread;
 	u64 nx_huge_page_last;
+	struct once nx_once;
 
 #ifdef CONFIG_X86_64
 	/* The number of TDP MMU pages across all roots. */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2401606db2604..3c0c1f3647ceb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7411,20 +7411,28 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
 	return true;
 }
 
-int kvm_mmu_post_init_vm(struct kvm *kvm)
+static void kvm_mmu_start_lpage_recovery(struct once *once)
 {
-	if (nx_hugepage_mitigation_hard_disabled)
-		return 0;
+	struct kvm_arch *ka = container_of(once, struct kvm_arch, nx_once);
+	struct kvm *kvm = container_of(ka, struct kvm, arch);
 
 	kvm->arch.nx_huge_page_last = get_jiffies_64();
 	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
 		kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
 		kvm, "kvm-nx-lpage-recovery");
 
+	if (kvm->arch.nx_huge_page_recovery_thread)
+		vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
+}
+
+int kvm_mmu_post_init_vm(struct kvm *kvm)
+{
+	if (nx_hugepage_mitigation_hard_disabled)
+		return 0;
+
+	call_once(&kvm->arch.nx_once, kvm_mmu_start_lpage_recovery);
 	if (!kvm->arch.nx_huge_page_recovery_thread)
 		return -ENOMEM;
-
-	vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
 	return 0;
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c79a8cc57ba42..23bf088fc4ae1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11463,6 +11463,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	struct kvm_run *kvm_run = vcpu->run;
 	int r;
 
+	r = kvm_mmu_post_init_vm(vcpu->kvm);
+	if (r)
+		return r;
+
 	vcpu_load(vcpu);
 	kvm_sigset_activate(vcpu);
 	kvm_run->flags = 0;
@@ -12742,7 +12746,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 int kvm_arch_post_init_vm(struct kvm *kvm)
 {
-	return kvm_mmu_post_init_vm(kvm);
+	once_init(&kvm->arch.nx_once);
+	return 0;
 }
 
 static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
--

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Alyssa Ross 1 year ago

On Wed, Jan 15, 2025 at 12:03:27PM -0700, Keith Busch wrote:
> On Wed, Jan 15, 2025 at 06:10:05PM +0100, Paolo Bonzini wrote:
> > You can implement something like pthread_once():
>
> ...
>
> > Where to put it I don't know.  It doesn't belong in
> > include/linux/once.h.  I'm okay with arch/x86/kvm/call_once.h and just
> > pull it with #include "call_once.h".
>
> Thanks for the suggestion, I can work with that. As to where to put it,
> I think the new 'struct once' needs to be a member of struct kvm_arch,
> so I've put it in arch/x86/include/asm/.
>
> Here's the result with that folded in. If this is okay, I'll send a v2,
> and can split out the call_once as a prep patch with your attribution if
> you like.

Has there been any progress here?  I'm also affected by the crosvm
regression, and it's been backported to the LTS stable kernel.

(CCing the stable and regressions lists to make sure the regression is
tracked.)

#regzbot introduced: d96c77bd4eeb

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Keith Busch 1 year ago

On Wed, Jan 22, 2025 at 12:38:25PM +0100, Alyssa Ross wrote:
> On Wed, Jan 15, 2025 at 12:03:27PM -0700, Keith Busch wrote:
> > On Wed, Jan 15, 2025 at 06:10:05PM +0100, Paolo Bonzini wrote:
> > > You can implement something like pthread_once():
> >
> > ...
> >
> > > Where to put it I don't know.  It doesn't belong in
> > > include/linux/once.h.  I'm okay with arch/x86/kvm/call_once.h and just
> > > pull it with #include "call_once.h".
> >
> > Thanks for the suggestion, I can work with that. As to where to put it,
> > I think the new 'struct once' needs to be a member of struct kvm_arch,
> > so I've put it in arch/x86/include/asm/.
> >
> > Here's the result with that folded in. If this is okay, I'll send a v2,
> > and can split out the call_once as a prep patch with your attribution if
> > you like.
> 
> Has there been any progress here?  I'm also affected by the crosvm
> regression, and it's been backported to the LTS stable kernel.

Would you be able to try the proposed patch here and reply with a
Tested-by if it's successful for you? I'd also like to unblock this,
whether this patch is in the right direction or try something else.

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Alyssa Ross 1 year ago

Keith Busch <kbusch@kernel.org> writes:

> On Wed, Jan 22, 2025 at 12:38:25PM +0100, Alyssa Ross wrote:
>> On Wed, Jan 15, 2025 at 12:03:27PM -0700, Keith Busch wrote:
>> > On Wed, Jan 15, 2025 at 06:10:05PM +0100, Paolo Bonzini wrote:
>> > > You can implement something like pthread_once():
>> >
>> > ...
>> >
>> > > Where to put it I don't know.  It doesn't belong in
>> > > include/linux/once.h.  I'm okay with arch/x86/kvm/call_once.h and just
>> > > pull it with #include "call_once.h".
>> >
>> > Thanks for the suggestion, I can work with that. As to where to put it,
>> > I think the new 'struct once' needs to be a member of struct kvm_arch,
>> > so I've put it in arch/x86/include/asm/.
>> >
>> > Here's the result with that folded in. If this is okay, I'll send a v2,
>> > and can split out the call_once as a prep patch with your attribution if
>> > you like.
>> 
>> Has there been any progress here?  I'm also affected by the crosvm
>> regression, and it's been backported to the LTS stable kernel.
>
> Would you be able to try the proposed patch here and reply with a
> Tested-by if it's successful for you? I'd also like to unblock this,
> whether this patch is in the right direction or try something else.

Sure!  I can confirm this patch fixes crosvm for me when applied to 6.13.

Tested-by: Alyssa Ross <hi@alyssa.is>

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Keith Busch 1 year, 1 month ago

On Thu, Dec 19, 2024 at 09:30:16PM +0100, Paolo Bonzini wrote:
> On Thu, Dec 19, 2024 at 7:09 PM Keith Busch <kbusch@kernel.org> wrote:
> > > Is crosvm trying to do anything but exec?  If not, it should probably use the
> > > flag.
> >
> > Good point, and I'm not sure right now. I don't think I know any crosvm
> > developer experts but I'm working on that to get a better explanation of
> > what's happening,
> 
> Ok, I found the code and it doesn't exec (e.g.
> https://github.com/google/crosvm/blob/b339d3d7/src/crosvm/sys/linux/jail_warden.rs#L122),
> so that's not an option. 

Thanks, I was slowly getting there too. It's been a while since I had to
work with the languange, so I'm a bit rusty (no pun intended) at
navigating.

> Well, if I understand correctly from a
> cursory look at the code, crosvm is creating a jailed child process
> early, and then spawns further jails through it; so it's just this
> first process that has to cheat.
> 
> One possibility on the KVM side is to delay creating the vhost_task
> until the first KVM_RUN. I don't like it but...
> 
> I think we should nevertheless add something to the status file in
> procfs, that makes it easy to detect kernel tasks (PF_KTHREAD |
> PF_IO_WORKER | PF_USER_WORKER).

I currently think excluding kernel tasks from this check probably aligns
with what it's trying to do, so anything to make that easier is a good
step, IMO.

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Paolo Bonzini 1 year, 1 month ago

On 12/19/24 23:23, Keith Busch wrote:
> On Thu, Dec 19, 2024 at 09:30:16PM +0100, Paolo Bonzini wrote:
>> On Thu, Dec 19, 2024 at 7:09 PM Keith Busch <kbusch@kernel.org> wrote:
>>>> Is crosvm trying to do anything but exec?  If not, it should probably use the
>>>> flag.
>>>
>>> Good point, and I'm not sure right now. I don't think I know any crosvm
>>> developer experts but I'm working on that to get a better explanation of
>>> what's happening,
>>
>> Ok, I found the code and it doesn't exec (e.g.
>> https://github.com/google/crosvm/blob/b339d3d7/src/crosvm/sys/linux/jail_warden.rs#L122),
>> so that's not an option.
> 
> Thanks, I was slowly getting there too. It's been a while since I had to
> work with the languange, so I'm a bit rusty (no pun intended) at
> navigating.
> 
>> Well, if I understand correctly from a
>> cursory look at the code, crosvm is creating a jailed child process
>> early, and then spawns further jails through it; so it's just this
>> first process that has to cheat.
>>
>> One possibility on the KVM side is to delay creating the vhost_task
>> until the first KVM_RUN. I don't like it but...
>>
>> I think we should nevertheless add something to the status file in
>> procfs, that makes it easy to detect kernel tasks (PF_KTHREAD |
>> PF_IO_WORKER | PF_USER_WORKER).
> 
> I currently think excluding kernel tasks from this check probably aligns
> with what it's trying to do, so anything to make that easier is a good
> step, IMO.
> 

It could be as simple as this on the kernel side: [adding Jens for
a first look]

=============== 8< ===========
From: Paolo Bonzini <pbonzini@redhat.com>
Subject: [PATCH] fs: proc: mark user and I/O workers as "kernel threads"

A Rust library called "minijail" is looking at procfs to check if
the current task has multiple threads, and to prevent fork() if it
does.  This is because fork() is in general ill-advised in
multi-threaded programs, for example if another thread might have
taken locks.

However, this attempt falls afoul of kernel threads that are children
of the user process that they serve.  These are not a problem when
forking, but they are still present in procfs.  The library should
discard them, but there is currently no way for userspace to detect
PF_USER_WORKER or PF_IO_WORKER threads.

The closest is the "Kthread" key in /proc/PID/task/TID/status.  Extend
it instead of introducing another keyl tasks that are marked with
PF_USER_WORKER or PF_IO_WORKER are not kthreads, but they are close
enough for basically all intents and purposes.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 34a47fb0c57f..f702fb50c8ef 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -221,7 +221,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
  #endif
  	seq_putc(m, '\n');
  
-	seq_printf(m, "Kthread:\t%c\n", p->flags & PF_KTHREAD ? '1' : '0');
+	seq_printf(m, "Kthread:\t%c\n", p->flags & (PF_KTHREAD | PF_USER_WORKER | PF_IO_WORKER) ? '1' : '0');
  }
  
  void render_sigset_t(struct seq_file *m, const char *header,


Paolo

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Keith Busch 1 year, 1 month ago

On Thu, Dec 19, 2024 at 11:57:40PM +0100, Paolo Bonzini wrote:
> It could be as simple as this on the kernel side: [adding Jens for
> a first look]

Cc'ing his @kernel email; first message:

	https://lore.kernel.org/linux-kernel/20241108130737.126567-1-pbonzini@redhat.com/
 
> =============== 8< ===========
> From: Paolo Bonzini <pbonzini@redhat.com>
> Subject: [PATCH] fs: proc: mark user and I/O workers as "kernel threads"
> 
> A Rust library called "minijail" is looking at procfs to check if
> the current task has multiple threads, and to prevent fork() if it
> does.  This is because fork() is in general ill-advised in
> multi-threaded programs, for example if another thread might have
> taken locks.
> 
> However, this attempt falls afoul of kernel threads that are children
> of the user process that they serve.  These are not a problem when
> forking, but they are still present in procfs.  The library should
> discard them, but there is currently no way for userspace to detect
> PF_USER_WORKER or PF_IO_WORKER threads.
> 
> The closest is the "Kthread" key in /proc/PID/task/TID/status.  Extend
> it instead of introducing another keyl tasks that are marked with
> PF_USER_WORKER or PF_IO_WORKER are not kthreads, but they are close
> enough for basically all intents and purposes.

Yes, this looks good to me. But I also would have thought the original
patch was safe too :). Hopefully nothing depends on the current value
for these... there is often something making it difficult to have nice
things.

> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 34a47fb0c57f..f702fb50c8ef 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -221,7 +221,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>  #endif
>  	seq_putc(m, '\n');
> -	seq_printf(m, "Kthread:\t%c\n", p->flags & PF_KTHREAD ? '1' : '0');
> +	seq_printf(m, "Kthread:\t%c\n", p->flags & (PF_KTHREAD | PF_USER_WORKER | PF_IO_WORKER) ? '1' : '0');
>  }
>  void render_sigset_t(struct seq_file *m, const char *header,UBLK_U_IO_REGISTER_ZC_RING

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Michal Koutný 1 year, 2 months ago

On Fri, Nov 08, 2024 at 08:07:37AM GMT, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Since the worker kthread is tied to a user process, it's better if
> it behaves similarly to user tasks as much as possible, including
> being able to send SIGSTOP and SIGCONT.

Do you mean s/send/receive/?

Consequently, it's OK if a (possibly unprivileged) user stops this
thread forever (they only harm themselves, not the rest of the system),
correct?

> In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted
> to be and more: not only it inherits the userspace process's cgroups,
> it has other niceties like being parented properly in the process
> tree.  Use it instead of the homegrown alternative.

It is nice indeed.
I think the bugs we saw are not so serious to warrant
Fixes: c57c80467f90e ("kvm: Add helper function for creating VM worker threads")
.
(But I'm posting it here so that I can find the reference later.)

Thanks,
Michal

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Paolo Bonzini 1 year, 2 months ago

On Fri, Nov 15, 2024 at 5:59 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Nov 08, 2024 at 08:07:37AM GMT, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Since the worker kthread is tied to a user process, it's better if
> > it behaves similarly to user tasks as much as possible, including
> > being able to send SIGSTOP and SIGCONT.
>
> Do you mean s/send/receive/?

I mean being able to send it to the threads with an effect.

> Consequently, it's OK if a (possibly unprivileged) user stops this
> thread forever (they only harm themselves, not the rest of the system),
> correct?

Yes, they will run with fewer huge pages and worse TLB performance.

Paolo

> > In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted
> > to be and more: not only it inherits the userspace process's cgroups,
> > it has other niceties like being parented properly in the process
> > tree.  Use it instead of the homegrown alternative.
>
> It is nice indeed.
> I think the bugs we saw are not so serious to warrant
> Fixes: c57c80467f90e ("kvm: Add helper function for creating VM worker threads")
> .
> (But I'm posting it here so that I can find the reference later.)

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Michal Koutný 1 year, 2 months ago

On Mon, Nov 18, 2024 at 01:42:27PM GMT, Paolo Bonzini <pbonzini@redhat.com> wrote:
> I mean being able to send it to the threads with an effect.

Understood now.

> > Consequently, it's OK if a (possibly unprivileged) user stops this
> > thread forever (they only harm themselves, not the rest of the system),
> > correct?
> 
> Yes, they will run with fewer huge pages and worse TLB performance.

Alright.

Thanks,
Michal

> > It is nice indeed.
> > I think the bugs we saw are not so serious to warrant
> > Fixes: c57c80467f90e ("kvm: Add helper function for creating VM worker threads")

I'm mainly posting this because there are some people surprised this
didn't get to 6.12. Hence I wonder if Cc: stable would be helpful here.

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Paolo Bonzini 1 year, 2 months ago

On Mon, Nov 25, 2024 at 10:01 AM Michal Koutný <mkoutny@suse.com> wrote:
> > > I think the bugs we saw are not so serious to warrant
> > > Fixes: c57c80467f90e ("kvm: Add helper function for creating VM worker threads")
>
> I'm mainly posting this because there are some people surprised this
> didn't get to 6.12. Hence I wonder if Cc: stable would be helpful here.

Yes, the patch didn't make it to 6.12 because I wanted to get reviews
(which I did and resulted in changes), but it is in 6.13 and Cc:
stable is there.

Paolo

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Sean Christopherson 1 year, 2 months ago

On Fri, Nov 08, 2024, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8e853a5fc867..d5af4f8c5a6a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7281,7 +7281,7 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
>  			kvm_mmu_zap_all_fast(kvm);
>  			mutex_unlock(&kvm->slots_lock);
>  
> -			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
> +			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
>  		}
>  		mutex_unlock(&kvm_lock);
>  	}
> @@ -7427,7 +7427,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
>  		mutex_lock(&kvm_lock);
>  
>  		list_for_each_entry(kvm, &vm_list, vm_list)
> -			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
> +			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
>  
>  		mutex_unlock(&kvm_lock);
>  	}
> @@ -7530,62 +7530,65 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
>  	srcu_read_unlock(&kvm->srcu, rcu_idx);
>  }
>  
> -static long get_nx_huge_page_recovery_timeout(u64 start_time)
> +#define NX_HUGE_PAGE_DISABLED (-1)

I don't see any point in using -1.  That's more legal (though still impossible
and absurd) than an deadline of '0'.  And it's somewhat confusing because KVM
uses -1 for the default nx_huge_pages value to indicate "enable the NX huge page
mitigation if the CPU is vulnerable to L1TF", not "disable the mitigation".

> +static u64 get_nx_huge_page_recovery_next(void)
>  {
>  	bool enabled;
>  	uint period;
>  
>  	enabled = calc_nx_huge_pages_recovery_period(&period);
>  
> -	return enabled ? start_time + msecs_to_jiffies(period) - get_jiffies_64()
> -		       : MAX_SCHEDULE_TIMEOUT;
> +	return enabled ? get_jiffies_64() + msecs_to_jiffies(period)
> +		: NX_HUGE_PAGE_DISABLED;

Please align the '?' and ':' to show that they are related paths of the ternary
operator.  Moot point if we go without a literal '0'.

>  }
>  
> -static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
> +static void kvm_nx_huge_page_recovery_worker_kill(void *data)
>  {
> -	u64 start_time;
> +}
> +
> +static bool kvm_nx_huge_page_recovery_worker(void *data)
> +{
> +	struct kvm *kvm = data;
>  	long remaining_time;
>  
> -	while (true) {
> -		start_time = get_jiffies_64();
> -		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
> +	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
> +		return false;

The "next" concept is broken.  Once KVM sees NX_HUGE_PAGE_DISABLED for a given VM,
KVM will never re-evaluate nx_huge_page_next.  Similarly, if the recovery period
and/or ratio changes, KVM won't recompute the "next" time until the current timeout
has expired.

I fiddled around with various ideas, but I don't see a better solution that something
along the lines of KVM's request system, e.g. set a bool to indicate the params
changed, and sprinkle smp_{r,w}mb() barriers to ensure the vhost task sees the
new params.

FWIW, I also found "next" to be confusing.  How about "deadline"? KVM uses that
terminology for the APIC timer, i.e. it's familiar, intuitive, and accurate(ish).

Something like this as fixup?  (comments would be nice)

---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu/mmu.c          | 34 +++++++++++++++++++++------------
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 72f3bcfc54d7..e9fb8b9a9c2b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1444,7 +1444,8 @@ struct kvm_arch {
 
 	struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter;
 	struct vhost_task *nx_huge_page_recovery_thread;
-	u64 nx_huge_page_next;
+	u64 nx_huge_page_deadline;
+	bool nx_huge_page_params_changed;
 
 #ifdef CONFIG_X86_64
 	/* The number of TDP MMU pages across all roots. */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d0c2d9d2588f..acfa14d4248b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7102,6 +7102,13 @@ static void mmu_destroy_caches(void)
 	kmem_cache_destroy(mmu_page_header_cache);
 }
 
+static void mmu_wake_nx_huge_page_task(struct kvm *kvm)
+{
+	smp_wmb();
+	WRITE_ONCE(kvm->arch.nx_huge_page_deadline, true);
+	vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
+}
+
 static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
 {
 	if (nx_hugepage_mitigation_hard_disabled)
@@ -7162,7 +7169,7 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 			kvm_mmu_zap_all_fast(kvm);
 			mutex_unlock(&kvm->slots_lock);
 
-			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
+			mmu_wake_nx_huge_page_task(kvm);
 		}
 		mutex_unlock(&kvm_lock);
 	}
@@ -7291,7 +7298,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 		mutex_lock(&kvm_lock);
 
 		list_for_each_entry(kvm, &vm_list, vm_list)
-			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
+			mmu_wake_nx_huge_page_task(kvm);
 
 		mutex_unlock(&kvm_lock);
 	}
@@ -7394,17 +7401,14 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
-#define NX_HUGE_PAGE_DISABLED (-1)
-
-static u64 get_nx_huge_page_recovery_next(void)
+static u64 get_nx_huge_page_recovery_deadline(void)
 {
 	bool enabled;
 	uint period;
 
 	enabled = calc_nx_huge_pages_recovery_period(&period);
 
-	return enabled ? get_jiffies_64() + msecs_to_jiffies(period)
-		: NX_HUGE_PAGE_DISABLED;
+	return enabled ? get_jiffies_64() + msecs_to_jiffies(period) : 0;
 }
 
 static void kvm_nx_huge_page_recovery_worker_kill(void *data)
@@ -7416,10 +7420,16 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
 	struct kvm *kvm = data;
 	long remaining_time;
 
-	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
+	if (READ_ONCE(kvm->arch.nx_huge_page_params_changed)) {
+		smp_rmb();
+		WRITE_ONCE(kvm->arch.nx_huge_page_params_changed, false);
+		kvm->arch.nx_huge_page_deadline = get_nx_huge_page_recovery_deadline();
+	}
+
+	if (!kvm->arch.nx_huge_page_deadline)
 		return false;
 
-	remaining_time = kvm->arch.nx_huge_page_next - get_jiffies_64();
+	remaining_time = kvm->arch.nx_huge_page_deadline - get_jiffies_64();
 	if (remaining_time > 0) {
 		schedule_timeout(remaining_time);
 		/* check for signals and come back */
@@ -7428,7 +7438,7 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
 
 	__set_current_state(TASK_RUNNING);
 	kvm_recover_nx_huge_pages(kvm);
-	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	kvm->arch.nx_huge_page_deadline = get_nx_huge_page_recovery_deadline();
 	return true;
 }
 
@@ -7437,11 +7447,11 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
 	if (nx_hugepage_mitigation_hard_disabled)
 		return 0;
 
-	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	WRITE_ONCE(kvm->arch.nx_huge_page_params_changed, true);
 	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
 		kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
 		kvm, "kvm-nx-lpage-recovery");
-	
+
 	if (!kvm->arch.nx_huge_page_recovery_thread)
 		return -ENOMEM;
 

base-commit: 922a5630cd31e4414f964aa64f45a5884f40188c
--

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Paolo Bonzini 1 year, 2 months ago

On 11/14/24 00:56, Sean Christopherson wrote:
>> +static bool kvm_nx_huge_page_recovery_worker(void *data)
>> +{
>> +	struct kvm *kvm = data;
>>   	long remaining_time;
>>   
>> -	while (true) {
>> -		start_time = get_jiffies_64();
>> -		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
>> +	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
>> +		return false;
> 
> The "next" concept is broken.  Once KVM sees NX_HUGE_PAGE_DISABLED for a given VM,
> KVM will never re-evaluate nx_huge_page_next.  Similarly, if the recovery period
> and/or ratio changes, KVM won't recompute the "next" time until the current timeout
> has expired.
> 
> I fiddled around with various ideas, but I don't see a better solution that something
> along the lines of KVM's request system, e.g. set a bool to indicate the params
> changed, and sprinkle smp_{r,w}mb() barriers to ensure the vhost task sees the
> new params.

"next" is broken, but there is a much better way to fix it.  You just
track the *last* time that the recovery ran.  This is also better
behaved when you flip recovery back and forth to disabled and back
to enabled: if your recovery period is 1 minute, it will run the
next recovery after 1 minute independent of how many times you flipped
the parameter.

This also fits much better with the "restart function after
schedule()" model that vhost_task.c requires, and it gets rid of
get_nx_huge_page_recovery_whatever() completely.  It's also silly how
close the code is to the broken v1, which is why I'm sending it below
instead of sending again a large patchdd.

Paolo

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 72f3bcfc54d7..e159e44a6a1b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1444,7 +1444,7 @@ struct kvm_arch {
  
  	struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter;
  	struct vhost_task *nx_huge_page_recovery_thread;
-	u64 nx_huge_page_next;
+	u64 nx_huge_page_last;
  
  #ifdef CONFIG_X86_64
  	/* The number of TDP MMU pages across all roots. */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d0c2d9d2588f..22e7ad235123 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7394,19 +7394,6 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
  	srcu_read_unlock(&kvm->srcu, rcu_idx);
  }
  
-#define NX_HUGE_PAGE_DISABLED (-1)
-
-static u64 get_nx_huge_page_recovery_next(void)
-{
-	bool enabled;
-	uint period;
-
-	enabled = calc_nx_huge_pages_recovery_period(&period);
-
-	return enabled ? get_jiffies_64() + msecs_to_jiffies(period)
-		: NX_HUGE_PAGE_DISABLED;
-}
-
  static void kvm_nx_huge_page_recovery_worker_kill(void *data)
  {
  }
@@ -7414,12 +7401,16 @@ static void kvm_nx_huge_page_recovery_worker_kill(void *data)
  static bool kvm_nx_huge_page_recovery_worker(void *data)
  {
  	struct kvm *kvm = data;
+	bool enabled;
+	uint period;
  	long remaining_time;
  
-	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
+	enabled = calc_nx_huge_pages_recovery_period(&period);
+	if (!enabled)
  		return false;
  
-	remaining_time = kvm->arch.nx_huge_page_next - get_jiffies_64();
+	remaining_time = kvm->arch.nx_huge_page_last + msecs_to_jiffies(period)
+		- get_jiffies_64();
  	if (remaining_time > 0) {
  		schedule_timeout(remaining_time);
  		/* check for signals and come back */
@@ -7428,7 +7419,7 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
  
  	__set_current_state(TASK_RUNNING);
  	kvm_recover_nx_huge_pages(kvm);
-	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	kvm->arch.nx_huge_page_last = get_jiffies_64();
  	return true;
  }
  
@@ -7437,11 +7428,11 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
  	if (nx_hugepage_mitigation_hard_disabled)
  		return 0;
  
-	kvm->arch.nx_huge_page_next = get_nx_huge_page_recovery_next();
+	kvm->arch.nx_huge_page_last = get_jiffies_64();
  	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
  		kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
  		kvm, "kvm-nx-lpage-recovery");
-	
+
  	if (!kvm->arch.nx_huge_page_recovery_thread)
  		return -ENOMEM;

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Sean Christopherson 1 year, 2 months ago

On Thu, Nov 14, 2024, Paolo Bonzini wrote:
> On 11/14/24 00:56, Sean Christopherson wrote:
> > > +static bool kvm_nx_huge_page_recovery_worker(void *data)
> > > +{
> > > +	struct kvm *kvm = data;
> > >   	long remaining_time;
> > > -	while (true) {
> > > -		start_time = get_jiffies_64();
> > > -		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
> > > +	if (kvm->arch.nx_huge_page_next == NX_HUGE_PAGE_DISABLED)
> > > +		return false;
> > 
> > The "next" concept is broken.  Once KVM sees NX_HUGE_PAGE_DISABLED for a given VM,
> > KVM will never re-evaluate nx_huge_page_next.  Similarly, if the recovery period
> > and/or ratio changes, KVM won't recompute the "next" time until the current timeout
> > has expired.
> > 
> > I fiddled around with various ideas, but I don't see a better solution that something
> > along the lines of KVM's request system, e.g. set a bool to indicate the params
> > changed, and sprinkle smp_{r,w}mb() barriers to ensure the vhost task sees the
> > new params.
> 
> "next" is broken, but there is a much better way to fix it.  You just
> track the *last* time that the recovery ran.  This is also better
> behaved when you flip recovery back and forth to disabled and back
> to enabled: if your recovery period is 1 minute, it will run the
> next recovery after 1 minute independent of how many times you flipped
> the parameter.

Heh, I my brain was trying to get there last night, but I couldn't quite piece
things together.

Reviewed-by: Sean Christopherson <seanjc@google.com>

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Tejun Heo 1 year, 3 months ago

On Fri, Nov 08, 2024 at 08:07:37AM -0500, Paolo Bonzini wrote:
...
> Since the worker kthread is tied to a user process, it's better if
> it behaves similarly to user tasks as much as possible, including
> being able to send SIGSTOP and SIGCONT.  In fact, vhost_task is all
> that kvm_vm_create_worker_thread() wanted to be and more: not only it
> inherits the userspace process's cgroups, it has other niceties like
> being parented properly in the process tree.  Use it instead of the
> homegrown alternative.

Didn't about vhost_task. That looks perfect. From cgroup POV:

  Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

Re: [PATCH] KVM: x86: switch hugepage recovery thread to vhost_task

Posted by Luca Boccassi 1 year, 3 months ago

On Fri, 8 Nov 2024 at 16:53, Tejun Heo <tj@kernel.org> wrote:
>
> On Fri, Nov 08, 2024 at 08:07:37AM -0500, Paolo Bonzini wrote:
> ...
> > Since the worker kthread is tied to a user process, it's better if
> > it behaves similarly to user tasks as much as possible, including
> > being able to send SIGSTOP and SIGCONT.  In fact, vhost_task is all
> > that kvm_vm_create_worker_thread() wanted to be and more: not only it
> > inherits the userspace process's cgroups, it has other niceties like
> > being parented properly in the process tree.  Use it instead of the
> > homegrown alternative.
>
> Didn't about vhost_task. That looks perfect. From cgroup POV:
>
>   Acked-by: Tejun Heo <tj@kernel.org>
>
> Thanks.

Thanks, tested on my machine by applying it to kernel 6.11.5 and can
confirm the issues are gone, freezing the cgroup works and everything
else too.

Could you please CC stable so that it can get backported?

Tested-by: Luca Boccassi <bluca@debian.org>