[PATCH v10 0/8] Use copy_process in vhost layer

Mike Christie posted 8 patches 3 years, 10 months ago
There is a newer version of this series
[PATCH v10 0/8] Use copy_process in vhost layer
Posted by Mike Christie 3 years, 10 months ago
The following patches were made over Linus's tree.

Eric and Christian, the vhost maintainer, Michael Tsirkin has ACK'd the
patches. I haven't got any more comments from you guys for a couple
postings now (Jan 8 was the last reply). Are you guys ok to merge them?

For everyone else that hasn't see this before, the patches allow the
vhost layer to do a copy_process on the thread that does the
VHOST_SET_OWNER ioctl like how io_uring does a copy_process against its
userspace app. This allows the vhost layer's worker threads to inherit
cgroups, namespaces, address space, etc and this worker thread will also
be accounted for against that owner/parent process's RLIMIT_NPROC limit.

If you are not familiar with qemu and vhost here is more detailed
problem description:

Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.

The problem with this approach is that we then have to add new functions/
args/functionality for every thing we want to inherit. I started doing
that here:

https://lkml.org/lkml/2021/6/23/1233

for the RLIMIT_NPROC check, but it seems it might be easier to just
inherit everything from the beginning, becuase I'd need to do something
like that patch several times.

V10:
- Eric's cleanup patches my vhost flush cleanup patches are merged
upstream, so rebase against Linus's tree which has everything.
V9:
- Rebase against Eric's kthread-cleanups-for-v5.19 branch. Drop patches
no longer needed due to kernel clone arg and pf io worker patches in that
branch.
V8:
- Fix kzalloc GFP use.
- Fix email subject version number.
V7:
- Drop generic user_worker_* helpers and replace with vhost_task specific
  ones.
- Drop autoreap patch. Use kernel_wait4 instead.
- Fix issue where vhost.ko could be removed while the worker function is
  still running.
V6:
- Rename kernel_worker to user_worker and fix prefixes.
- Add better patch descriptions.
V5:
- Handle kbuild errors by building patchset against current kernel that
  has all deps merged. Also add patch to remove create_io_thread code as
  it's not used anymore.
- Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
  case added in 5.16-rc1.
- Add PF_USER_WORKER flag so we can check it later after the initial
  thread creation for the wake up, vm and singal cses.
- Added patch to auto reap the worker thread.
V4:
- Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
- Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
  added the new kernel worker functions.
- Fixed extra "i" issue.
- Added PF_USER_WORKER flag and added check that kernel_worker_start users
  had that flag set. Also dropped patches that passed worker flags to
  copy_thread and replaced with PF_USER_WORKER check.
V3:
- Add parentheses in p->flag and work_flags check in copy_thread.
- Fix check in arm/arm64 which was doing the reverse of other archs
  where it did likely(!flags) instead of unlikely(flags).
V2:
- Rename kernel_copy_process to kernel_worker.
- Instead of exporting functions, make kernel_worker() a proper
  function/API that does common work for the caller.
- Instead of adding new fields to kernel_clone_args for each option
  make it flag based similar to CLONE_*.
- Drop unused completion struct in vhost.
- Fix compile warnings by merging vhost cgroup cleanup patch and
  vhost conversion patch.
Re: [PATCH v10 0/8] Use copy_process in vhost layer
Posted by Mike Christie 3 years, 9 months ago
Eric and Christian, Ping?

If you guys don't like these patches anymore what about something
simple like just exporting some helpers to update and check a task's
nproc limit. Something like this:


diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 81cab4b01edc..71b5946be792 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -98,6 +98,10 @@ int kernel_wait(pid_t pid, int *stat);
 
 extern void free_task(struct task_struct *tsk);
 
+extern bool task_is_over_nproc_limit(struct task_struct *tsk);
+extern void task_inc_nproc(struct task_struct *tsk);
+extern void task_dec_nproc(struct task_struct *tsk);
+
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
 extern void sched_exec(void);
diff --git a/kernel/cred.c b/kernel/cred.c
index e10c15f51c1f..c15e7b926013 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -358,7 +358,7 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 		kdebug("share_creds(%p{%d,%d})",
 		       p->cred, atomic_read(&p->cred->usage),
 		       read_cred_subscribers(p->cred));
-		inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
+		task_inc_nproc(p);
 		return 0;
 	}
 
@@ -395,7 +395,7 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 #endif
 
 	p->cred = p->real_cred = get_cred(new);
-	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
+	task_inc_nproc(p);
 	alter_cred_subscribers(new, 2);
 	validate_creds(new);
 	return 0;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9d44f2d46c69..88dbe3458d7d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1964,6 +1964,32 @@ static void copy_oom_score_adj(u64 clone_flags, struct task_struct *tsk)
 	mutex_unlock(&oom_adj_mutex);
 }
 
+bool task_is_over_nproc_limit(struct task_struct *tsk)
+{
+	if (!is_ucounts_overlimit(task_ucounts(tsk), UCOUNT_RLIMIT_NPROC,
+	    task_rlimit(tsk, RLIMIT_NPROC)))
+		return false;
+
+	if (tsk->real_cred->user != INIT_USER && !capable(CAP_SYS_RESOURCE) &&
+	    !capable(CAP_SYS_ADMIN))
+		return true;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(task_is_over_nproc_limit);
+
+void task_inc_nproc(struct task_struct *tsk)
+{
+	inc_rlimit_ucounts(task_ucounts(tsk), UCOUNT_RLIMIT_NPROC, 1);
+}
+EXPORT_SYMBOL_GPL(task_inc_nproc);
+
+void task_dec_nproc(struct task_struct *tsk)
+{
+	dec_rlimit_ucounts(task_ucounts(tsk), UCOUNT_RLIMIT_NPROC, 1);
+}
+EXPORT_SYMBOL_GPL(task_dec_nproc);
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -2102,11 +2128,8 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_free;
 
 	retval = -EAGAIN;
-	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
-		if (p->real_cred->user != INIT_USER &&
-		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
-			goto bad_fork_cleanup_count;
-	}
+	if (task_is_over_nproc_limit(p))
+		goto bad_fork_cleanup_count;
 	current->flags &= ~PF_NPROC_EXCEEDED;
 
 	/*
@@ -2526,7 +2549,7 @@ static __latent_entropy struct task_struct *copy_process(
 bad_fork_cleanup_delayacct:
 	delayacct_tsk_free(p);
 bad_fork_cleanup_count:
-	dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
+	task_dec_nproc(p);
 	exit_creds(p);
 bad_fork_free:
 	WRITE_ONCE(p->__state, TASK_DEAD);


On 6/19/22 8:13 PM, Mike Christie wrote:
> The following patches were made over Linus's tree.
> 
> Eric and Christian, the vhost maintainer, Michael Tsirkin has ACK'd the
> patches. I haven't got any more comments from you guys for a couple
> postings now (Jan 8 was the last reply). Are you guys ok to merge them?
> 
> For everyone else that hasn't see this before, the patches allow the
> vhost layer to do a copy_process on the thread that does the
> VHOST_SET_OWNER ioctl like how io_uring does a copy_process against its
> userspace app. This allows the vhost layer's worker threads to inherit
> cgroups, namespaces, address space, etc and this worker thread will also
> be accounted for against that owner/parent process's RLIMIT_NPROC limit.
> 
> If you are not familiar with qemu and vhost here is more detailed
> problem description:
> 
> Qemu will create vhost devices in the kernel which perform network, SCSI,
> etc IO and management operations from worker threads created by the
> kthread API. Because the kthread API does a copy_process on the kthreadd
> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> thread's cgroups.
> 
> The problem with this approach is that we then have to add new functions/
> args/functionality for every thing we want to inherit. I started doing
> that here:
> 
> https://lkml.org/lkml/2021/6/23/1233
> 
> for the RLIMIT_NPROC check, but it seems it might be easier to just
> inherit everything from the beginning, becuase I'd need to do something
> like that patch several times.
> 
> V10:
> - Eric's cleanup patches my vhost flush cleanup patches are merged
> upstream, so rebase against Linus's tree which has everything.
> V9:
> - Rebase against Eric's kthread-cleanups-for-v5.19 branch. Drop patches
> no longer needed due to kernel clone arg and pf io worker patches in that
> branch.
> V8:
> - Fix kzalloc GFP use.
> - Fix email subject version number.
> V7:
> - Drop generic user_worker_* helpers and replace with vhost_task specific
>   ones.
> - Drop autoreap patch. Use kernel_wait4 instead.
> - Fix issue where vhost.ko could be removed while the worker function is
>   still running.
> V6:
> - Rename kernel_worker to user_worker and fix prefixes.
> - Add better patch descriptions.
> V5:
> - Handle kbuild errors by building patchset against current kernel that
>   has all deps merged. Also add patch to remove create_io_thread code as
>   it's not used anymore.
> - Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
>   case added in 5.16-rc1.
> - Add PF_USER_WORKER flag so we can check it later after the initial
>   thread creation for the wake up, vm and singal cses.
> - Added patch to auto reap the worker thread.
> V4:
> - Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
> - Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
>   added the new kernel worker functions.
> - Fixed extra "i" issue.
> - Added PF_USER_WORKER flag and added check that kernel_worker_start users
>   had that flag set. Also dropped patches that passed worker flags to
>   copy_thread and replaced with PF_USER_WORKER check.
> V3:
> - Add parentheses in p->flag and work_flags check in copy_thread.
> - Fix check in arm/arm64 which was doing the reverse of other archs
>   where it did likely(!flags) instead of unlikely(flags).
> V2:
> - Rename kernel_copy_process to kernel_worker.
> - Instead of exporting functions, make kernel_worker() a proper
>   function/API that does common work for the caller.
> - Instead of adding new fields to kernel_clone_args for each option
>   make it flag based similar to CLONE_*.
> - Drop unused completion struct in vhost.
> - Fix compile warnings by merging vhost cgroup cleanup patch and
>   vhost conversion patch.
> 
> 
> 
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v10 0/8] Use copy_process in vhost layer
Posted by Christoph Hellwig 3 years, 9 months ago
On Mon, Jul 11, 2022 at 11:44:14PM -0500, Mike Christie wrote:
> Eric and Christian, Ping?
> 
> If you guys don't like these patches anymore what about something
> simple like just exporting some helpers to update and check a task's
> nproc limit. Something like this:

I'm not Eric or Christian and don't really feel qualified on the
technical aspects here, but please drop the pointless externs while
you're at it :)
Re: [PATCH v10 0/8] Use copy_process in vhost layer
Posted by Christian Brauner 3 years, 9 months ago
On Mon, Jul 11, 2022 at 09:54:45PM -0700, Christoph Hellwig wrote:
> On Mon, Jul 11, 2022 at 11:44:14PM -0500, Mike Christie wrote:
> > Eric and Christian, Ping?
> > 
> > If you guys don't like these patches anymore what about something
> > simple like just exporting some helpers to update and check a task's
> > nproc limit. Something like this:
> 
> I'm not Eric or Christian and don't really feel qualified on the
> technical aspects here, but please drop the pointless externs while
> you're at it :)

I'm on vacation but I'll try to look once I get back.

Christian