From: Christian Brauner
Date: Thu, 26 Feb 2026 14:51:01 +0100
Subject: [PATCH v5 3/6] pidfd: add CLONE_PIDFD_AUTOKILL
Message-Id: <20260226-work-pidfs-autoreap-v5-3-d148b984a989@kernel.org>
References: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
In-Reply-To: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
To: Oleg Nesterov, Jann Horn
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 Christian Brauner

Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's lifetime
to the pidfd returned from clone3(). When the last reference to the
struct file created by clone3() is closed the kernel sends SIGKILL to the
child. A pidfd obtained via pidfd_open() for the same process does not
keep the child alive and does not trigger autokill - only the specific
struct file from clone3() has this property.

This is useful for container runtimes, service managers, and sandboxed
subprocess execution - any scenario where the child must die if the
parent crashes or abandons the pidfd.

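For illustration only, a userspace caller might look roughly like the
sketch below. It is not part of this patch. The CLONE_AUTOREAP,
CLONE_PIDFD_AUTOKILL and CLONE_NNP values are copied from this series'
uapi header and are not in released kernel headers; struct
autokill_clone_args, autokill_clone_flags() and spawn_autokill_child()
are hypothetical names. On a kernel without this series clone3() is
expected to fail with EINVAL. Any further constraints imposed by the
other patches in the series are not modeled here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* CLONE_PIDFD is existing uapi; the other three values are taken from
 * this series and are assumptions outside of it. */
#ifndef CLONE_PIDFD
#define CLONE_PIDFD		0x00001000ULL
#endif
#ifndef CLONE_AUTOREAP
#define CLONE_AUTOREAP		0x400000000ULL
#endif
#ifndef CLONE_PIDFD_AUTOKILL
#define CLONE_PIDFD_AUTOKILL	0x800000000ULL
#endif
#ifndef CLONE_NNP
#define CLONE_NNP		0x1000000000ULL
#endif
#ifndef SYS_clone3
#define SYS_clone3 435	/* x86-64 and asm-generic syscall number */
#endif

/* Local mirror of the first 64 bytes of struct clone_args. */
struct autokill_clone_args {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
};

/* Flag set for unprivileged use: CLONE_NNP avoids the CAP_SYS_ADMIN check. */
uint64_t autokill_clone_flags(void)
{
	return CLONE_PIDFD | CLONE_AUTOREAP | CLONE_PIDFD_AUTOKILL | CLONE_NNP;
}

/*
 * Returns 0 in the child, the child's pid in the parent, or -1 with errno
 * set. On success in the parent, *pidfd_out holds the autokill pidfd.
 */
pid_t spawn_autokill_child(int *pidfd_out)
{
	struct autokill_clone_args args;
	int pidfd = -1;
	long ret;

	assert(pidfd_out != NULL);

	memset(&args, 0, sizeof(args));
	args.flags = autokill_clone_flags();
	args.pidfd = (uint64_t)(uintptr_t)&pidfd;	/* kernel stores the fd here */

	ret = syscall(SYS_clone3, &args, sizeof(args));
	if (ret > 0)
		*pidfd_out = pidfd;
	return (pid_t)ret;
}
```

In the parent, closing the returned pidfd (or exiting, which closes it
implicitly) is what would trigger SIGKILL; a dup() of that descriptor
would keep the child alive until it is closed as well.
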
CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD (the whole point is tying
lifetime to the pidfd file) and CLONE_AUTOREAP (a killed child with no one
to reap it would become a zombie). CLONE_THREAD is rejected because
autokill targets a process, not a thread.

The clone3 pidfd is identified by the PIDFD_AUTOKILL file flag set on the
struct file at clone3() time. The pidfs .release handler checks this flag
and sends SIGKILL via do_send_sig_info(SIGKILL, SEND_SIG_PRIV, ...) only
when it is set. Files from pidfd_open() or open_by_handle_at() are
distinct struct files that do not carry this flag. File descriptors
duplicated via dup() or inherited across fork() share the same struct
file, so the child stays alive until the last such reference is dropped.

CLONE_PIDFD_AUTOKILL uses a privilege model based on CLONE_NNP: without
CLONE_NNP the child could escalate privileges via setuid/setgid exec after
being spawned, so the caller must have CAP_SYS_ADMIN in its user
namespace. With CLONE_NNP the child can never gain new privileges, so
unprivileged usage is allowed. This is a deliberate departure from the
pdeath_signal model, which is reset during secureexec and commit_creds(),
rendering it useless for container runtimes that need to deprivilege
themselves.

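The flag rules described above can be summarized as a small userspace
model, which may help when predicting which errno a given flag
combination yields. This is a sketch, not kernel code:
autokill_check_flags() and its has_cap_sys_admin parameter are
hypothetical, the boolean stands in for the ns_capable() check, and the
three new flag values are copied from this series' uapi header.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* CLONE_PIDFD and CLONE_THREAD are existing uapi values; the remaining
 * three come from this series and are assumptions outside of it. */
#ifndef CLONE_PIDFD
#define CLONE_PIDFD		0x00001000ULL
#endif
#ifndef CLONE_THREAD
#define CLONE_THREAD		0x00010000ULL
#endif
#ifndef CLONE_AUTOREAP
#define CLONE_AUTOREAP		0x400000000ULL
#endif
#ifndef CLONE_PIDFD_AUTOKILL
#define CLONE_PIDFD_AUTOKILL	0x800000000ULL
#endif
#ifndef CLONE_NNP
#define CLONE_NNP		0x1000000000ULL
#endif

/*
 * Model of the CLONE_PIDFD_AUTOKILL checks only. Returns 0 if the
 * combination passes, or a negative errno value otherwise.
 */
int autokill_check_flags(uint64_t clone_flags, bool has_cap_sys_admin)
{
	if (!(clone_flags & CLONE_PIDFD_AUTOKILL))
		return 0;		/* nothing to check */
	if (!(clone_flags & CLONE_PIDFD))
		return -EINVAL;		/* no pidfd to tie the lifetime to */
	if (!(clone_flags & CLONE_AUTOREAP))
		return -EINVAL;		/* killed child would linger as a zombie */
	if (clone_flags & CLONE_THREAD)
		return -EINVAL;		/* autokill targets processes only */
	if (!(clone_flags & CLONE_NNP) && !has_cap_sys_admin)
		return -EPERM;		/* child could still gain privileges */
	return 0;
}
```

Other clone3() validation, including whatever CLONE_AUTOREAP and
CLONE_NNP require on their own, is deliberately left out of this model.
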
Signed-off-by: Christian Brauner
---
 fs/pidfs.c                 | 38 ++++++++++++++++++++++++++++++++------
 include/uapi/linux/pidfd.h |  1 +
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c              | 29 ++++++++++++++++++++++++++---
 4 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 318253344b5c..a8d1bca0395d 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -8,6 +8,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
@@ -637,7 +639,28 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	return open_namespace(ns_common);
 }
 
+static int pidfs_file_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = inode->i_private;
+	struct task_struct *task;
+
+	if (!(file->f_flags & PIDFD_AUTOKILL))
+		return 0;
+
+	guard(rcu)();
+	task = pid_task(pid, PIDTYPE_TGID);
+	if (!task)
+		return 0;
+
+	/* Not available for kthreads or user workers for now. */
+	if (WARN_ON_ONCE(task->flags & (PF_KTHREAD | PF_USER_WORKER)))
+		return 0;
+	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_TGID);
+	return 0;
+}
+
 static const struct file_operations pidfs_file_operations = {
+	.release	= pidfs_file_release,
 	.poll		= pidfd_poll,
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= pidfd_show_fdinfo,
@@ -1093,11 +1116,11 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	int ret;
 
 	/*
-	 * Ensure that PIDFD_STALE can be passed as a flag without
-	 * overloading other uapi pidfd flags.
+	 * Ensure that internal pidfd flags don't overlap with each
+	 * other or with uapi pidfd flags.
 	 */
-	BUILD_BUG_ON(PIDFD_STALE == PIDFD_THREAD);
-	BUILD_BUG_ON(PIDFD_STALE == PIDFD_NONBLOCK);
+	BUILD_BUG_ON(hweight32(PIDFD_THREAD | PIDFD_NONBLOCK |
+			       PIDFD_STALE | PIDFD_AUTOKILL) != 4);
 
 	ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
 	if (ret < 0)
@@ -1108,9 +1131,12 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	flags &= ~PIDFD_STALE;
 	flags |= O_RDWR;
 	pidfd_file = dentry_open(&path, flags, current_cred());
-	/* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. */
+	/*
+	 * Raise PIDFD_THREAD and PIDFD_AUTOKILL explicitly as
+	 * do_dentry_open() strips O_EXCL and O_TRUNC.
+	 */
 	if (!IS_ERR(pidfd_file))
-		pidfd_file->f_flags |= (flags & PIDFD_THREAD);
+		pidfd_file->f_flags |= (flags & (PIDFD_THREAD | PIDFD_AUTOKILL));
 
 	return pidfd_file;
 }
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index ea9a6811fc76..9281956a9f32 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -13,6 +13,7 @@
 #ifdef __KERNEL__
 #include
 #define PIDFD_STALE CLONE_PIDFD
+#define PIDFD_AUTOKILL O_TRUNC
 #endif
 
 /* Flags for pidfd_send_signal(). */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 7b1b87473ebb..0aafb4652afc 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -37,6 +37,7 @@
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
 #define CLONE_AUTOREAP 0x400000000ULL /* Auto-reap child on exit. */
+#define CLONE_PIDFD_AUTOKILL 0x800000000ULL /* Kill child when clone pidfd closes. */
 #define CLONE_NNP 0x1000000000ULL /* Set no_new_privs on child. */
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index a3202ee278d8..0f4944ce378d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2042,6 +2042,24 @@ __latent_entropy struct task_struct *copy_process(
 		return ERR_PTR(-EINVAL);
 	}
 
+	if (clone_flags & CLONE_PIDFD_AUTOKILL) {
+		if (!(clone_flags & CLONE_PIDFD))
+			return ERR_PTR(-EINVAL);
+		if (!(clone_flags & CLONE_AUTOREAP))
+			return ERR_PTR(-EINVAL);
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+		/*
+		 * Without CLONE_NNP the child could escalate privileges
+		 * after being spawned, so require CAP_SYS_ADMIN.
+		 * With CLONE_NNP the child can't gain new privileges,
+		 * so allow unprivileged usage.
+		 */
+		if (!(clone_flags & CLONE_NNP) &&
+		    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+			return ERR_PTR(-EPERM);
+	}
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens. Collect up signals sent to multiple
@@ -2264,13 +2282,18 @@ __latent_entropy struct task_struct *copy_process(
 	 * if the fd table isn't shared).
 	 */
 	if (clone_flags & CLONE_PIDFD) {
-		int flags = (clone_flags & CLONE_THREAD) ? PIDFD_THREAD : 0;
+		unsigned flags = PIDFD_STALE;
+
+		if (clone_flags & CLONE_THREAD)
+			flags |= PIDFD_THREAD;
+		if (clone_flags & CLONE_PIDFD_AUTOKILL)
+			flags |= PIDFD_AUTOKILL;
 
 		/*
 		 * Note that no task has been attached to @pid yet indicate
 		 * that via CLONE_PIDFD.
 		 */
-		retval = pidfd_prepare(pid, flags | PIDFD_STALE, &pidfile);
+		retval = pidfd_prepare(pid, flags, &pidfile);
 		if (retval < 0)
 			goto bad_fork_free_pid;
 		pidfd = retval;
@@ -2917,7 +2940,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
 	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
-	      CLONE_AUTOREAP | CLONE_NNP))
+	      CLONE_AUTOREAP | CLONE_NNP | CLONE_PIDFD_AUTOKILL))
 		return false;
 
 	/*
-- 
2.47.3