From nobody Tue Apr 7 15:27:24 2026
From: Christian Brauner
Date: Thu, 26 Feb 2026 14:50:59 +0100
Subject: [PATCH v5 1/6] clone: add CLONE_AUTOREAP
Message-Id: <20260226-work-pidfs-autoreap-v5-1-d148b984a989@kernel.org>
References: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
In-Reply-To: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
To: Oleg Nesterov, Jann Horn
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Christian Brauner
X-Mailing-List: linux-kernel@vger.kernel.org

Add a new clone3() flag CLONE_AUTOREAP that makes a child process
auto-reap on exit without ever becoming a zombie. This is a per-process
property in contrast to the existing auto-reap mechanism via
SA_NOCLDWAIT or SIG_IGN for SIGCHLD which applies to all children of a
given parent.

Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children
while still being able to wait() on others.
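
For illustration only, the existing parent-scoped mechanism can be
observed from userspace with a small sketch like the one below. The
helper name nocldwait_wait_errno() is hypothetical and not part of this
series; it only uses long-standing POSIX/Linux calls. Once SA_NOCLDWAIT
is set for SIGCHLD, every child of the caller is reaped by the kernel,
so a later waitpid() on a specific child is expected to fail with ECHILD.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): enable SA_NOCLDWAIT for
 * SIGCHLD, fork one child that exits immediately, then try to wait for
 * it. Returns the errno reported by waitpid(), 0 if waitpid()
 * unexpectedly succeeded, or -1 if the setup itself failed.
 */
int nocldwait_wait_errno(void)
{
	struct sigaction sa, old;
	pid_t pid;
	int err = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sa.sa_flags = SA_NOCLDWAIT;	/* applies to ALL children of this process */
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, &old) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		sigaction(SIGCHLD, &old, NULL);
		return -1;
	}
	if (pid == 0)
		_exit(0);

	/*
	 * The child never becomes a zombie, so waitpid() blocks until it
	 * is gone and then fails because there is nothing left to reap.
	 */
	if (waitpid(pid, NULL, 0) < 0)
		err = errno;

	/* Restore the previous disposition so other children stay waitable. */
	sigaction(SIGCHLD, &old, NULL);
	return err;
}
```

The need to save and restore the SIGCHLD disposition around the call is
the awkward part for a library: while SA_NOCLDWAIT is in effect, any
unrelated child of the same process is auto-reaped as well.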

CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and causes
exit_notify() to transition the task directly to EXIT_DEAD. Since the
flag lives on the child it survives reparenting: if the original parent
exits and the child is reparented to a subreaper or init the child
still auto-reaps when it eventually exits.

CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern where the parent simply doesn't care about the child's exit
status. No exit signal is delivered so exit_signal must be zero.

CLONE_AUTOREAP is rejected in combination with CLONE_PARENT. If a
CLONE_AUTOREAP child were to clone(CLONE_PARENT) the new grandchild
would inherit exit_signal == 0 from the autoreap parent's group leader
but without signal->autoreap. This grandchild would become a zombie
that never sends a signal and is never autoreaped - confusing and
arguably broken behavior.

The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.
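
The argument rules described above can be summarised as a small
userspace model. This is only a sketch that mirrors the checks this
patch adds to copy_process(); it is not kernel code. The function name
autoreap_args_check() and the X_ prefixed macros are made up for the
example (the prefix avoids clashing with system headers); the flag
values are the ones from the uapi header and from this patch.

```c
#include <errno.h>
#include <stdint.h>

/* Existing uapi values for CLONE_PARENT and CLONE_THREAD. */
#define X_CLONE_PARENT   0x00008000ULL
#define X_CLONE_THREAD   0x00010000ULL
/* Value proposed for CLONE_AUTOREAP by this patch. */
#define X_CLONE_AUTOREAP 0x400000000ULL

/*
 * Hypothetical model of the CLONE_AUTOREAP validation: returns 0 if the
 * combination would pass these checks, -EINVAL otherwise. Calls without
 * CLONE_AUTOREAP are not affected by these checks.
 */
int autoreap_args_check(uint64_t flags, uint64_t exit_signal)
{
	if (!(flags & X_CLONE_AUTOREAP))
		return 0;
	if (flags & X_CLONE_THREAD)	/* autoreap is a per-process property */
		return -EINVAL;
	if (flags & X_CLONE_PARENT)	/* avoids the never-reaped zombie case */
		return -EINVAL;
	if (exit_signal)		/* no exit signal is ever delivered */
		return -EINVAL;
	return 0;
}
```

Note that the model only covers the checks introduced here; the kernel
performs many other flag validations that are left out.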

Link: https://github.com/uapi-group/kernel-features/issues/45
Signed-off-by: Christian Brauner
Reviewed-by: Oleg Nesterov
---
 include/linux/sched/signal.h |  1 +
 include/uapi/linux/sched.h   |  1 +
 kernel/fork.c                | 14 +++++++++++++-
 kernel/ptrace.c              |  3 ++-
 kernel/signal.c              |  4 ++++
 5 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index a22248aebcf9..f842c86b806f 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -132,6 +132,7 @@ struct signal_struct {
 	 */
 	unsigned int is_child_subreaper:1;
 	unsigned int has_child_subreaper:1;
+	unsigned int autoreap:1;
 
 #ifdef CONFIG_POSIX_TIMERS
 
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 359a14cc76a4..8a22ea640817 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -36,6 +36,7 @@
 /* Flags for the clone3() syscall. */
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_AUTOREAP 0x400000000ULL /* Auto-reap child on exit. */
 
 /*
  * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
diff --git a/kernel/fork.c b/kernel/fork.c
index e832da9d15a4..0dedf2999f0c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2028,6 +2028,15 @@ __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	if (clone_flags & CLONE_AUTOREAP) {
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+		if (clone_flags & CLONE_PARENT)
+			return ERR_PTR(-EINVAL);
+		if (args->exit_signal)
+			return ERR_PTR(-EINVAL);
+	}
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
@@ -2435,6 +2444,8 @@ __latent_entropy struct task_struct *copy_process(
 			 */
 			p->signal->has_child_subreaper = p->real_parent->signal->has_child_subreaper ||
 							 p->real_parent->signal->is_child_subreaper;
+			if (clone_flags & CLONE_AUTOREAP)
+				p->signal->autoreap = 1;
 			list_add_tail(&p->sibling, &p->real_parent->children);
 			list_add_tail_rcu(&p->tasks, &init_task.tasks);
 			attach_pid(p, PIDTYPE_TGID);
@@ -2897,7 +2908,8 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
-	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP))
+	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
+	      CLONE_AUTOREAP))
 		return false;
 
 	/*
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 392ec2f75f01..68c17daef8d4 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -549,7 +549,8 @@ static bool __ptrace_detach(struct task_struct *tracer, struct task_struct *p)
 	if (!dead && thread_group_empty(p)) {
 		if (!same_thread_group(p->real_parent, tracer))
 			dead = do_notify_parent(p, p->exit_signal);
-		else if (ignoring_children(tracer->sighand)) {
+		else if (ignoring_children(tracer->sighand) ||
+			 p->signal->autoreap) {
 			__wake_up_parent(p, tracer);
 			dead = true;
 		}
diff --git a/kernel/signal.c b/kernel/signal.c
index d65d0fe24bfb..e61f39fa8c8a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2251,6 +2251,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 		if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
 			sig = 0;
 	}
+	if (!tsk->ptrace && tsk->signal->autoreap) {
+		autoreap = true;
+		sig = 0;
+	}
 	/*
 	 * Send with __send_signal as si_pid and si_uid are in the
 	 * parent's namespaces.
-- 
2.47.3

From nobody Tue Apr 7 15:27:24 2026
From: Christian Brauner
Date: Thu, 26 Feb 2026 14:51:00 +0100
Subject: [PATCH v5 2/6] clone: add CLONE_NNP
Message-Id: <20260226-work-pidfs-autoreap-v5-2-d148b984a989@kernel.org>
References: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
In-Reply-To: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
To: Oleg Nesterov, Jann Horn
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Christian Brauner

Add a new clone3() flag CLONE_NNP that sets no_new_privs on the child
process at clone time. This is analogous to prctl(PR_SET_NO_NEW_PRIVS)
but applied at process creation rather than requiring a separate step
after the child starts running.

CLONE_NNP is rejected with CLONE_THREAD. It's conceptually a lot
simpler if the whole thread-group is forced into NNP than to have
single threads running around with NNP.
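
For comparison, the "separate step" that CLONE_NNP is meant to fold
into clone3() looks roughly like this today. The sketch uses only the
existing prctl() interface; the helper name child_nnp_after_prctl() is
hypothetical. The child sets no_new_privs itself after it has started
running and reports back what PR_GET_NO_NEW_PRIVS returns.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): fork a child that sets
 * no_new_privs via prctl() and exits with 1 if the bit then reads back
 * as set, 0 if not, 2 if setting it failed. Returns the child's exit
 * status, or -1 on fork/wait failure.
 */
int child_nnp_after_prctl(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* The extra step performed by the child after creation. */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
			_exit(2);
		/* PR_GET_NO_NEW_PRIVS returns 1 once the bit is set. */
		_exit(prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1 ? 1 : 0);
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

With the prctl() approach there is a window in which the child runs
without no_new_privs, and the parent has to trust the child to make the
call; setting the bit at clone time removes both concerns.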

Signed-off-by: Christian Brauner
Reviewed-by: Oleg Nesterov
---
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c              | 10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 8a22ea640817..7b1b87473ebb 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -37,6 +37,7 @@
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
 #define CLONE_AUTOREAP 0x400000000ULL /* Auto-reap child on exit. */
+#define CLONE_NNP 0x1000000000ULL /* Set no_new_privs on child. */
 
 /*
  * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
diff --git a/kernel/fork.c b/kernel/fork.c
index 0dedf2999f0c..a3202ee278d8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2037,6 +2037,11 @@ __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	if (clone_flags & CLONE_NNP) {
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+	}
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
@@ -2421,6 +2426,9 @@ __latent_entropy struct task_struct *copy_process(
 	 */
 	copy_seccomp(p);
 
+	if (clone_flags & CLONE_NNP)
+		task_set_no_new_privs(p);
+
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
 		ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
@@ -2909,7 +2917,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
 	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
-	      CLONE_AUTOREAP))
+	      CLONE_AUTOREAP | CLONE_NNP))
 		return false;
 
 	/*
-- 
2.47.3

From nobody Tue Apr 7 15:27:24 2026
From: Christian Brauner
Date: Thu, 26 Feb 2026 14:51:01 +0100
Subject: [PATCH v5 3/6] pidfd: add CLONE_PIDFD_AUTOKILL
Message-Id: <20260226-work-pidfs-autoreap-v5-3-d148b984a989@kernel.org>
References: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
In-Reply-To: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
To: Oleg Nesterov, Jann Horn
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Christian Brauner

Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
lifetime to the pidfd returned from clone3(). When the last reference
to the struct file created by clone3() is closed the kernel sends
SIGKILL to the child. A pidfd obtained via pidfd_open() for the same
process does not keep the child alive and does not trigger autokill -
only the specific struct file from clone3() has this property.

This is useful for container runtimes, service managers, and sandboxed
subprocess execution - any scenario where the child must die if the
parent crashes or abandons the pidfd.

CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD (the whole point is
tying lifetime to the pidfd file) and CLONE_AUTOREAP (a killed child
with no one to reap it would become a zombie). CLONE_THREAD is rejected
because autokill targets a process not a thread.

The clone3 pidfd is identified by the PIDFD_AUTOKILL file flag set on
the struct file at clone3() time. The pidfs .release handler checks
this flag and sends SIGKILL via do_send_sig_info(SIGKILL,
SEND_SIG_PRIV, ...) only when it is set. Files from pidfd_open() or
open_by_handle_at() are distinct struct files that do not carry this
flag. dup()/fork() share the same struct file so they extend the
child's lifetime until the last reference drops.

CLONE_PIDFD_AUTOKILL uses a privilege model based on CLONE_NNP: without
CLONE_NNP the child could escalate privileges via setuid/setgid exec
after being spawned, so the caller must have CAP_SYS_ADMIN in its user
namespace. With CLONE_NNP the child can never gain new privileges so
unprivileged usage is allowed. This is a deliberate departure from the
pdeath_signal model which is reset during secureexec and commit_creds()
rendering it useless for container runtimes that need to deprivilege
themselves.
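
The flag requirements and the privilege model can be summarised as a
userspace sketch. It mirrors the checks this patch adds to
copy_process() but is not kernel code: autokill_flags_check() and the
X_ prefixed macros are hypothetical names, and the has_cap_sys_admin
parameter stands in for the kernel's
ns_capable(current_user_ns(), CAP_SYS_ADMIN) test.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing uapi values for CLONE_PIDFD and CLONE_THREAD. */
#define X_CLONE_PIDFD          0x00001000ULL
#define X_CLONE_THREAD         0x00010000ULL
/* Values proposed by this series. */
#define X_CLONE_AUTOREAP       0x400000000ULL
#define X_CLONE_PIDFD_AUTOKILL 0x800000000ULL
#define X_CLONE_NNP            0x1000000000ULL

/*
 * Hypothetical model of the CLONE_PIDFD_AUTOKILL validation: returns 0
 * if the request would pass these checks, -EINVAL for an invalid flag
 * combination, and -EPERM if the caller lacks the required privilege.
 */
int autokill_flags_check(uint64_t flags, bool has_cap_sys_admin)
{
	if (!(flags & X_CLONE_PIDFD_AUTOKILL))
		return 0;
	if (!(flags & X_CLONE_PIDFD))		/* lifetime is tied to the pidfd */
		return -EINVAL;
	if (!(flags & X_CLONE_AUTOREAP))	/* a killed child must not linger */
		return -EINVAL;
	if (flags & X_CLONE_THREAD)		/* targets a process, not a thread */
		return -EINVAL;
	/* Without CLONE_NNP the child could still gain privileges via exec. */
	if (!(flags & X_CLONE_NNP) && !has_cap_sys_admin)
		return -EPERM;
	return 0;
}
```

Under this model an unprivileged caller has to pass CLONE_PIDFD,
CLONE_AUTOREAP, CLONE_PIDFD_AUTOKILL and CLONE_NNP together, while a
caller with CAP_SYS_ADMIN may omit CLONE_NNP.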

Signed-off-by: Christian Brauner
Reviewed-by: Oleg Nesterov
---
 fs/pidfs.c                 | 38 ++++++++++++++++++++++++++++++++------
 include/uapi/linux/pidfd.h |  1 +
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c              | 29 ++++++++++++++++++++++++++---
 4 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 318253344b5c..a8d1bca0395d 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -8,6 +8,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
@@ -637,7 +639,28 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	return open_namespace(ns_common);
 }
 
+static int pidfs_file_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = inode->i_private;
+	struct task_struct *task;
+
+	if (!(file->f_flags & PIDFD_AUTOKILL))
+		return 0;
+
+	guard(rcu)();
+	task = pid_task(pid, PIDTYPE_TGID);
+	if (!task)
+		return 0;
+
+	/* Not available for kthreads or user workers for now. */
+	if (WARN_ON_ONCE(task->flags & (PF_KTHREAD | PF_USER_WORKER)))
+		return 0;
+	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_TGID);
+	return 0;
+}
+
 static const struct file_operations pidfs_file_operations = {
+	.release = pidfs_file_release,
 	.poll = pidfd_poll,
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo = pidfd_show_fdinfo,
@@ -1093,11 +1116,11 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	int ret;
 
 	/*
-	 * Ensure that PIDFD_STALE can be passed as a flag without
-	 * overloading other uapi pidfd flags.
+	 * Ensure that internal pidfd flags don't overlap with each
+	 * other or with uapi pidfd flags.
 	 */
-	BUILD_BUG_ON(PIDFD_STALE == PIDFD_THREAD);
-	BUILD_BUG_ON(PIDFD_STALE == PIDFD_NONBLOCK);
+	BUILD_BUG_ON(hweight32(PIDFD_THREAD | PIDFD_NONBLOCK |
+			       PIDFD_STALE | PIDFD_AUTOKILL) != 4);
 
 	ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
 	if (ret < 0)
@@ -1108,9 +1131,12 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	flags &= ~PIDFD_STALE;
 	flags |= O_RDWR;
 	pidfd_file = dentry_open(&path, flags, current_cred());
-	/* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. */
+	/*
+	 * Raise PIDFD_THREAD and PIDFD_AUTOKILL explicitly as
+	 * do_dentry_open() strips O_EXCL and O_TRUNC.
+	 */
 	if (!IS_ERR(pidfd_file))
-		pidfd_file->f_flags |= (flags & PIDFD_THREAD);
+		pidfd_file->f_flags |= (flags & (PIDFD_THREAD | PIDFD_AUTOKILL));
 
 	return pidfd_file;
 }
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index ea9a6811fc76..9281956a9f32 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -13,6 +13,7 @@
 #ifdef __KERNEL__
 #include
 #define PIDFD_STALE CLONE_PIDFD
+#define PIDFD_AUTOKILL O_TRUNC
 #endif
 
 /* Flags for pidfd_send_signal(). */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 7b1b87473ebb..0aafb4652afc 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -37,6 +37,7 @@
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
 #define CLONE_AUTOREAP 0x400000000ULL /* Auto-reap child on exit. */
+#define CLONE_PIDFD_AUTOKILL 0x800000000ULL /* Kill child when clone pidfd closes. */
 #define CLONE_NNP 0x1000000000ULL /* Set no_new_privs on child. */
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index a3202ee278d8..0f4944ce378d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2042,6 +2042,24 @@ __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	if (clone_flags & CLONE_PIDFD_AUTOKILL) {
+		if (!(clone_flags & CLONE_PIDFD))
+			return ERR_PTR(-EINVAL);
+		if (!(clone_flags & CLONE_AUTOREAP))
+			return ERR_PTR(-EINVAL);
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+		/*
+		 * Without CLONE_NNP the child could escalate privileges
+		 * after being spawned, so require CAP_SYS_ADMIN.
+		 * With CLONE_NNP the child can't gain new privileges,
+		 * so allow unprivileged usage.
+		 */
+		if (!(clone_flags & CLONE_NNP) &&
+		    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+			return ERR_PTR(-EPERM);
+	}
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
@@ -2264,13 +2282,18 @@ __latent_entropy struct task_struct *copy_process(
 	 * if the fd table isn't shared).
 	 */
 	if (clone_flags & CLONE_PIDFD) {
-		int flags = (clone_flags & CLONE_THREAD) ? PIDFD_THREAD : 0;
+		unsigned flags = PIDFD_STALE;
+
+		if (clone_flags & CLONE_THREAD)
+			flags |= PIDFD_THREAD;
+		if (clone_flags & CLONE_PIDFD_AUTOKILL)
+			flags |= PIDFD_AUTOKILL;
 
 		/*
 		 * Note that no task has been attached to @pid yet indicate
 		 * that via CLONE_PIDFD.
 		 */
-		retval = pidfd_prepare(pid, flags | PIDFD_STALE, &pidfile);
+		retval = pidfd_prepare(pid, flags, &pidfile);
 		if (retval < 0)
 			goto bad_fork_free_pid;
 		pidfd = retval;
@@ -2917,7 +2940,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
 	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
-	      CLONE_AUTOREAP | CLONE_NNP))
+	      CLONE_AUTOREAP | CLONE_NNP | CLONE_PIDFD_AUTOKILL))
 		return false;
 
 	/*
-- 
2.47.3

From nobody Tue Apr 7 15:27:24 2026
From: Christian Brauner
Date: Thu, 26 Feb 2026 14:51:02 +0100
Subject: [PATCH v5 4/6] selftests/pidfd: add CLONE_AUTOREAP tests
Message-Id: <20260226-work-pidfs-autoreap-v5-4-d148b984a989@kernel.org>
References: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
In-Reply-To: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
To: Oleg Nesterov, Jann Horn
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Christian Brauner

Add tests for the new CLONE_AUTOREAP clone3() flag:

- autoreap_without_pidfd: CLONE_AUTOREAP without CLONE_PIDFD works
  (fire-and-forget)
- autoreap_rejects_exit_signal: CLONE_AUTOREAP with non-zero
  exit_signal fails
- autoreap_rejects_parent: CLONE_AUTOREAP with CLONE_PARENT fails
- autoreap_rejects_thread: CLONE_AUTOREAP with CLONE_THREAD fails
- autoreap_basic: child exits, pidfd poll works, PIDFD_GET_INFO
  returns correct exit code, waitpid() fails with ECHILD
- autoreap_signaled: child killed by signal, exit info correct via
  pidfd
- autoreap_reparent: autoreap grandchild reparented to subreaper still
  auto-reaps
- autoreap_multithreaded: autoreap process with sub-threads auto-reaps
  after last thread exits
- autoreap_no_inherit: grandchild forked without CLONE_AUTOREAP becomes
  a regular zombie

Signed-off-by: Christian Brauner
Reviewed-by: Oleg Nesterov
---
 tools/testing/selftests/pidfd/.gitignore           |   1 +
 tools/testing/selftests/pidfd/Makefile             |   2 +-
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 496 +++++++++++++++++++++
 3 files changed, 498 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index 144e7ff65d6a..4cd8ec7fd349 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -12,3 +12,4 @@ pidfd_info_test
 pidfd_exec_helper
 pidfd_xattr_test
 pidfd_setattr_test
+pidfd_autoreap_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index 764a8f9ecefa..4211f91e9af8 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -4,7 +4,7 @@ CFLAGS += -g $(KHDR_INCLUDES) $(TOOLS_INCLUDES) -pthread -Wall
 TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
 	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test \
 	pidfd_file_handle_test pidfd_bind_mount pidfd_info_test \
-	pidfd_xattr_test pidfd_setattr_test
+	pidfd_xattr_test pidfd_setattr_test pidfd_autoreap_test
 
 TEST_GEN_PROGS_EXTENDED := pidfd_exec_helper
 
diff --git a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
new file mode 100644
index 000000000000..e230d2fe4a64
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
@@ -0,0 +1,496 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2026 Christian Brauner
+
+#define _GNU_SOURCE
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "pidfd.h"
+#include "kselftest_harness.h"
+
+#ifndef CLONE_AUTOREAP
+#define CLONE_AUTOREAP 0x400000000ULL
+#endif
+
+static pid_t create_autoreap_child(int *pidfd)
+{
+	struct __clone_args args = {
+		.flags = CLONE_PIDFD | CLONE_AUTOREAP,
+		.exit_signal = 0,
+		.pidfd = ptr_to_u64(pidfd),
+	};
+
+	return sys_clone3(&args, sizeof(args));
+}
+
+/*
+ * Test that CLONE_AUTOREAP works without CLONE_PIDFD (fire-and-forget).
+ */
+TEST(autoreap_without_pidfd)
+{
+	struct __clone_args args = {
+		.flags = CLONE_AUTOREAP,
+		.exit_signal = 0,
+	};
+	pid_t pid;
+	int ret;
+
+	pid = sys_clone3(&args, sizeof(args));
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		_exit(0);
+
+	/*
+	 * Give the child a moment to exit and be autoreaped.
+	 * Then verify no zombie remains.
+	 */
+	usleep(200000);
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+}
+
+/*
+ * Test that CLONE_AUTOREAP with a non-zero exit_signal fails.
+ */
+TEST(autoreap_rejects_exit_signal)
+{
+	struct __clone_args args = {
+		.flags = CLONE_AUTOREAP,
+		.exit_signal = SIGCHLD,
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test that CLONE_AUTOREAP with CLONE_PARENT fails.
+ */
+TEST(autoreap_rejects_parent)
+{
+	struct __clone_args args = {
+		.flags = CLONE_AUTOREAP | CLONE_PARENT,
+		.exit_signal = 0,
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test that CLONE_AUTOREAP with CLONE_THREAD fails.
+ */ +TEST(autoreap_rejects_thread) +{ + struct __clone_args args =3D { + .flags =3D CLONE_AUTOREAP | CLONE_THREAD | + CLONE_SIGHAND | CLONE_VM, + .exit_signal =3D 0, + }; + pid_t pid; + + pid =3D sys_clone3(&args, sizeof(args)); + ASSERT_EQ(pid, -1); + ASSERT_EQ(errno, EINVAL); +} + +/* + * Basic test: create an autoreap child, let it exit, verify: + * - pidfd becomes readable (poll returns POLLIN) + * - PIDFD_GET_INFO returns the correct exit code + * - waitpid() returns -1/ECHILD (no zombie) + */ +TEST(autoreap_basic) +{ + struct pidfd_info info =3D { .mask =3D PIDFD_INFO_EXIT }; + int pidfd =3D -1, ret; + struct pollfd pfd; + pid_t pid; + + pid =3D create_autoreap_child(&pidfd); + if (pid < 0 && errno =3D=3D EINVAL) + SKIP(return, "CLONE_AUTOREAP not supported"); + ASSERT_GE(pid, 0); + + if (pid =3D=3D 0) + _exit(42); + + ASSERT_GE(pidfd, 0); + + /* Wait for the child to exit via pidfd poll. */ + pfd.fd =3D pidfd; + pfd.events =3D POLLIN; + ret =3D poll(&pfd, 1, 5000); + ASSERT_EQ(ret, 1); + ASSERT_TRUE(pfd.revents & POLLIN); + + /* Verify exit info via PIDFD_GET_INFO. */ + ret =3D ioctl(pidfd, PIDFD_GET_INFO, &info); + ASSERT_EQ(ret, 0); + ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT); + /* + * exit_code is in waitpid format: for _exit(42), + * WIFEXITED is true and WEXITSTATUS is 42. + */ + ASSERT_TRUE(WIFEXITED(info.exit_code)); + ASSERT_EQ(WEXITSTATUS(info.exit_code), 42); + + /* Verify no zombie: waitpid should fail with ECHILD. */ + ret =3D waitpid(pid, NULL, WNOHANG); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ECHILD); + + close(pidfd); +} + +/* + * Test that an autoreap child killed by a signal reports + * the correct exit info. 
+ */ +TEST(autoreap_signaled) +{ + struct pidfd_info info =3D { .mask =3D PIDFD_INFO_EXIT }; + int pidfd =3D -1, ret; + struct pollfd pfd; + pid_t pid; + + pid =3D create_autoreap_child(&pidfd); + if (pid < 0 && errno =3D=3D EINVAL) + SKIP(return, "CLONE_AUTOREAP not supported"); + ASSERT_GE(pid, 0); + + if (pid =3D=3D 0) { + pause(); + _exit(1); + } + + ASSERT_GE(pidfd, 0); + + /* Kill the child. */ + ret =3D sys_pidfd_send_signal(pidfd, SIGKILL, NULL, 0); + ASSERT_EQ(ret, 0); + + /* Wait for exit via pidfd. */ + pfd.fd =3D pidfd; + pfd.events =3D POLLIN; + ret =3D poll(&pfd, 1, 5000); + ASSERT_EQ(ret, 1); + ASSERT_TRUE(pfd.revents & POLLIN); + + /* Verify signal info. */ + ret =3D ioctl(pidfd, PIDFD_GET_INFO, &info); + ASSERT_EQ(ret, 0); + ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT); + ASSERT_TRUE(WIFSIGNALED(info.exit_code)); + ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL); + + /* No zombie. */ + ret =3D waitpid(pid, NULL, WNOHANG); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ECHILD); + + close(pidfd); +} + +/* + * Test autoreap survives reparenting: middle process creates an + * autoreap grandchild, then exits. The grandchild gets reparented + * to us (the grandparent, which is a subreaper). When the grandchild + * exits, it should still be autoreaped - no zombie under us. + */ +TEST(autoreap_reparent) +{ + int ipc_sockets[2], ret; + int pidfd =3D -1; + struct pollfd pfd; + pid_t mid_pid, grandchild_pid; + char buf[32] =3D {}; + + /* Make ourselves a subreaper so reparented children come to us. */ + ret =3D prctl(PR_SET_CHILD_SUBREAPER, 1); + ASSERT_EQ(ret, 0); + + ret =3D socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets); + ASSERT_EQ(ret, 0); + + mid_pid =3D fork(); + ASSERT_GE(mid_pid, 0); + + if (mid_pid =3D=3D 0) { + /* Middle child: create an autoreap grandchild. 
*/ + int gc_pidfd =3D -1; + + close(ipc_sockets[0]); + + grandchild_pid =3D create_autoreap_child(&gc_pidfd); + if (grandchild_pid < 0) { + write_nointr(ipc_sockets[1], "E", 1); + close(ipc_sockets[1]); + _exit(1); + } + + if (grandchild_pid =3D=3D 0) { + /* Grandchild: wait for signal to exit. */ + close(ipc_sockets[1]); + if (gc_pidfd >=3D 0) + close(gc_pidfd); + pause(); + _exit(0); + } + + /* Send grandchild PID to grandparent. */ + snprintf(buf, sizeof(buf), "%d", grandchild_pid); + write_nointr(ipc_sockets[1], buf, strlen(buf)); + close(ipc_sockets[1]); + if (gc_pidfd >=3D 0) + close(gc_pidfd); + + /* Middle child exits, grandchild gets reparented. */ + _exit(0); + } + + close(ipc_sockets[1]); + + /* Read grandchild's PID. */ + ret =3D read_nointr(ipc_sockets[0], buf, sizeof(buf) - 1); + close(ipc_sockets[0]); + ASSERT_GT(ret, 0); + + if (buf[0] =3D=3D 'E') { + waitpid(mid_pid, NULL, 0); + prctl(PR_SET_CHILD_SUBREAPER, 0); + SKIP(return, "CLONE_AUTOREAP not supported"); + } + + grandchild_pid =3D atoi(buf); + ASSERT_GT(grandchild_pid, 0); + + /* Wait for the middle child to exit. */ + ret =3D waitpid(mid_pid, NULL, 0); + ASSERT_EQ(ret, mid_pid); + + /* + * Now the grandchild is reparented to us (subreaper). + * Open a pidfd for the grandchild and kill it. + */ + pidfd =3D sys_pidfd_open(grandchild_pid, 0); + ASSERT_GE(pidfd, 0); + + ret =3D sys_pidfd_send_signal(pidfd, SIGKILL, NULL, 0); + ASSERT_EQ(ret, 0); + + /* Wait for it to exit via pidfd poll. */ + pfd.fd =3D pidfd; + pfd.events =3D POLLIN; + ret =3D poll(&pfd, 1, 5000); + ASSERT_EQ(ret, 1); + ASSERT_TRUE(pfd.revents & POLLIN); + + /* + * The grandchild should have been autoreaped even though + * we (the new parent) haven't set SA_NOCLDWAIT. + * waitpid should return -1/ECHILD. + */ + ret =3D waitpid(grandchild_pid, NULL, WNOHANG); + EXPECT_EQ(ret, -1); + EXPECT_EQ(errno, ECHILD); + + close(pidfd); + + /* Clean up subreaper status. 
*/ + prctl(PR_SET_CHILD_SUBREAPER, 0); +} + +static int thread_sock_fd; + +static void *thread_func(void *arg) +{ + /* Signal parent we're running. */ + write_nointr(thread_sock_fd, "1", 1); + + /* Give main thread time to call _exit() first. */ + usleep(200000); + + return NULL; +} + +/* + * Test that an autoreap child with multiple threads is properly + * autoreaped only after all threads have exited. + */ +TEST(autoreap_multithreaded) +{ + struct pidfd_info info =3D { .mask =3D PIDFD_INFO_EXIT }; + int ipc_sockets[2], ret; + int pidfd =3D -1; + struct pollfd pfd; + pid_t pid; + char c; + + ret =3D socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets); + ASSERT_EQ(ret, 0); + + pid =3D create_autoreap_child(&pidfd); + if (pid < 0 && errno =3D=3D EINVAL) { + close(ipc_sockets[0]); + close(ipc_sockets[1]); + SKIP(return, "CLONE_AUTOREAP not supported"); + } + ASSERT_GE(pid, 0); + + if (pid =3D=3D 0) { + pthread_t thread; + + close(ipc_sockets[0]); + + /* + * Create a sub-thread that outlives the main thread. + * The thread signals readiness, then sleeps. + * The main thread waits briefly, then calls _exit(). + */ + thread_sock_fd =3D ipc_sockets[1]; + pthread_create(&thread, NULL, thread_func, NULL); + pthread_detach(thread); + + /* Wait for thread to be running. */ + usleep(100000); + + /* Main thread exits; sub-thread is still alive. */ + _exit(99); + } + + close(ipc_sockets[1]); + + /* Wait for the sub-thread to signal readiness. */ + ret =3D read_nointr(ipc_sockets[0], &c, 1); + close(ipc_sockets[0]); + ASSERT_EQ(ret, 1); + + /* Wait for the process to fully exit via pidfd poll. */ + pfd.fd =3D pidfd; + pfd.events =3D POLLIN; + ret =3D poll(&pfd, 1, 5000); + ASSERT_EQ(ret, 1); + ASSERT_TRUE(pfd.revents & POLLIN); + + /* Verify exit info. 
*/ + ret =3D ioctl(pidfd, PIDFD_GET_INFO, &info); + ASSERT_EQ(ret, 0); + ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT); + ASSERT_TRUE(WIFEXITED(info.exit_code)); + ASSERT_EQ(WEXITSTATUS(info.exit_code), 99); + + /* No zombie. */ + ret =3D waitpid(pid, NULL, WNOHANG); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ECHILD); + + close(pidfd); +} + +/* + * Test that autoreap is NOT inherited by grandchildren. + */ +TEST(autoreap_no_inherit) +{ + int ipc_sockets[2], ret; + int pidfd =3D -1; + pid_t pid; + char buf[2] =3D {}; + struct pollfd pfd; + + ret =3D socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets); + ASSERT_EQ(ret, 0); + + pid =3D create_autoreap_child(&pidfd); + if (pid < 0 && errno =3D=3D EINVAL) { + close(ipc_sockets[0]); + close(ipc_sockets[1]); + SKIP(return, "CLONE_AUTOREAP not supported"); + } + ASSERT_GE(pid, 0); + + if (pid =3D=3D 0) { + pid_t gc; + int status; + + close(ipc_sockets[0]); + + /* Autoreap child forks a grandchild (without autoreap). */ + gc =3D fork(); + if (gc < 0) { + write_nointr(ipc_sockets[1], "E", 1); + _exit(1); + } + if (gc =3D=3D 0) { + /* Grandchild: exit immediately. */ + close(ipc_sockets[1]); + _exit(77); + } + + /* + * The grandchild should become a regular zombie + * since it was NOT created with CLONE_AUTOREAP. + * Wait for it to verify. + */ + ret =3D waitpid(gc, &status, 0); + if (ret =3D=3D gc && WIFEXITED(status) && + WEXITSTATUS(status) =3D=3D 77) { + write_nointr(ipc_sockets[1], "P", 1); + } else { + write_nointr(ipc_sockets[1], "F", 1); + } + close(ipc_sockets[1]); + _exit(0); + } + + close(ipc_sockets[1]); + + ret =3D read_nointr(ipc_sockets[0], buf, 1); + close(ipc_sockets[0]); + ASSERT_EQ(ret, 1); + + /* + * 'P' means the autoreap child was able to waitpid() its + * grandchild (correct - grandchild should be a normal zombie, + * not autoreaped). + */ + ASSERT_EQ(buf[0], 'P'); + + /* Wait for the autoreap child to exit. 
*/ + pfd.fd =3D pidfd; + pfd.events =3D POLLIN; + ret =3D poll(&pfd, 1, 5000); + ASSERT_EQ(ret, 1); + + /* Autoreap child itself should be autoreaped. */ + ret =3D waitpid(pid, NULL, WNOHANG); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ECHILD); + + close(pidfd); +} + +TEST_HARNESS_MAIN --=20 2.47.3 From nobody Tue Apr 7 15:27:24 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8360C3A7849; Thu, 26 Feb 2026 13:51:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1772113881; cv=none; b=Gvl6jubPHWCjsACv49wkDp+DlpyxGLprX2PMfP519GZcuCo4qxNX+omppeGSK6fbg3laXMkD7ZnNh4r5wBj6O7KFbSndrD8XdutGIebAlBrmXLRHQtCj8bPFcQNtQoZx0DNpuGRpLCtcJbf/dm9TmtXo92DF656sctc6Y/r2vaE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1772113881; c=relaxed/simple; bh=XMWruOYLkxF4dZAR02/HIKIt+BPClxko2uWALy0QcoI=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=brjaYyX6JVR+ec84EAfEcQbQv6gm5Us7jeYQ9QIUhV9GnqOzgXiJ7IhiC8TI3fHB9IjJxtbWPrlCDiZcMqmzco2JhN/nZzacc7gcqoMbjJyxz652qm6TYD36QkmWUdX0t8zpwT+XTt3WfisDqi25DnPq/EpCixWsyPXC7tbU7jg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=qZwP3f38; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="qZwP3f38" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4ED7BC116C6; Thu, 26 Feb 2026 13:51:19 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1772113881; bh=XMWruOYLkxF4dZAR02/HIKIt+BPClxko2uWALy0QcoI=; 
 h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=qZwP3f38uKfGB9fq7IYo9TahVOr4ZYqbWZ3EleDaFyAIcdPKLxjizwC4CDqnxZnhO /SIBVkvhQyjW8zT/LzxL4sVsmsnduB2JeLOPPUnUbDnyNkn/efD0huJjkmYQPujb+I 3z7vUUCIIsBxNG8m5EfjbCrTwRbI3C7m5Hvo+MeVHM7QClUoZYIrCxrbZ4ya4G+nS0 0EQs7/oUZZi/pI0MuSk/g1EndWylKco4Wf9Q4h41Sp8MHFNDYvK8/EqmFZoOxvweSB VxQIsH5RGPAGSCCQc2jQlzVB92DsKqxraJfU6WBh+6GEGFmFn/+mVbiQ7kdtfcx+WJ tYQBLQYkMFF7Q==
From: Christian Brauner
Date: Thu, 26 Feb 2026 14:51:03 +0100
Subject: [PATCH v5 5/6] selftests/pidfd: add CLONE_NNP tests
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20260226-work-pidfs-autoreap-v5-5-d148b984a989@kernel.org>
References: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
In-Reply-To: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
To: Oleg Nesterov, Jann Horn
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Christian Brauner
X-Mailer: b4 0.15-dev-47773
X-Developer-Signature: v=1; a=openpgp-sha256; l=4306; i=brauner@kernel.org; h=from:subject:message-id; bh=XMWruOYLkxF4dZAR02/HIKIt+BPClxko2uWALy0QcoI=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWQu8D9ddnibmWbDlzJ5Z97z+T2KOyUCArnzy1mCXFX+F M5WfVDQUcrCIMbFICumyOLQbhIut5ynYrNRpgbMHFYmkCEMXJwCMJFpkxkZzsrlZHq8zkgIyxN8 vezFx9AuIfZbgSkCxvP+m9ecbrrxhOF/2fGXK/3D+V0vrDyY3hfc9mO2ia+Sx/Us0Q0a97NfTTn PCQA=
X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624

Add tests for the new CLONE_NNP flag:

- nnp_sets_no_new_privs: Verify a child created with CLONE_NNP has
  no_new_privs set while the parent does not.
- nnp_rejects_thread: Verify CLONE_NNP | CLONE_THREAD is rejected with
  -EINVAL since threads share credentials.
- autoreap_no_new_privs_unset: Verify a plain CLONE_AUTOREAP child does
  not get no_new_privs.

Signed-off-by: Christian Brauner
Reviewed-by: Oleg Nesterov
---
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 126 +++++++++++++++++++++
 1 file changed, 126 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
index e230d2fe4a64..5fb11230fb07 100644
--- a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
@@ -26,6 +26,10 @@
 #define CLONE_AUTOREAP 0x400000000ULL
 #endif
 
+#ifndef CLONE_NNP
+#define CLONE_NNP 0x1000000000ULL
+#endif
+
 static pid_t create_autoreap_child(int *pidfd)
 {
 	struct __clone_args args = {
@@ -493,4 +497,126 @@ TEST(autoreap_no_inherit)
 	close(pidfd);
 }
 
+/*
+ * Test that CLONE_NNP sets no_new_privs on the child.
+ * The child checks via prctl(PR_GET_NO_NEW_PRIVS) and reports back.
+ * The parent must NOT have no_new_privs set afterwards.
+ */
+TEST(nnp_sets_no_new_privs)
+{
+	struct __clone_args args = {
+		.flags = CLONE_PIDFD | CLONE_AUTOREAP | CLONE_NNP,
+		.exit_signal = 0,
+	};
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int pidfd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	/* Ensure parent does not already have no_new_privs. */
+	ret = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("Parent already has no_new_privs set, cannot run test");
+	}
+
+	args.pidfd = ptr_to_u64(&pidfd);
+
+	pid = sys_clone3(&args, sizeof(args));
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_NNP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/*
+		 * Child: check no_new_privs. Exit 0 if set, 1 if not.
+		 */
+		ret = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
+		_exit(ret == 1 ? 0 : 1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Parent must still NOT have no_new_privs. */
+	ret = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("Parent got no_new_privs after creating CLONE_NNP child");
+	}
+
+	/* Wait for child to exit. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+
+	/* Verify child exited with 0 (no_new_privs was set). */
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0) {
+		TH_LOG("Child did not have no_new_privs set");
+	}
+
+	close(pidfd);
+}
+
+/*
+ * Test that CLONE_NNP with CLONE_THREAD fails with EINVAL.
+ */
+TEST(nnp_rejects_thread)
+{
+	struct __clone_args args = {
+		.flags = CLONE_NNP | CLONE_THREAD |
+			 CLONE_SIGHAND | CLONE_VM,
+		.exit_signal = 0,
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test that a plain CLONE_AUTOREAP child does NOT get no_new_privs.
+ * Only CLONE_NNP should set it.
+ */
+TEST(autoreap_no_new_privs_unset)
+{
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int pidfd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autoreap_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/*
+		 * Child: check no_new_privs. Exit 0 if NOT set, 1 if set.
+		 */
+		ret = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
+		_exit(ret == 0 ? 0 : 1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0) {
+		TH_LOG("Plain autoreap child unexpectedly has no_new_privs");
+	}
+
+	close(pidfd);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3

From nobody Tue Apr 7 15:27:24 2026
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D20673ACEE9; Thu, 26 Feb 2026 13:51:23 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1772113883; cv=none; b=L1e11DR3Rz6wun/KJcnSig5NQlcH0VsMStRnrMz9uDoLDhceOaDvjKRGBf92Gok2HdZTA/w/vJ17ZiATJRT1wjilL1yifzNfWLYCehmfzne4WQ90Uaj4fcmDY8W21X20kUoSE7SYyNWQ1FYqzMmFiQ5z/q+fCIxNrFuYznlP8pA=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1772113883; c=relaxed/simple; bh=TDnJu/pMKQyvPl38p8goYkYRU9gufN4lR8yOJqLo3HU=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=JL48fbmo+vWCZm3Y8gMMkNiXSZwBa/cnZ7a+fhwptceKhjFMzKPF2MLqGgmXSfsyYU61fNIe4FBEYI9oiJqQSRnjPMkr4GXKDuiMXiwltLOdKHz5gLZx7FX+9MHwjZu/nLmEarQF9xb/m9v3NyyTjHeHYdixEvfgOMhH94kNTl4=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=W9dBo1bH; arc=none smtp.client-ip=10.30.226.201
Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="W9dBo1bH"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id 93C29C19423; Thu, 26 Feb 2026 13:51:21 +0000
 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1772113883; bh=TDnJu/pMKQyvPl38p8goYkYRU9gufN4lR8yOJqLo3HU=; h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=W9dBo1bH6sGVvjfealiq/uFOSoAlmLRBbeIF2iPiW9oEcbYv6tQ0dnw7krt2jeBqt yJ/j729uTuxB9JKO3Lg/THQXDwT+3RyminZny4re1+4/9H8eZJ6dtxJiF/bAH/vovW LuE/bx8H2+5dOwX76D8DySPtswX385ZLkQ/DzKM1lrf3Aqt1af1lo3iGv/x18m1/uS mtrKbZVw0/CMdVkNOCJLVSD+NdOHGCcjEKETPgC2noPfsDwwjy4YpKjf8zR3zYudl1 N27ug9n8EwuQ4qMHHnhO4NGLJWnqijSYVzgklYBiu+AHZIFA7GPSCTsWcdqOvYHpse mS5L7wP1wF79Q==
From: Christian Brauner
Date: Thu, 26 Feb 2026 14:51:04 +0100
Subject: [PATCH v5 6/6] selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20260226-work-pidfs-autoreap-v5-6-d148b984a989@kernel.org>
References: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
In-Reply-To: <20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org>
To: Oleg Nesterov, Jann Horn
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Christian Brauner
X-Mailer: b4 0.15-dev-47773
X-Developer-Signature: v=1; a=openpgp-sha256; l=8062; i=brauner@kernel.org; h=from:subject:message-id; bh=TDnJu/pMKQyvPl38p8goYkYRU9gufN4lR8yOJqLo3HU=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWQu8D+d3NvnbPbtofCe3dVy/e0LlKfIRRbem+d1/0K2b Iq8TtPDjlIWBjEuBlkxRRaHdpNwueU8FZuNMjVg5rAygQxh4OIUgIlY+jH8jzqt+unfxkUrtv0M 6vod77z+YMb+/UG8j3lOOj+9Xsyw5RXD/+h711eqGXw6arm0c06vfFGHQkUPi8IGxgO3Tq/U077 xnRkA
X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624

Add tests for CLONE_PIDFD_AUTOKILL:

- autokill_basic: Verify closing the clone3 pidfd kills the child.
- autokill_requires_pidfd: Verify AUTOKILL without CLONE_PIDFD fails.
- autokill_requires_autoreap: Verify AUTOKILL without CLONE_AUTOREAP
  fails.
- autokill_rejects_thread: Verify AUTOKILL with CLONE_THREAD fails.
- autokill_pidfd_open_no_effect: Verify only the clone3 pidfd triggers
  autokill, not pidfd_open().
- autokill_requires_cap_sys_admin: Verify AUTOKILL without CLONE_NNP
  fails with -EPERM for an unprivileged caller.
- autokill_without_nnp_with_cap: Verify AUTOKILL without CLONE_NNP
  succeeds with CAP_SYS_ADMIN.

Signed-off-by: Christian Brauner
Reviewed-by: Oleg Nesterov
---
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 278 +++++++++++++++++++++
 1 file changed, 278 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
index 5fb11230fb07..36adee6c424e 100644
--- a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
@@ -26,10 +26,37 @@
 #define CLONE_AUTOREAP 0x400000000ULL
 #endif
 
+#ifndef CLONE_PIDFD_AUTOKILL
+#define CLONE_PIDFD_AUTOKILL 0x800000000ULL
+#endif
+
 #ifndef CLONE_NNP
 #define CLONE_NNP 0x1000000000ULL
 #endif
 
+#ifndef _LINUX_CAPABILITY_VERSION_3
+#define _LINUX_CAPABILITY_VERSION_3 0x20080522
+#endif
+
+struct cap_header {
+	__u32 version;
+	int pid;
+};
+
+struct cap_data {
+	__u32 effective;
+	__u32 permitted;
+	__u32 inheritable;
+};
+
+static int drop_all_caps(void)
+{
+	struct cap_header hdr = { .version = _LINUX_CAPABILITY_VERSION_3 };
+	struct cap_data data[2] = {};
+
+	return syscall(__NR_capset, &hdr, data);
+}
+
 static pid_t create_autoreap_child(int *pidfd)
 {
 	struct __clone_args args = {
@@ -619,4 +646,255 @@ TEST(autoreap_no_new_privs_unset)
 	close(pidfd);
 }
 
+/*
+ * Helper: create a child with CLONE_PIDFD | CLONE_PIDFD_AUTOKILL | CLONE_AUTOREAP | CLONE_NNP.
+ */
+static pid_t create_autokill_child(int *pidfd)
+{
+	struct __clone_args args = {
+		.flags = CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
+			 CLONE_AUTOREAP | CLONE_NNP,
+		.exit_signal = 0,
+		.pidfd = ptr_to_u64(pidfd),
+	};
+
+	return sys_clone3(&args, sizeof(args));
+}
+
+/*
+ * Basic autokill test: child blocks in pause(), parent closes the
+ * clone3 pidfd, child should be killed and autoreaped.
+ */
+TEST(autokill_basic)
+{
+	int pidfd = -1, pollfd_fd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autokill_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_PIDFD_AUTOKILL not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pause();
+		_exit(1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	/*
+	 * Open a second pidfd via pidfd_open() so we can observe the
+	 * child's death after closing the clone3 pidfd.
+	 */
+	pollfd_fd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(pollfd_fd, 0);
+
+	/* Close the clone3 pidfd — this should trigger autokill. */
+	close(pidfd);
+
+	/* Wait for the child to die via the pidfd_open'd fd. */
+	pfd.fd = pollfd_fd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Child should be autoreaped — no zombie. */
+	usleep(100000);
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pollfd_fd);
+}
+
+/*
+ * CLONE_PIDFD_AUTOKILL without CLONE_PIDFD must fail with EINVAL.
+ */
+TEST(autokill_requires_pidfd)
+{
+	struct __clone_args args = {
+		.flags = CLONE_PIDFD_AUTOKILL | CLONE_AUTOREAP,
+		.exit_signal = 0,
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * CLONE_PIDFD_AUTOKILL without CLONE_AUTOREAP must fail with EINVAL.
+ */
+TEST(autokill_requires_autoreap)
+{
+	int pidfd = -1;
+	struct __clone_args args = {
+		.flags = CLONE_PIDFD | CLONE_PIDFD_AUTOKILL,
+		.exit_signal = 0,
+		.pidfd = ptr_to_u64(&pidfd),
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * CLONE_PIDFD_AUTOKILL with CLONE_THREAD must fail with EINVAL.
+ */
+TEST(autokill_rejects_thread)
+{
+	int pidfd = -1;
+	struct __clone_args args = {
+		.flags = CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
+			 CLONE_AUTOREAP | CLONE_THREAD |
+			 CLONE_SIGHAND | CLONE_VM,
+		.exit_signal = 0,
+		.pidfd = ptr_to_u64(&pidfd),
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test that only the clone3 pidfd triggers autokill, not pidfd_open().
+ * Close the pidfd_open'd fd first — child should survive.
+ * Then close the clone3 pidfd — child should be killed and autoreaped.
+ */
+TEST(autokill_pidfd_open_no_effect)
+{
+	int pidfd = -1, open_fd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autokill_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_PIDFD_AUTOKILL not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pause();
+		_exit(1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Open a second pidfd via pidfd_open(). */
+	open_fd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(open_fd, 0);
+
+	/*
+	 * Close the pidfd_open'd fd — child should survive because
+	 * only the clone3 pidfd has autokill.
+	 */
+	close(open_fd);
+	usleep(200000);
+
+	/* Verify child is still alive by polling the clone3 pidfd. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 0);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("Child died after closing pidfd_open fd — should still be alive");
+	}
+
+	/* Open another observation fd before triggering autokill. */
+	open_fd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(open_fd, 0);
+
+	/* Now close the clone3 pidfd — this triggers autokill. */
+	close(pidfd);
+
+	pfd.fd = open_fd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Child should be autoreaped — no zombie. */
+	usleep(100000);
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(open_fd);
+}
+
+/*
+ * Test that CLONE_PIDFD_AUTOKILL without CLONE_NNP fails with EPERM
+ * for an unprivileged caller.
+ */
+TEST(autokill_requires_cap_sys_admin)
+{
+	int pidfd = -1, ret;
+	struct __clone_args args = {
+		.flags = CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
+			 CLONE_AUTOREAP,
+		.exit_signal = 0,
+		.pidfd = ptr_to_u64(&pidfd),
+	};
+	pid_t pid;
+
+	/* Drop all capabilities so we lack CAP_SYS_ADMIN. */
+	ret = drop_all_caps();
+	ASSERT_EQ(ret, 0);
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EPERM);
+}
+
+/*
+ * Test that CLONE_PIDFD_AUTOKILL without CLONE_NNP succeeds with
+ * CAP_SYS_ADMIN.
+ */
+TEST(autokill_without_nnp_with_cap)
+{
+	struct __clone_args args = {
+		.flags = CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
+			 CLONE_AUTOREAP,
+		.exit_signal = 0,
+	};
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int pidfd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	if (geteuid() != 0)
+		SKIP(return, "Need root/CAP_SYS_ADMIN");
+
+	args.pidfd = ptr_to_u64(&pidfd);
+
+	pid = sys_clone3(&args, sizeof(args));
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_PIDFD_AUTOKILL not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		_exit(0);
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Wait for child to exit. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
+
+	close(pidfd);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3
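
Editorial note, not part of the series: the tests above state that the exit_code field returned by PIDFD_GET_INFO is in waitpid format and decode it with WIFEXITED/WEXITSTATUS and WIFSIGNALED/WTERMSIG. The sketch below shows the same decoding on a status obtained from a plain fork() and waitpid(), so it runs on kernels that lack the new clone3() flags. The names struct exit_desc, decode_wait_status() and run_child() are hypothetical helpers made up for this illustration; they do not appear in the patches.

```c
/* Illustrative sketch only; not taken from the patch series. */
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper type: a decoded waitpid()-format status word. */
struct exit_desc {
	bool exited;	/* terminated via _exit()/exit() */
	bool signaled;	/* terminated by a signal */
	int value;	/* exit status or terminating signal number */
};

/* Decode a waitpid()-format status with the standard macros. */
static struct exit_desc decode_wait_status(int status)
{
	struct exit_desc d = { 0 };

	if (WIFEXITED(status)) {
		d.exited = true;
		d.value = WEXITSTATUS(status);
	} else if (WIFSIGNALED(status)) {
		d.signaled = true;
		d.value = WTERMSIG(status);
	}
	return d;
}

/*
 * Fork a regular (non-autoreap) child and collect its status.
 * If sig is 0 the child calls _exit(code); otherwise the child blocks
 * in pause() and the parent sends it sig.
 * Returns 0 and stores the raw status on success, -1 on error.
 */
static int run_child(int code, int sig, int *status)
{
	pid_t pid;

	assert(status != NULL);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		if (sig == 0)
			_exit(code);
		for (;;)
			pause();
	}
	if (sig != 0 && kill(pid, sig) < 0)
		return -1;
	if (waitpid(pid, status, 0) != pid)
		return -1;
	return 0;
}
```

A caller would pass the status from run_child(42, 0, &st) or run_child(0, SIGKILL, &st) to decode_wait_status(). With an autoreap child there is no waitpid() status to collect, which is why the tests read the same format from info.exit_code instead.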