[PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL

Christian Brauner posted 4 patches 1 month, 2 weeks ago
There is a newer version of this series
[PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL
Posted by Christian Brauner 1 month, 2 weeks ago
Add two new clone3() flags for pidfd-based process lifecycle management.

CLONE_AUTOREAP makes a child process auto-reap on exit without ever
becoming a zombie. This is a per-process property in contrast to the
existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for SIGCHLD
which applies to all children of a given parent.

Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children while
still being able to wait() on others.

CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and returns
autoreap=true causing exit_notify() to transition the task directly to
EXIT_DEAD. Since the flag lives on the child it survives reparenting: if
the original parent exits and the child is reparented to a subreaper or
init the child still auto-reaps when it eventually exits. This is
cleaner than forcing the subreaper to get SIGCHLD and then reaping it.
If the parent doesn't care the subreaper won't care. If there's a
subreaper that would care it would be easy enough to add a prctl() that
either just turns SIGCHLD back on and turns off auto-reaping or a
prctl() that just notifies the subreaper whenever a child is reparented
to it.

CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern. No exit signal is delivered so exit_signal must be zero.
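
A rough userspace sketch of the CLONE_PIDFD | CLONE_AUTOREAP combination
might look like the following. It is an illustration, not code from the
selftests. It assumes the installed uapi headers already define
CLONE_AUTOREAP and compiles to a stub returning -ENOSYS otherwise, so no
numeric flag value is guessed here. The function name autoreap_demo() is
made up, and fetching the exit status via PIDFD_GET_INFO is left out.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_* */
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Create an autoreap child with a pidfd and wait for its exit via
 * poll(). Returns 0 on success or a negative errno. No wait() call is
 * made; with CLONE_AUTOREAP there is no zombie to collect.
 */
static int autoreap_demo(void)
{
#if defined(CLONE_AUTOREAP) && defined(SYS_clone3)
	struct clone_args args;
	struct pollfd pfd;
	int pidfd = -1, n, err = 0;
	long pid;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_PIDFD | CLONE_AUTOREAP;
	args.pidfd = (uint64_t)(uintptr_t)&pidfd;
	args.exit_signal = 0;	/* must be zero with CLONE_AUTOREAP */

	pid = syscall(SYS_clone3, &args, sizeof(args));
	if (pid < 0)
		return -errno;	/* e.g. -EINVAL on kernels without the flag */
	if (pid == 0)
		_exit(0);	/* child */

	/* The pidfd becomes readable once the child has exited. */
	pfd.fd = pidfd;
	pfd.events = POLLIN;
	n = poll(&pfd, 1, 5000);
	if (n < 0)
		err = -errno;
	else if (n == 0)
		err = -ETIMEDOUT;

	close(pidfd);
	return err;
#else
	return -ENOSYS;		/* headers predate CLONE_AUTOREAP */
#endif
}
```

Dropping CLONE_PIDFD and the poll() part would give the fire-and-forget
variant.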

The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.

CLONE_PIDFD_AUTOKILL ties a child's lifetime to the pidfd returned from
clone3(). When the last reference to the struct file created by clone3()
is closed the kernel sends SIGKILL to the child. A pidfd obtained via
pidfd_open() for the same process does not keep the child alive and does
not trigger autokill - only the specific struct file from clone3() has
this property. This is useful for container runtimes, service managers,
and sandboxed subprocess execution - any scenario where the child must
die if the parent crashes or abandons the pidfd.

CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP. It
requires CLONE_PIDFD because the whole point is tying the child's
lifetime to the pidfd. It requires CLONE_AUTOREAP because a killed child
with no one to reap it would become a zombie - the primary use case is
the parent crashing or abandoning the pidfd so no one is around to call
waitpid().
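
The flag dependencies described so far can be summarised as a small
userspace check. This is only a sketch of the rules as stated in this
cover letter, not the validation code in kernel/fork.c. The DEMO_* bit
values are placeholders and not the real UAPI values, the function name
is made up, and returning -EINVAL is an assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder bits for illustration only, NOT the real UAPI values. */
#define DEMO_CLONE_PIDFD		(UINT64_C(1) << 0)
#define DEMO_CLONE_AUTOREAP		(UINT64_C(1) << 1)
#define DEMO_CLONE_PIDFD_AUTOKILL	(UINT64_C(1) << 2)

/* Returns 0 if the combination is allowed, -EINVAL (assumed) if not. */
static int demo_check_flags(uint64_t flags, int exit_signal)
{
	/* No exit signal is delivered for an autoreap child. */
	if ((flags & DEMO_CLONE_AUTOREAP) && exit_signal != 0)
		return -EINVAL;

	if (flags & DEMO_CLONE_PIDFD_AUTOKILL) {
		/* The lifetime is tied to the pidfd, so one must exist. */
		if (!(flags & DEMO_CLONE_PIDFD))
			return -EINVAL;
		/* A killed child must not linger as a zombie. */
		if (!(flags & DEMO_CLONE_AUTOREAP))
			return -EINVAL;
	}
	return 0;
}
```

For example, DEMO_CLONE_PIDFD | DEMO_CLONE_PIDFD_AUTOKILL without
DEMO_CLONE_AUTOREAP is rejected, while all three flags together with
exit_signal == 0 pass.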

The clone3 pidfd is identified by storing a pointer to the struct file in
signal_struct.autokill_pidfd. The pidfs .release handler compares the
file being closed against this pointer and sends SIGKILL only on match.
dup()/fork() share the same struct file so they extend the child's
lifetime until the last reference drops.
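
As a toy model of that identity check (plain userspace C, not the actual
fs/pidfs.c code; all demo_* names are invented for illustration and the
refcount stands in for the struct file reference count):

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_file {
	int refcount;			/* references from fds */
};

struct demo_signal {
	struct demo_file *autokill_pidfd; /* the file created by clone3() */
	bool killed;			/* stands in for sending SIGKILL */
};

/* dup() or fork(): another fd referring to the same file. */
static void demo_get(struct demo_file *f)
{
	f->refcount++;
}

/* Runs once the last reference is gone, like a .release handler. */
static void demo_release(struct demo_file *f, struct demo_signal *sig)
{
	if (sig->autokill_pidfd == f) {	/* kill only on match */
		sig->killed = true;
		sig->autokill_pidfd = NULL;
	}
}

/* close(): drop one reference, release on the last one. */
static void demo_put(struct demo_file *f, struct demo_signal *sig)
{
	if (--f->refcount == 0)
		demo_release(f, sig);
}
```

In this model, closing a file that stands for a pidfd_open() pidfd never
sets killed, and a demo_get() on the clone3 file delays the kill until
the matching extra demo_put().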

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Add CLONE_PIDFD_AUTOKILL flag
- Decouple CLONE_AUTOREAP from CLONE_PIDFD: the autoreap mechanism has
  no dependency on pidfds. This allows fire-and-forget patterns where
  the parent does not need exit status.
- Link to v1: https://patch.msgid.link/20260216-work-pidfs-autoreap-v1-0-e63f663008f2@kernel.org

---
Christian Brauner (4):
      clone: add CLONE_AUTOREAP
      pidfd: add CLONE_PIDFD_AUTOKILL
      selftests/pidfd: add CLONE_AUTOREAP tests
      selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests

 fs/pidfs.c                                         |  16 +
 include/linux/sched/signal.h                       |   4 +
 include/uapi/linux/sched.h                         |   2 +
 kernel/fork.c                                      |  28 +-
 kernel/ptrace.c                                    |   3 +-
 kernel/signal.c                                    |   4 +
 tools/testing/selftests/pidfd/.gitignore           |   1 +
 tools/testing/selftests/pidfd/Makefile             |   2 +-
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 676 +++++++++++++++++++++
 9 files changed, 732 insertions(+), 4 deletions(-)
---
base-commit: 9702969978695d9a699a1f34771580cdbb153b33
change-id: 20260214-work-pidfs-autoreap-3ee677e240a8
Re: [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL
Posted by Christian Brauner 1 month, 2 weeks ago
> CLONE_PIDFD_AUTOKILL ties a child's lifetime to the pidfd returned from
> clone3(). When the last reference to the struct file created by clone3()
> is closed the kernel sends SIGKILL to the child.

So this is for me one of the most useful features that I've been
pondering for a long time but always put off. Its usefulness is
intimately tied to the fact that the kill-on-close contract cannot be
flouted no matter what gets executed (FreeBSD has the same behavior for
pdfork()).

If the parent says to SIGKILL the child once the fd is closed then that
isn't reset by a privileged exec or a credential change. This is in
contrast to related mechanisms such as pdeath_signal which gets reset
by all kinds of crap but then can be set again and it's just cumbersome
and not super useful. Not even signal delivery is guaranteed as
permissions are checked for that as well.
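
One of the documented resets is easy to show from userspace. prctl(2)
states that the parent-death signal is cleared for the child of a
fork(2). The privileged-exec cases are harder to reproduce in a few
lines, so this sketch covers only the fork() case. The helper name is
made up for illustration.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Set a parent-death signal on the caller, fork, and report what the
 * child reads back with PR_GET_PDEATHSIG. Returns that value (0 means
 * "cleared") or -1 on error.
 */
static int pdeathsig_seen_by_forked_child(void)
{
	int old = 0, status;
	pid_t pid;

	if (prctl(PR_GET_PDEATHSIG, &old) < 0)
		return -1;
	if (prctl(PR_SET_PDEATHSIG, (unsigned long)SIGTERM) < 0)
		return -1;

	pid = fork();
	if (pid == 0) {
		int sig = -1;

		if (prctl(PR_GET_PDEATHSIG, &sig) < 0)
			_exit(255);
		_exit(sig);	/* report the value via the exit status */
	}

	/* Put the caller's previous setting back. */
	prctl(PR_SET_PDEATHSIG, (unsigned long)old);

	if (pid < 0)
		return -1;
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	if (WEXITSTATUS(status) == 255)
		return -1;
	return WEXITSTATUS(status);
}
```

The helper is expected to return 0: the setting made by the parent does
not carry over to the forked child, which would have to set it again.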

My ideal model for kill-on-close is to just ruthlessly enforce that the
kernel murders anything once the file is released. But I would really
like to get some thoughts on this.