From nobody Sun Feb 8 02:21:39 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F25DF23278D; Fri, 25 Apr 2025 08:11:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745568703; cv=none; b=d3/0NGOzXNRzS+tJkNoVbIO1eKSiC5kTUcwlyFaWGX3nfk6umDORpjU7M7qWlYiv/r0QlNtE2h7MtL3R3PdoGoh5a2G3UL3izFMn7zZ3WEMWY8HE72Abueyf1qvDA2HRRO1aqAVb39kU+7hTcOi6UWgeLUM9ntrorfuWLEOz87Q= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745568703; c=relaxed/simple; bh=52iuwbmVFpSuQ7ggO80gZdYLZgDigukVwIvtQCtZJWI=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=VTkqYN3j4SjkNd59mieI94/0kXqW71ltxFFrZbH+5D24LYDqQjtgQ6n6g3RgnYgezXplW3+dau6TzkU9G+ucOvnzRYZ1j2ItzINNKuhuLUsxIwMG/51cXR0bH05wbNUxbK/fKRWYbqNGwqwgjt9omodDVM8eRQdPPlGnSfgNme8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=DgCxwnr0; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="DgCxwnr0" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 6DA46C4CEEA; Fri, 25 Apr 2025 08:11:38 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1745568702; bh=52iuwbmVFpSuQ7ggO80gZdYLZgDigukVwIvtQCtZJWI=; h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=DgCxwnr0vupw3+H3Ra/Cv6a/rzjqg5J9H9XqOL8qKehpmBUrNhLq/NX3xLVeJPRnE 367cx/7f+Y+UNDLz/Mk417VDzte/KLDipaD/xYUDmy7klL7cg2sPwW5K79LIJlqTvu gdk3JhkXN4Wc0Hm99gz6I2noJUsojBJrLS/LPCz+pw+ZQ5U9CETGKI0UFUOQ3J5tfP 
gApVE/YI45R85EMfDkx8MqLLhD0i4nGq78yyup8trS7+yMpGHTHyEBqg35PMGKs21+ u/o/FjTaoeal4uDuY+yEFIHdnLozWNlC7dtrwmgeUy4B8sZsTd0cbaHWGOLz/z5VKv DKfsAKtJ7mxCw== From: Christian Brauner Date: Fri, 25 Apr 2025 10:11:30 +0200 Subject: [PATCH v2 1/4] pidfs: register pid in pidfs Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250425-work-pidfs-net-v2-1-450a19461e75@kernel.org> References: <20250425-work-pidfs-net-v2-0-450a19461e75@kernel.org> In-Reply-To: <20250425-work-pidfs-net-v2-0-450a19461e75@kernel.org> To: Oleg Nesterov , Kuniyuki Iwashima , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org, David Rheinsberg , Jan Kara , Alexander Mikhalitsyn , Luca Boccassi , Lennart Poettering , Daan De Meyer , Mike Yuan , Christian Brauner X-Mailer: b4 0.15-dev-c25d1 X-Developer-Signature: v=1; a=openpgp-sha256; l=3009; i=brauner@kernel.org; h=from:subject:message-id; bh=52iuwbmVFpSuQ7ggO80gZdYLZgDigukVwIvtQCtZJWI=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWRwO29V7vkhfvjKybiJjdN4FvzkWvuy1vpSZrfmhR+7v sgHFAh/6yhlYRDjYpAVU2RxaDcJl1vOU7HZKFMDZg4rE8gQBi5OAZjIZk9Ghuti6cbJKnPmWXix zOdtV12XFh/4sefO2i0nZ0+vYtxX4MLI8EHWb1ddrlVPiiCzX7JnYdRu6ym3RFmTXnP9LLg1d3k ZBwA= X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624 Add simple helpers that allow a struct pid to be pinned via a pidfs dentry/inode. If no pidfs dentry exists a new one will be allocated for it. A reference is taken by pidfs on @pid. The reference must be released via pidfs_put_pid(). This will allow AF_UNIX sockets to allocate a dentry for the peer credentials pid at the time they are recorded where we know the task is still alive. 
When the task gets reaped its exit status is guaranteed to be recorded and a pidfd can be handed out for the reaped task. Reviewed-by: Oleg Nesterov Signed-off-by: Christian Brauner Reviewed-by: David Rheinsberg --- fs/pidfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++= ++++ include/linux/pidfs.h | 3 +++ 2 files changed, 61 insertions(+) diff --git a/fs/pidfs.c b/fs/pidfs.c index d64a4cbeb0da..308792d4b11a 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -896,6 +896,64 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigne= d int flags) return pidfd_file; } =20 +/** + * pidfs_register_pid - register a struct pid in pidfs + * @pid: pid to pin + * + * Register a struct pid in pidfs. Needs to be paired with + * pidfs_put_pid() to not risk leaking the pidfs dentry and inode. + * + * Return: On success zero, on error a negative error code is returned. + */ +int pidfs_register_pid(struct pid *pid) +{ + struct path path __free(path_put) =3D {}; + int ret; + + might_sleep(); + + if (!pid) + return 0; + + ret =3D path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path); + if (unlikely(ret)) + return ret; + /* Keep the dentry and only put the reference to the mount. */ + path.dentry =3D NULL; + return 0; +} + +/** + * pidfs_get_pid - pin a struct pid through pidfs + * @pid: pid to pin + * + * Similar to pidfs_register_pid() but only valid if the caller knows + * there's a reference to the @pid through a dentry already that can't + * go away. + */ +void pidfs_get_pid(struct pid *pid) +{ + if (!pid) + return; + WARN_ON_ONCE(!stashed_dentry_get(&pid->stashed)); +} + +/** + * pidfs_put_pid - drop a pidfs reference + * @pid: pid to drop + * + * Drop a reference to @pid via pidfs. This is only safe if the + * reference has been taken via pidfs_get_pid(). 
+ */ +void pidfs_put_pid(struct pid *pid) +{ + might_sleep(); + + if (!pid) + return; + dput(pid->stashed); +} + static void pidfs_inode_init_once(void *data) { struct pidfs_inode *pi =3D data; diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h index 05e6f8f4a026..2676890c4d0d 100644 --- a/include/linux/pidfs.h +++ b/include/linux/pidfs.h @@ -8,5 +8,8 @@ void pidfs_add_pid(struct pid *pid); void pidfs_remove_pid(struct pid *pid); void pidfs_exit(struct task_struct *tsk); extern const struct dentry_operations pidfs_dentry_operations; +int pidfs_register_pid(struct pid *pid); +void pidfs_get_pid(struct pid *pid); +void pidfs_put_pid(struct pid *pid); =20 #endif /* _LINUX_PID_FS_H */ --=20 2.47.2 From nobody Sun Feb 8 02:21:39 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 53BA21FBEB3; Fri, 25 Apr 2025 08:11:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745568707; cv=none; b=dI279usLoWSk/4eAHRX4ARQTsXDNmeU0d3FNP45qCGAtZNFtLC7DO5ZoJke2I0EWwKt7Ssvj6IFlfWwUJ6/ky1kWPIIWluJ019Rt34iGW5qVxyDRJHhVEjbMVyX/MuEO3Erh0LalaYBSQoT94yO1iy9fo8KLJLQBoAk/Y8gP+Ts= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745568707; c=relaxed/simple; bh=TIzb1XjY5m+NMEkrlE4sVETjtxq1jNB+I9PJv+21BYA=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=ZQ4XKgkbqzq0tW+WiT9M+2ukzxKRpQjdQzOaJ20wg15P5yoOHgZv8e4aNSNWLzYoxCSQs+/3VdEBhwb1i0cknhQSApDCi5pp7g4pXirBHp+PxxW3ostLxPiWRNQClGrFqHmlHfWedJVF6p7lFi2tp+YUyXwSF2R0upD6zGd3/Fs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=AgcSoaC9; arc=none 
smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="AgcSoaC9" Received: by smtp.kernel.org (Postfix) with ESMTPSA id EA3B0C4CEE4; Fri, 25 Apr 2025 08:11:42 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1745568706; bh=TIzb1XjY5m+NMEkrlE4sVETjtxq1jNB+I9PJv+21BYA=; h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=AgcSoaC9ENrczjXekusmTk2YJHw5pqb/nh7/8igwvSbSTwV5jIau4/nJGggavGhlX oV0x3Hq62GKEGJfZfNTRkCHxA+MRKpqcjbFxVqZwgUY/YWHuXSzGTNZ/VcSXhBSRqn 5U/8CE46WiW3A2X5fBkKI/+q4KhxVLpzCbdcZGua0WOO9bLzXdWdeTlEFwrjdu+znT hylu6KGhczMEqCilB0h+oTrJlVPgpKbXYrPBgj/B9F4bVhoT8MTMiw6bzxguhYpo5K BcpKD79N3Oz9M6LBYEvI91eRekO0ZhAUJibEBqmsxCtSmo6ucY1tUjDl3OoLKzBtzW kvFj6jVllLRkA== From: Christian Brauner Date: Fri, 25 Apr 2025 10:11:31 +0200 Subject: [PATCH v2 2/4] net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250425-work-pidfs-net-v2-2-450a19461e75@kernel.org> References: <20250425-work-pidfs-net-v2-0-450a19461e75@kernel.org> In-Reply-To: <20250425-work-pidfs-net-v2-0-450a19461e75@kernel.org> To: Oleg Nesterov , Kuniyuki Iwashima , "David S. 
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org, David Rheinsberg , Jan Kara , Alexander Mikhalitsyn , Luca Boccassi , Lennart Poettering , Daan De Meyer , Mike Yuan , Christian Brauner X-Mailer: b4 0.15-dev-c25d1 X-Developer-Signature: v=1; a=openpgp-sha256; l=8886; i=brauner@kernel.org; h=from:subject:message-id; bh=TIzb1XjY5m+NMEkrlE4sVETjtxq1jNB+I9PJv+21BYA=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWRwO299IrjBUjFvs0vqmpNnWDecfyM8qVA8u7e+UcMk/ 8ubf1dVOkpZGMS4GGTFFFkc2k3C5ZbzVGw2ytSAmcPKBDKEgYtTACYSbcfwP8H72hyVqkd7rixS CF/z/jSjJofd6SlybO9Ynx0NEZtnYMrwP3Ftjnjf8cffrJaePBGUc+fL0RyurJ/XfZhV1U/fcmi z4QMA X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624 SO_PEERPIDFD currently doesn't support handing out pidfds if the sk->sk_peer_pid thread-group leader has already been reaped. In this case it currently returns EINVAL. Userspace still wants to get a pidfd for a reaped process to have a stable handle it can pass on. This is especially useful now that it is possible to retrieve exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s PIDFD_INFO_EXIT flag. Another summary has been provided by David in [1]: > A pidfd can outlive the task it refers to, and thus user-space must > already be prepared that the task underlying a pidfd is gone at the time > they get their hands on the pidfd. For instance, resolving the pidfd to > a PID via the fdinfo must be prepared to read `-1`. > > Despite user-space knowing that a pidfd might be stale, several kernel > APIs currently add another layer that checks for this. In particular, > SO_PEERPIDFD returns `EINVAL` if the peer-task was already reaped, > but returns a stale pidfd if the task is reaped immediately after the > respective alive-check. 
> > This has the unfortunate effect that user-space now has two ways to > check for the exact same scenario: A syscall might return > EINVAL/ESRCH/... *or* the pidfd might be stale, even though there is no > particular reason to distinguish both cases. This also propagates > through user-space APIs, which pass on pidfds. They must be prepared to > pass on `-1` *or* the pidfd, because there is no guaranteed way to get a > stale pidfd from the kernel. > Userspace must already deal with a pidfd referring to a reaped task as > the task may exit and get reaped at any time while there are still many > pidfds referring to it. In order to allow handing out reaped pidfds, SO_PEERPIDFD needs to ensure that PIDFD_INFO_EXIT information is available whenever a pidfd for a reaped task is created by SO_PEERPIDFD. The uapi promises that reaped pidfds are only handed out if it is guaranteed that the caller sees the exit information: TEST_F(pidfd_info, success_reaped) { struct pidfd_info info =3D { .mask =3D PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT, }; /* * Process has already been reaped and PIDFD_INFO_EXIT been set. * Verify that we can retrieve the exit status of the process. */ ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0); ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS)); ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT)); ASSERT_TRUE(WIFEXITED(info.exit_code)); ASSERT_EQ(WEXITSTATUS(info.exit_code), 0); } To hand out pidfds for reaped processes we thus allocate a pidfs entry for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is stashed and drop it when the socket is destroyed. This guarantees that exit information will always be recorded for the sk->sk_peer_pid task and we can hand out pidfds for reaped processes.
Link: https://lore.kernel.org/lkml/20230807085203.819772-1-david@readahead.= eu [1] Signed-off-by: Christian Brauner Reviewed-by: David Rheinsberg Reviewed-by: Kuniyuki Iwashima --- net/unix/af_unix.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++---= ---- 1 file changed, 74 insertions(+), 11 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index f78a2492826f..472f8aa9ea15 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -100,6 +100,7 @@ #include #include #include +#include #include #include #include @@ -643,6 +644,9 @@ static void unix_sock_destructor(struct sock *sk) return; } =20 + if (sk->sk_peer_pid) + pidfs_put_pid(sk->sk_peer_pid); + if (u->addr) unix_release_addr(u->addr); =20 @@ -734,13 +738,48 @@ static void unix_release_sock(struct sock *sk, int em= brion) unix_gc(); /* Garbage collect fds */ } =20 -static void init_peercred(struct sock *sk) +struct unix_peercred { + struct pid *peer_pid; + const struct cred *peer_cred; +}; + +static inline int prepare_peercred(struct unix_peercred *peercred) { - sk->sk_peer_pid =3D get_pid(task_tgid(current)); - sk->sk_peer_cred =3D get_current_cred(); + struct pid *pid; + int err; + + pid =3D task_tgid(current); + err =3D pidfs_register_pid(pid); + if (likely(!err)) { + peercred->peer_pid =3D get_pid(pid); + peercred->peer_cred =3D get_current_cred(); + } + return err; } =20 -static void update_peercred(struct sock *sk) +static void drop_peercred(struct unix_peercred *peercred) +{ + const struct cred *cred =3D NULL; + struct pid *pid =3D NULL; + + might_sleep(); + + swap(peercred->peer_pid, pid); + swap(peercred->peer_cred, cred); + + pidfs_put_pid(pid); + put_pid(pid); + put_cred(cred); +} + +static inline void init_peercred(struct sock *sk, + const struct unix_peercred *peercred) +{ + sk->sk_peer_pid =3D peercred->peer_pid; + sk->sk_peer_cred =3D peercred->peer_cred; +} + +static void update_peercred(struct sock *sk, struct unix_peercred *peercre= d) { const struct cred *old_cred; 
struct pid *old_pid; @@ -748,11 +787,11 @@ static void update_peercred(struct sock *sk) spin_lock(&sk->sk_peer_lock); old_pid =3D sk->sk_peer_pid; old_cred =3D sk->sk_peer_cred; - init_peercred(sk); + init_peercred(sk, peercred); spin_unlock(&sk->sk_peer_lock); =20 - put_pid(old_pid); - put_cred(old_cred); + peercred->peer_pid =3D old_pid; + peercred->peer_cred =3D old_cred; } =20 static void copy_peercred(struct sock *sk, struct sock *peersk) @@ -761,6 +800,7 @@ static void copy_peercred(struct sock *sk, struct sock = *peersk) =20 spin_lock(&sk->sk_peer_lock); sk->sk_peer_pid =3D get_pid(peersk->sk_peer_pid); + pidfs_get_pid(sk->sk_peer_pid); sk->sk_peer_cred =3D get_cred(peersk->sk_peer_cred); spin_unlock(&sk->sk_peer_lock); } @@ -770,6 +810,7 @@ static int unix_listen(struct socket *sock, int backlog) int err; struct sock *sk =3D sock->sk; struct unix_sock *u =3D unix_sk(sk); + struct unix_peercred peercred =3D {}; =20 err =3D -EOPNOTSUPP; if (sock->type !=3D SOCK_STREAM && sock->type !=3D SOCK_SEQPACKET) @@ -777,6 +818,9 @@ static int unix_listen(struct socket *sock, int backlog) err =3D -EINVAL; if (!READ_ONCE(u->addr)) goto out; /* No listens on an unbound socket */ + err =3D prepare_peercred(&peercred); + if (err) + goto out; unix_state_lock(sk); if (sk->sk_state !=3D TCP_CLOSE && sk->sk_state !=3D TCP_LISTEN) goto out_unlock; @@ -786,11 +830,12 @@ static int unix_listen(struct socket *sock, int backl= og) WRITE_ONCE(sk->sk_state, TCP_LISTEN); =20 /* set credentials so connect can copy them */ - update_peercred(sk); + update_peercred(sk, &peercred); err =3D 0; =20 out_unlock: unix_state_unlock(sk); + drop_peercred(&peercred); out: return err; } @@ -1525,6 +1570,7 @@ static int unix_stream_connect(struct socket *sock, s= truct sockaddr *uaddr, struct sockaddr_un *sunaddr =3D (struct sockaddr_un *)uaddr; struct sock *sk =3D sock->sk, *newsk =3D NULL, *other =3D NULL; struct unix_sock *u =3D unix_sk(sk), *newu, *otheru; + struct unix_peercred peercred =3D {}; 
struct net *net =3D sock_net(sk); struct sk_buff *skb =3D NULL; unsigned char state; @@ -1561,6 +1607,10 @@ static int unix_stream_connect(struct socket *sock, = struct sockaddr *uaddr, goto out; } =20 + err =3D prepare_peercred(&peercred); + if (err) + goto out; + /* Allocate skb for sending to listening sock */ skb =3D sock_wmalloc(newsk, 1, 0, GFP_KERNEL); if (!skb) { @@ -1636,7 +1686,7 @@ static int unix_stream_connect(struct socket *sock, s= truct sockaddr *uaddr, unix_peer(newsk) =3D sk; newsk->sk_state =3D TCP_ESTABLISHED; newsk->sk_type =3D sk->sk_type; - init_peercred(newsk); + init_peercred(newsk, &peercred); newu =3D unix_sk(newsk); newu->listener =3D other; RCU_INIT_POINTER(newsk->sk_wq, &newu->peer_wq); @@ -1695,20 +1745,33 @@ static int unix_stream_connect(struct socket *sock,= struct sockaddr *uaddr, out_free_sk: unix_release_sock(newsk, 0); out: + drop_peercred(&peercred); return err; } =20 static int unix_socketpair(struct socket *socka, struct socket *sockb) { + struct unix_peercred ska_peercred =3D {}, skb_peercred =3D {}; struct sock *ska =3D socka->sk, *skb =3D sockb->sk; + int err; + + err =3D prepare_peercred(&ska_peercred); + if (err) + return err; + + err =3D prepare_peercred(&skb_peercred); + if (err) { + drop_peercred(&ska_peercred); + return err; + } =20 /* Join our sockets back to back */ sock_hold(ska); sock_hold(skb); unix_peer(ska) =3D skb; unix_peer(skb) =3D ska; - init_peercred(ska); - init_peercred(skb); + init_peercred(ska, &ska_peercred); + init_peercred(skb, &skb_peercred); =20 ska->sk_state =3D TCP_ESTABLISHED; skb->sk_state =3D TCP_ESTABLISHED; --=20 2.47.2 From nobody Sun Feb 8 02:21:39 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D08A323372C; Fri, 25 Apr 2025 08:11:51 +0000 (UTC) Authentication-Results: 
smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745568711; cv=none; b=VejYFBFi8WgX4xEuskSPqync5EJVOT1FJXDu7AwiMcKiMEiTuQ34RMDRemOwkMni8hT1lXCVc0Kw3J51v4Q+WV+VVkR345dCCdP6PQTwKMETP473A9p2bnMneT/OrJ2oEf3OwdLT8vSPEzBRszTeXeqoubPexspJp4ocP116cRk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745568711; c=relaxed/simple; bh=nMJ3psWToc89ewO/cJ/7prDnz/wFMrPXq0ZW1NO0b4M=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=Ngxvy2O36c9QXCivN+DSprHd5A/ahtxfcAQ8U0SmIGpjaTZjrpxS6gozmLPfsr/qKtGNUaJKHKhHCFIVWWLb9SCXt6laQeITb0EjyUHjpu+rJIE7XdxhPAXhTf8zSQuZ/WfO3iIwuMO702Yv7VrUSGrRjfKLGdclFCGxpLPK7d8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=hV45fax+; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="hV45fax+" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 516A8C4CEEA; Fri, 25 Apr 2025 08:11:47 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1745568711; bh=nMJ3psWToc89ewO/cJ/7prDnz/wFMrPXq0ZW1NO0b4M=; h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=hV45fax+HKLTqmZh3rULF2KkDHyVLvBLBXQQhRlc09iJfYhXdvgDLmJ0TJE0wXFw/ jzrG0wX3FG20I75VnwriSDpj13F6eWQIPzG/35gpzGdLwW8LeOPcGZz2U3Kh6cKO+j dIJAZdXCJ5MEjqWKEqey7u4/a0xffLVWzwOSR/TbDC4Ds+UnvdNhlsISZ3736njZE9 CL3ZJBoywv+4vviYO4iUPTZboqZErwjHqJ0z45ZUp1ANm+cYB71ZyHErHa+FZDQsgn Jk0c/j/KiE2IiLPW6KjVX0qC6Chc8S9cPUdVbQ+sENTCBhfv6IU/Xwwa+rKNLHc/7Z nV1mG5gjDawDQ== From: Christian Brauner Date: Fri, 25 Apr 2025 10:11:32 +0200 Subject: [PATCH v2 3/4] pidfs: get rid of __pidfd_prepare() Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; 
charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250425-work-pidfs-net-v2-3-450a19461e75@kernel.org> References: <20250425-work-pidfs-net-v2-0-450a19461e75@kernel.org> In-Reply-To: <20250425-work-pidfs-net-v2-0-450a19461e75@kernel.org> To: Oleg Nesterov , Kuniyuki Iwashima , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org, David Rheinsberg , Jan Kara , Alexander Mikhalitsyn , Luca Boccassi , Lennart Poettering , Daan De Meyer , Mike Yuan , Christian Brauner X-Mailer: b4 0.15-dev-c25d1 X-Developer-Signature: v=1; a=openpgp-sha256; l=8672; i=brauner@kernel.org; h=from:subject:message-id; bh=nMJ3psWToc89ewO/cJ/7prDnz/wFMrPXq0ZW1NO0b4M=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWRwO29NzEtx/dH5YpLiwYu803esbJhSpOP4+MaKE87CB ud25BmndJSyMIhxMciKKbI4tJuEyy3nqdhslKkBM4eVCWQIAxenAEzk1U1GhnlT85+8dIttP521 yGK9jXaNzZZn8ff7meeFabox6jKs/cTwm+U5T3TTF0WzoO3HT02fPMNm4//CnsKpXnxWj+a6Rd6 IZgMA X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624 Fold it into pidfd_prepare() and rename PIDFD_CLONE to PIDFD_STALE to indicate that the passed pid might not have task linkage and no explicit check for that should be performed. 
Reviewed-by: Oleg Nesterov Signed-off-by: Christian Brauner Reviewed-by: David Rheinsberg --- fs/pidfs.c | 22 +++++++----- include/linux/pid.h | 2 +- include/uapi/linux/pidfd.h | 2 +- kernel/fork.c | 83 ++++++++++++++++--------------------------= ---- 4 files changed, 44 insertions(+), 65 deletions(-) diff --git a/fs/pidfs.c b/fs/pidfs.c index 308792d4b11a..0afaffd5a18a 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -768,7 +768,7 @@ static inline bool pidfs_pid_valid(struct pid *pid, con= st struct path *path, { enum pid_type type; =20 - if (flags & PIDFD_CLONE) + if (flags & PIDFD_STALE) return true; =20 /* @@ -777,10 +777,14 @@ static inline bool pidfs_pid_valid(struct pid *pid, c= onst struct path *path, * pidfd has been allocated perform another check that the pid * is still alive. If it is exit information is available even * if the task gets reaped before the pidfd is returned to - * userspace. The only exception is PIDFD_CLONE where no task - * linkage has been established for @pid yet and the kernel is - * in the middle of process creation so there's nothing for - * pidfs to miss. + * userspace. The only exceptions are indicated by PIDFD_STALE: + * + * (1) The kernel is in the middle of task creation and thus no + * task linkage has been established yet. + * (2) The caller knows @pid has been registered in pidfs at a + * time when the task was still alive. + * + * In both cases exit information will have been reported. */ if (flags & PIDFD_THREAD) type =3D PIDTYPE_PID; @@ -874,11 +878,11 @@ struct file *pidfs_alloc_file(struct pid *pid, unsign= ed int flags) int ret; =20 /* - * Ensure that PIDFD_CLONE can be passed as a flag without + * Ensure that PIDFD_STALE can be passed as a flag without * overloading other uapi pidfd flags.
*/ - BUILD_BUG_ON(PIDFD_CLONE =3D=3D PIDFD_THREAD); - BUILD_BUG_ON(PIDFD_CLONE =3D=3D PIDFD_NONBLOCK); + BUILD_BUG_ON(PIDFD_STALE =3D=3D PIDFD_THREAD); + BUILD_BUG_ON(PIDFD_STALE =3D=3D PIDFD_NONBLOCK); =20 ret =3D path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path); if (ret < 0) @@ -887,7 +891,7 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned= int flags) if (!pidfs_pid_valid(pid, &path, flags)) return ERR_PTR(-ESRCH); =20 - flags &=3D ~PIDFD_CLONE; + flags &=3D ~PIDFD_STALE; pidfd_file =3D dentry_open(&path, flags, current_cred()); /* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. */ if (!IS_ERR(pidfd_file)) diff --git a/include/linux/pid.h b/include/linux/pid.h index 311ecebd7d56..453ae6d8a68d 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -77,7 +77,7 @@ struct file; struct pid *pidfd_pid(const struct file *file); struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags); struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags); -int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret); +int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret_f= ile); void do_notify_pidfd(struct task_struct *task); =20 static inline struct pid *get_pid(struct pid *pid) diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h index 2970ef44655a..8c1511edd0e9 100644 --- a/include/uapi/linux/pidfd.h +++ b/include/uapi/linux/pidfd.h @@ -12,7 +12,7 @@ #define PIDFD_THREAD O_EXCL #ifdef __KERNEL__ #include -#define PIDFD_CLONE CLONE_PIDFD +#define PIDFD_STALE CLONE_PIDFD #endif =20 /* Flags for pidfd_send_signal(). 
*/ diff --git a/kernel/fork.c b/kernel/fork.c index f7403e1fb0d4..1d95f4dae327 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2035,55 +2035,11 @@ static inline void rcu_copy_process(struct task_str= uct *p) #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */ } =20 -/** - * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd - * @pid: the struct pid for which to create a pidfd - * @flags: flags of the new @pidfd - * @ret: Where to return the file for the pidfd. - * - * Allocate a new file that stashes @pid and reserve a new pidfd number in= the - * caller's file descriptor table. The pidfd is reserved but not installed= yet. - * - * The helper doesn't perform checks on @pid which makes it useful for pid= fds - * created via CLONE_PIDFD where @pid has no task attached when the pidfd = and - * pidfd file are prepared. - * - * If this function returns successfully the caller is responsible to eith= er - * call fd_install() passing the returned pidfd and pidfd file as argument= s in - * order to install the pidfd into its file descriptor table or they must = use - * put_unused_fd() and fput() on the returned pidfd and pidfd file - * respectively. - * - * This function is useful when a pidfd must already be reserved but there - * might still be points of failure afterwards and the caller wants to ens= ure - * that no pidfd is leaked into its file descriptor table. - * - * Return: On success, a reserved pidfd is returned from the function and = a new - * pidfd file is returned in the last argument to the function. On - * error, a negative error code is returned from the function and = the - * last argument remains unchanged. 
- */ -static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct fil= e **ret) -{ - struct file *pidfd_file; - - CLASS(get_unused_fd, pidfd)(O_CLOEXEC); - if (pidfd < 0) - return pidfd; - - pidfd_file =3D pidfs_alloc_file(pid, flags | O_RDWR); - if (IS_ERR(pidfd_file)) - return PTR_ERR(pidfd_file); - - *ret =3D pidfd_file; - return take_fd(pidfd); -} - /** * pidfd_prepare - allocate a new pidfd_file and reserve a pidfd * @pid: the struct pid for which to create a pidfd * @flags: flags of the new @pidfd - * @ret: Where to return the pidfd. + * @ret_file: return the new pidfs file * * Allocate a new file that stashes @pid and reserve a new pidfd number in= the * caller's file descriptor table. The pidfd is reserved but not installed= yet. @@ -2106,16 +2062,26 @@ static int __pidfd_prepare(struct pid *pid, unsigne= d int flags, struct file **re * error, a negative error code is returned from the function and = the * last argument remains unchanged. */ -int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) +int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret_f= ile) { + struct file *pidfs_file; + /* - * While holding the pidfd waitqueue lock removing the task - * linkage for the thread-group leader pid (PIDTYPE_TGID) isn't - * possible. Thus, if there's still task linkage for PIDTYPE_PID - * not having thread-group leader linkage for the pid means it - * wasn't a thread-group leader in the first place. + * PIDFD_STALE is only allowed to be passed if the caller knows + * that @pid is already registered in pidfs and thus + * PIDFD_INFO_EXIT information is guaranteed to be available. */ - scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) { + if (!(flags & PIDFD_STALE)) { + /* + * While holding the pidfd waitqueue lock removing the + * task linkage for the thread-group leader pid + * (PIDTYPE_TGID) isn't possible. 
Thus, if there's still + * task linkage for PIDTYPE_PID not having thread-group + * leader linkage for the pid means it wasn't a + * thread-group leader in the first place. + */ + guard(spinlock_irq)(&pid->wait_pidfd.lock); + /* Task has already been reaped. */ if (!pid_has_task(pid, PIDTYPE_PID)) return -ESRCH; @@ -2128,7 +2094,16 @@ int pidfd_prepare(struct pid *pid, unsigned int flag= s, struct file **ret) return -ENOENT; } =20 - return __pidfd_prepare(pid, flags, ret); + CLASS(get_unused_fd, pidfd)(O_CLOEXEC); + if (pidfd < 0) + return pidfd; + + pidfs_file =3D pidfs_alloc_file(pid, flags | O_RDWR); + if (IS_ERR(pidfs_file)) + return PTR_ERR(pidfs_file); + + *ret_file =3D pidfs_file; + return take_fd(pidfd); } =20 static void __delayed_free_task(struct rcu_head *rhp) @@ -2477,7 +2452,7 @@ __latent_entropy struct task_struct *copy_process( * Note that no task has been attached to @pid yet indicate * that via CLONE_PIDFD. */ - retval =3D __pidfd_prepare(pid, flags | PIDFD_CLONE, &pidfile); + retval =3D pidfd_prepare(pid, flags | PIDFD_STALE, &pidfile); if (retval < 0) goto bad_fork_free_pid; pidfd =3D retval; --=20 2.47.2 From nobody Sun Feb 8 02:21:39 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C92262367BB; Fri, 25 Apr 2025 08:11:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745568715; cv=none; b=ZhImhkAUKhgKfokgxLNfzJMIl/jqyxoYZ2A2NJ4kuw9pfSLGbh+iOJUdTuG2n8tfziK6jaAKAmYdd2Txtm4BIw6VuZn1Gls2SJjs1Qii9U9/6h0KVX+DSxMnwPAPd3Ptn4gWqYegrPOdJdDTwKAQ5iFWgyzmfW2fGsrNHEHRcjc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745568715; c=relaxed/simple; bh=lKHgXD3YJcZUCX/EWbivAcuAmix3WP6EI2DGGplB02U=; 
h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=eTXoGuwCeEvhnvw/JQlh1b2serCjVCFnd+pUX/7Q0jRM54651x8PHPyaJIYKy/Q34esrCiceSrHOdW9lBnVtH71TxBbtQkv6uJaHZ6oDDKRd3hlq/T0E5uUnVDJdW4Da5An/zD8k5jgcEMcPYHeHenmhqavSjZ+Y+5CFBTLR7IU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=tZNmf8pI; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="tZNmf8pI" Received: by smtp.kernel.org (Postfix) with ESMTPSA id BBCC7C4CEE4; Fri, 25 Apr 2025 08:11:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1745568715; bh=lKHgXD3YJcZUCX/EWbivAcuAmix3WP6EI2DGGplB02U=; h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=tZNmf8pIDsN9Gw7AU9oWrmMNIF4r8dR8ubLLapWz9qXJ00/fnXawAS5l66BqV07si 8g5r/5OncQJ4roZi82KZ3hZ3gWYyLY4RXI8xapdpfmhyGGv0Co5aTRqKe5VU3lzGL8 R39DxSY6J6LpyB/MpbpRq/kkNhQ4Bv5hQlZXQrk0pyXVzAtA3S09On2qxuMY2VN6ru v7OJA9X+SWYRqra4WPxUEDRJp/6PKTqkqlC7KsysrlEWro/CqJGQCnDh8NebjkvpMd /hzV2PP2iYDb8KkBTGu5awpnxNCgI/vARV1lW2pfXMxpQNH0l1FMh3oNI8QaiYxnr3 9cYd5wg/MpHNw== From: Christian Brauner Date: Fri, 25 Apr 2025 10:11:33 +0200 Subject: [PATCH v2 4/4] net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250425-work-pidfs-net-v2-4-450a19461e75@kernel.org> References: <20250425-work-pidfs-net-v2-0-450a19461e75@kernel.org> In-Reply-To: <20250425-work-pidfs-net-v2-0-450a19461e75@kernel.org> To: Oleg Nesterov , Kuniyuki Iwashima , "David S. 
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org, David Rheinsberg , Jan Kara , Alexander Mikhalitsyn , Luca Boccassi , Lennart Poettering , Daan De Meyer , Mike Yuan , Christian Brauner X-Mailer: b4 0.15-dev-c25d1 X-Developer-Signature: v=1; a=openpgp-sha256; l=1187; i=brauner@kernel.org; h=from:subject:message-id; bh=lKHgXD3YJcZUCX/EWbivAcuAmix3WP6EI2DGGplB02U=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWRwO2+NV1euXjCjVvrErqdhu7cErzp6Z9E7xTV5k5tn1 B3STJ1b01HKwiDGxSArpsji0G4SLrecp2KzUaYGzBxWJpAhDFycAjCRH5KMDNOTy0+rOjNydDCs SH5/MkHIfsP+qWkPjE78f/XJq+qZwBJGhsO5vA4HBThkHdMMu7a0/zu1OHt6Dcf293uulPA/MGd dyAYA X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624 Now that all preconditions are met, allow handing out pidfds for reaped sk->sk_peer_pids. Signed-off-by: Christian Brauner Reviewed-by: David Rheinsberg Reviewed-by: Kuniyuki Iwashima --- net/core/sock.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/net/core/sock.c b/net/core/sock.c index b969d2210656..5ad0b53d0fb0 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -148,6 +148,8 @@ =20 #include =20 +#include + #include "dev.h" =20 static DEFINE_MUTEX(proto_list_mutex); @@ -1891,18 +1893,10 @@ int sk_getsockopt(struct sock *sk, int level, int o= ptname, if (!peer_pid) return -ENODATA; =20 - pidfd =3D pidfd_prepare(peer_pid, 0, &pidfd_file); + pidfd =3D pidfd_prepare(peer_pid, PIDFD_STALE, &pidfd_file); put_pid(peer_pid); - if (pidfd < 0) { - /* - * dbus-broker relies on -EINVAL being returned - * to indicate ESRCH. Paper over it until this - * is fixed in userspace. - */ - if (pidfd =3D=3D -ESRCH) - pidfd =3D -EINVAL; + if (pidfd < 0) return pidfd; - } =20 if (copy_to_sockptr(optval, &pidfd, len) || copy_to_sockptr(optlen, &len, sizeof(int))) { --=20 2.47.2