From nobody Sat Feb 7 22:54:48 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 424602580E7; Thu, 24 Apr 2025 12:25:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745497518; cv=none; b=M4MPcn35J+EVonSp/mZKU7MvXThxu6rPdDSzrzvsirVqb8shDNVBYRcTbB9Uki9evS+OwS+pZNBGP38eDBOJzphTCc9Sw6+ceqyY/hketzFH00aqj601yaucTaMSXnPIvm5aoJwDCgOAFv1l3eA57Q9nZhMDFEa1o8MzwbDk93Y= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745497518; c=relaxed/simple; bh=uwdHfF0zuWFaUykwHH51UXZIlsTqiDAedbt9br9ZKbo=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=tj/lT3I1GJxY9PQuDg0RCHlWdEKD9LlHxPRgHYQCDDeqhneOQQ80r0uWLavsyT5ezrwqiZlKafEnDI3dDxelGhDjk6QXiPj9zaRcEBnmIiAsgFeCVigzpy13oMSxoDHLXICoTWvnYAplVOd8EQI/IZAkSP72JAw/uvf4OZ20R5Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=DsdinZro; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="DsdinZro" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 01DD8C4CEE8; Thu, 24 Apr 2025 12:25:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1745497517; bh=uwdHfF0zuWFaUykwHH51UXZIlsTqiDAedbt9br9ZKbo=; h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=DsdinZroUGNomIbZ9rloX3tK/a+iW5QE8pZptL4/Xw1oNwApIwh1dmgna91hDSkMy 7DyccZquXYhG3g4mBJbXTs8CAxzp03r5FjgRunsZDovk2X9ZV7oiFrL0CN95WYCh8A gSm8Oe3uXl0KzIk3F5cT9UBz2jhj2ZMAeEaBpui3qb1A0I7GM1NteaW/oVLz8Qg2aw 
FR0aLbnU5Clm9Oz1pQD0c63GNdPMAHQ3Q0NbU08mvZjavw/aUYFKDFJHO1/W96AbHS qT6RbEkWPeWRtKo1SFmFyr8z8xC+xQPdA1Wp0wfa7RMkUO24en8seVw0WZZHdNEC8D d2aboEc9HV4mA== From: Christian Brauner Date: Thu, 24 Apr 2025 14:24:34 +0200 Subject: [PATCH RFC 1/4] pidfs: register pid in pidfs Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250424-work-pidfs-net-v1-1-0dc97227d854@kernel.org> References: <20250424-work-pidfs-net-v1-0-0dc97227d854@kernel.org> In-Reply-To: <20250424-work-pidfs-net-v1-0-0dc97227d854@kernel.org> To: Oleg Nesterov , Kuniyuki Iwashima , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org, David Rheinsberg , Jan Kara , Alexander Mikhalitsyn , Luca Boccassi , Lennart Poettering , Daan De Meyer , Mike Yuan , Christian Brauner X-Mailer: b4 0.15-dev-c25d1 X-Developer-Signature: v=1; a=openpgp-sha256; l=2885; i=brauner@kernel.org; h=from:subject:message-id; bh=uwdHfF0zuWFaUykwHH51UXZIlsTqiDAedbt9br9ZKbo=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWRw6S49vdXRXiXs7J+6w2lZj4su3D35KYB7f0j3/H0Od gfFBG6pdZSyMIhxMciKKbI4tJuEyy3nqdhslKkBM4eVCWQIAxenAEzk8ySGf4Z9oSterlQMbX7/ jr/Q5t9lQ6Zbupts18a+OHZ/rZThh28M/+s6XZqVmyyk9gu8aBNrfdSQy7hxxvEJ3y9cttVjCaz w4QMA X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624 Add simple helpers that allow a struct pid to be pinned via a pidfs dentry/inode. If no pidfs dentry exists a new one will be allocated for it. A reference is taken by pidfs on @pid. The reference must be released via pidfs_put_pid(). This will allow AF_UNIX sockets to allocate a dentry for the peer credentials pid at the time they are recorded where we know the task is still alive. 
When the task gets reaped its exit status is guaranteed to be recorded and a pidfd can be handed out for the reaped task. Signed-off-by: Christian Brauner Reviewed-by: Oleg Nesterov --- fs/pidfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++= ++++ include/linux/pidfs.h | 3 +++ 2 files changed, 61 insertions(+) diff --git a/fs/pidfs.c b/fs/pidfs.c index d64a4cbeb0da..8e6c11774c60 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -896,6 +896,64 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigne= d int flags) return pidfd_file; } =20 +/** + * pidfs_register_pid - pin a struct pid through pidfs + * @pid: pid to pin + * + * Pin a struct pid through pidfs. Needs to be paired with + * pidfs_put_pid() to not risk leaking the pidfs dentry and inode. + * + * Return: On success zero, on error a negative error code is returned. + */ +int pidfs_register_pid(struct pid *pid) +{ + struct path path __free(path_put) =3D {}; + int ret; + + might_sleep(); + + if (!pid) + return 0; + + ret =3D path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path); + if (unlikely(ret)) + return ret; + path.dentry =3D NULL; + return 0; +} + +/** + * pidfs_get_pid - pin a struct pid through pidfs + * @pid: pid to pin + * + * Similar to pidfs_register_pid() but only valid if the caller knows + * there's a reference to the @pid through its dentry already. + */ +void pidfs_get_pid(struct pid *pid) +{ + if (!pid) + return; + + WARN_ON_ONCE(stashed_dentry_get(&pid->stashed) =3D=3D NULL); +} + +/** + * pidfs_put_pid - drop a pidfs reference + * @pid: pid to drop + * + * Drop a reference to @pid via pidfs. This is only safe if the + * reference has been taken via pidfs_get_pid(). 
+ */ +void pidfs_put_pid(struct pid *pid) +{ + might_sleep(); + + if (!pid) + return; + + dput(pid->stashed); +} + static void pidfs_inode_init_once(void *data) { struct pidfs_inode *pi =3D data; diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h index 05e6f8f4a026..2676890c4d0d 100644 --- a/include/linux/pidfs.h +++ b/include/linux/pidfs.h @@ -8,5 +8,8 @@ void pidfs_add_pid(struct pid *pid); void pidfs_remove_pid(struct pid *pid); void pidfs_exit(struct task_struct *tsk); extern const struct dentry_operations pidfs_dentry_operations; +int pidfs_register_pid(struct pid *pid); +void pidfs_get_pid(struct pid *pid); +void pidfs_put_pid(struct pid *pid); =20 #endif /* _LINUX_PID_FS_H */ --=20 2.47.2 From nobody Sat Feb 7 22:54:48 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 83CCB25DCE4; Thu, 24 Apr 2025 12:25:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745497522; cv=none; b=b4D0BaWj4ERkukoutW7imnMrfTUTLkUMqPGuUmRohhsd1HTYvWdzkhoU6yqt+JmDXM1VW5kGnKkr6N6jiIp7BknEoBZUK/PnYeuJn5mD3+BnExbTUG+YN4MD6wMGRJJ8eiEycjuI7CeJZez+RR0NUi6yJj6Uo1OyN7baBNgrOZw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745497522; c=relaxed/simple; bh=BAR2mPEdQmLCpUNaicjcrIVZW8pnh6HvsXslZl4NsA4=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=LGcBrs0zjxf8x5O1V8XeAOQ7PDxUnwERh2P5Wt7gdhDx/bfNg0K7Xsl11s4RGiJhCou8NjgdEeWYvPgicjFjaBENSz8Su2MQeQL63RauzHo+n3t/UyCLi825qOxI43JQHWPWeQeaUatRFwr2c1w3GemACIt2ql5PY60vi/olPc4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=XvB/hpcv; arc=none 
smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="XvB/hpcv" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4186AC4CEE3; Thu, 24 Apr 2025 12:25:18 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1745497522; bh=BAR2mPEdQmLCpUNaicjcrIVZW8pnh6HvsXslZl4NsA4=; h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=XvB/hpcvs4PfU9v42lLjOJdW3kpZK1j1TMROnLbJq64GRoLVUpaRLm5VHGmvCzc70 lAbEx5+G3Abvc81v2gPDoPTOtcuZxeNQR/jVfU9fr9KIAFTsbd+WLl/Ba81FRmHl40 9yTYxeDkAabxaCwMLlwnWa14pKuRB7I2f/seiuEjUj2HZLzx+G5A3+N1x0WTa1EsTe 6ucCtdd0G8s82ZF8f3wlS0KPvnhh/OCyEOjYSOhrTUDGiXMNYeWzoDDWmsUdoKlo6c 7CwQmDxZfW71xucG6W5Zw9TpWbWFUHsiZDdR7jZORuL897CT11ANTKHRr6h7LoI7zV LEP80n+bdHReA== From: Christian Brauner Date: Thu, 24 Apr 2025 14:24:35 +0200 Subject: [PATCH RFC 2/4] net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250424-work-pidfs-net-v1-2-0dc97227d854@kernel.org> References: <20250424-work-pidfs-net-v1-0-0dc97227d854@kernel.org> In-Reply-To: <20250424-work-pidfs-net-v1-0-0dc97227d854@kernel.org> To: Oleg Nesterov , Kuniyuki Iwashima , "David S. 
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org, David Rheinsberg , Jan Kara , Alexander Mikhalitsyn , Luca Boccassi , Lennart Poettering , Daan De Meyer , Mike Yuan , Christian Brauner X-Mailer: b4 0.15-dev-c25d1 X-Developer-Signature: v=1; a=openpgp-sha256; l=9059; i=brauner@kernel.org; h=from:subject:message-id; bh=BAR2mPEdQmLCpUNaicjcrIVZW8pnh6HvsXslZl4NsA4=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWRw6S7lXeDqu/WB+cyNTxbvWt9y/Mzf/1v+p35wmGj3I r+54O/rOR2lLAxiXAyyYoosDu0m4XLLeSo2G2VqwMxhZQIZwsDFKQATcX3K8N/1ZsBOpddXQtfL uEpvlNxWHvf0ge2tT7EyehcTUtPeTuBhZDixZCljUEXQxOc8xV++qm7rWv9W6cps1eyr/AmSp3a 85uYGAA== X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624 SO_PEERPIDFD currently doesn't support handing out pidfds if the sk->sk_peer_pid thread-group leader has already been reaped. In this case it currently returns EINVAL. Userspace still wants to get a pidfd for a reaped process to have a stable handle it can pass on. This is especially useful now that it is possible to retrieve exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s PIDFD_INFO_EXIT flag. Another summary has been provided by David in [1]: > A pidfd can outlive the task it refers to, and thus user-space must > already be prepared that the task underlying a pidfd is gone at the time > they get their hands on the pidfd. For instance, resolving the pidfd to > a PID via the fdinfo must be prepared to read `-1`. > > Despite user-space knowing that a pidfd might be stale, several kernel > APIs currently add another layer that checks for this. In particular, > SO_PEERPIDFD returns `EINVAL` if the peer-task was already reaped, > but returns a stale pidfd if the task is reaped immediately after the > respective alive-check. 
> > This has the unfortunate effect that user-space now has two ways to > check for the exact same scenario: A syscall might return > EINVAL/ESRCH/... *or* the pidfd might be stale, even though there is no > particular reason to distinguish both cases. This also propagates > through user-space APIs, which pass on pidfds. They must be prepared to > pass on `-1` *or* the pidfd, because there is no guaranteed way to get a > stale pidfd from the kernel. > Userspace must already deal with a pidfd referring to a reaped task as > the task may exit and get reaped at any time while there are still many > pidfds referring to it. In order to allow handing out pidfds for reaped tasks, SO_PEERPIDFD needs to ensure that PIDFD_INFO_EXIT information is available whenever a pidfd for a reaped task is created by SO_PEERPIDFD. The uapi promises that reaped pidfds are only handed out if it is guaranteed that the caller sees the exit information: TEST_F(pidfd_info, success_reaped) { struct pidfd_info info =3D { .mask =3D PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT, }; /* * Process has already been reaped and PIDFD_INFO_EXIT been set. * Verify that we can retrieve the exit status of the process. */ ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0); ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS)); ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT)); ASSERT_TRUE(WIFEXITED(info.exit_code)); ASSERT_EQ(WEXITSTATUS(info.exit_code), 0); } To hand out pidfds for reaped processes we thus allocate a pidfs entry for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is stashed and drop it when the socket is destroyed. This guarantees that exit information will always be recorded for the sk->sk_peer_pid task and we can hand out pidfds for reaped processes. 
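The lifetime rule above (pin the pidfs entry when the peer pid is stashed, drop the pin when the socket is destroyed) can be illustrated with a toy userspace model. This is only a sketch and not kernel code: struct model_pid and every model_*() helper are hypothetical names made up for this illustration, and the rule that exit status is only saved while an entry is pinned is a simplified assumption about what pidfs does.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy userspace model, NOT kernel code. All model_* names are
 * hypothetical and exist only for this illustration.
 */
struct model_pid {
	int pins;		/* references held through the "pidfs entry" */
	bool reaped;		/* the task has been reaped */
	bool exit_recorded;	/* exit status was saved in the entry */
	int exit_code;
};

/* Stash time: pin the entry while the task is known to be alive. */
static void model_register_pid(struct model_pid *pid)
{
	assert(!pid->reaped);
	pid->pins++;
}

/* Socket destruction: drop the pin taken at stash time. */
static void model_put_pid(struct model_pid *pid)
{
	assert(pid->pins > 0);
	pid->pins--;
}

/* Reaping: in this model exit status is only saved if a pin is held. */
static void model_reap(struct model_pid *pid, int exit_code)
{
	pid->reaped = true;
	if (pid->pins > 0) {
		pid->exit_code = exit_code;
		pid->exit_recorded = true;
	}
}

/*
 * A handle for a possibly reaped task may be handed out if the task
 * is still alive or if its exit status is known to be recorded.
 */
static bool model_may_hand_out(const struct model_pid *pid)
{
	return !pid->reaped || pid->exit_recorded;
}
```

With a pin held across the reap, model_may_hand_out() stays true and the exit code remains available. Without a pin the handle has to be refused, which corresponds to the EINVAL that SO_PEERPIDFD returns before this series.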
Link: https://lore.kernel.org/lkml/20230807085203.819772-1-david@readahead.= eu [1] Signed-off-by: Christian Brauner --- net/unix/af_unix.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++---= ---- 1 file changed, 79 insertions(+), 11 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index f78a2492826f..83b5aebf499e 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -100,6 +100,7 @@ #include #include #include +#include #include #include #include @@ -643,6 +644,14 @@ static void unix_sock_destructor(struct sock *sk) return; } =20 + if (sock_flag(sk, SOCK_RCU_FREE)) { + pr_info("Attempting to release RCU protected socket with sleeping locks:= %p\n", sk); + return; + } + + if (sk->sk_peer_pid) + pidfs_put_pid(sk->sk_peer_pid); + if (u->addr) unix_release_addr(u->addr); =20 @@ -734,13 +743,48 @@ static void unix_release_sock(struct sock *sk, int em= brion) unix_gc(); /* Garbage collect fds */ } =20 -static void init_peercred(struct sock *sk) +struct af_unix_peercred { + struct pid *peer_pid; + const struct cred *peer_cred; +}; + +static inline int prepare_peercred(struct af_unix_peercred *peercred) +{ + struct pid *pid; + int err; + + pid =3D task_tgid(current); + err =3D pidfs_register_pid(pid); + if (likely(!err)) { + peercred->peer_pid =3D get_pid(pid); + peercred->peer_cred =3D get_current_cred(); + } + return err; +} + +static void drop_peercred(struct af_unix_peercred *peercred) +{ + struct pid *pid =3D NULL; + const struct cred *cred =3D NULL; + + might_sleep(); + + swap(peercred->peer_pid, pid); + swap(peercred->peer_cred, cred); + + pidfs_put_pid(pid); + put_pid(pid); + put_cred(cred); +} + +static inline void init_peercred(struct sock *sk, + const struct af_unix_peercred *peercred) { - sk->sk_peer_pid =3D get_pid(task_tgid(current)); - sk->sk_peer_cred =3D get_current_cred(); + sk->sk_peer_pid =3D peercred->peer_pid; + sk->sk_peer_cred =3D peercred->peer_cred; } =20 -static void update_peercred(struct sock *sk) +static void 
update_peercred(struct sock *sk, struct af_unix_peercred *peer= cred) { const struct cred *old_cred; struct pid *old_pid; @@ -748,11 +792,11 @@ static void update_peercred(struct sock *sk) spin_lock(&sk->sk_peer_lock); old_pid =3D sk->sk_peer_pid; old_cred =3D sk->sk_peer_cred; - init_peercred(sk); + init_peercred(sk, peercred); spin_unlock(&sk->sk_peer_lock); =20 - put_pid(old_pid); - put_cred(old_cred); + peercred->peer_pid =3D old_pid; + peercred->peer_cred =3D old_cred; } =20 static void copy_peercred(struct sock *sk, struct sock *peersk) @@ -761,6 +805,7 @@ static void copy_peercred(struct sock *sk, struct sock = *peersk) =20 spin_lock(&sk->sk_peer_lock); sk->sk_peer_pid =3D get_pid(peersk->sk_peer_pid); + pidfs_get_pid(sk->sk_peer_pid); sk->sk_peer_cred =3D get_cred(peersk->sk_peer_cred); spin_unlock(&sk->sk_peer_lock); } @@ -770,6 +815,7 @@ static int unix_listen(struct socket *sock, int backlog) int err; struct sock *sk =3D sock->sk; struct unix_sock *u =3D unix_sk(sk); + struct af_unix_peercred peercred =3D {}; =20 err =3D -EOPNOTSUPP; if (sock->type !=3D SOCK_STREAM && sock->type !=3D SOCK_SEQPACKET) @@ -777,6 +823,9 @@ static int unix_listen(struct socket *sock, int backlog) err =3D -EINVAL; if (!READ_ONCE(u->addr)) goto out; /* No listens on an unbound socket */ + err =3D prepare_peercred(&peercred); + if (err) + goto out; unix_state_lock(sk); if (sk->sk_state !=3D TCP_CLOSE && sk->sk_state !=3D TCP_LISTEN) goto out_unlock; @@ -786,11 +835,12 @@ static int unix_listen(struct socket *sock, int backl= og) WRITE_ONCE(sk->sk_state, TCP_LISTEN); =20 /* set credentials so connect can copy them */ - update_peercred(sk); + update_peercred(sk, &peercred); err =3D 0; =20 out_unlock: unix_state_unlock(sk); + drop_peercred(&peercred); out: return err; } @@ -1525,6 +1575,7 @@ static int unix_stream_connect(struct socket *sock, s= truct sockaddr *uaddr, struct sockaddr_un *sunaddr =3D (struct sockaddr_un *)uaddr; struct sock *sk =3D sock->sk, *newsk =3D NULL, *other 
=3D NULL; struct unix_sock *u =3D unix_sk(sk), *newu, *otheru; + struct af_unix_peercred peercred =3D {}; struct net *net =3D sock_net(sk); struct sk_buff *skb =3D NULL; unsigned char state; @@ -1561,6 +1612,10 @@ static int unix_stream_connect(struct socket *sock, = struct sockaddr *uaddr, goto out; } =20 + err =3D prepare_peercred(&peercred); + if (err) + goto out_free_sk; + /* Allocate skb for sending to listening sock */ skb =3D sock_wmalloc(newsk, 1, 0, GFP_KERNEL); if (!skb) { @@ -1636,7 +1691,7 @@ static int unix_stream_connect(struct socket *sock, s= truct sockaddr *uaddr, unix_peer(newsk) =3D sk; newsk->sk_state =3D TCP_ESTABLISHED; newsk->sk_type =3D sk->sk_type; - init_peercred(newsk); + init_peercred(newsk, &peercred); newu =3D unix_sk(newsk); newu->listener =3D other; RCU_INIT_POINTER(newsk->sk_wq, &newu->peer_wq); @@ -1695,20 +1750,33 @@ static int unix_stream_connect(struct socket *sock,= struct sockaddr *uaddr, out_free_sk: unix_release_sock(newsk, 0); out: + drop_peercred(&peercred); return err; } =20 static int unix_socketpair(struct socket *socka, struct socket *sockb) { + struct af_unix_peercred ska_peercred =3D {}, skb_peercred =3D {}; struct sock *ska =3D socka->sk, *skb =3D sockb->sk; + int err; + + err =3D prepare_peercred(&ska_peercred); + if (err) + return err; + + err =3D prepare_peercred(&skb_peercred); + if (err) { + drop_peercred(&ska_peercred); + return err; + } =20 /* Join our sockets back to back */ sock_hold(ska); sock_hold(skb); unix_peer(ska) =3D skb; unix_peer(skb) =3D ska; - init_peercred(ska); - init_peercred(skb); + init_peercred(ska, &ska_peercred); + init_peercred(skb, &skb_peercred); =20 ska->sk_state =3D TCP_ESTABLISHED; skb->sk_state =3D TCP_ESTABLISHED; --=20 2.47.2 From nobody Sat Feb 7 22:54:48 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by 
smtp.subspace.kernel.org 
(Postfix) with ESMTPS id A608325DD1E; Thu, 24 Apr 2025 12:25:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745497526; cv=none; b=aY1puDEfGZFssFyhvvZmfFQ1Ju5GYeK/IlYveNt8MgJ8ZYgwoQCzlHynhIG/QeLBX2YiCcjB9Olhuwh16IDUgR1vfUrHLpasl1ZHKL7m2fgm4ih6z8xKlQ2kE+0KJlHTAdcQMhIerenNtxLPlJ1e9xll32HoAEkYJY2HMf177Tw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745497526; c=relaxed/simple; bh=q0Zul5GmEGj0WjtZQ7Mk0Q5/9DR1bnkoJs7kA778bl4=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=iq3ka3BjYfb9OAj7FrT1S8UuRREtosgylx+ZJOUbNxXqwNkROCeV9Xvsgwz6vGyvGxByiyQaOwHLu2Ts81ayE2p0TBjTevpCppPqLqnLDQhZs8D44vg6FgSEgtD46Bu7SvAHPneDQA93+sm4Va3SwnC5Ystipe/FCjOJYUNYNyQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=skjFl12L; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="skjFl12L" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 75DE5C4CEE8; Thu, 24 Apr 2025 12:25:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1745497526; bh=q0Zul5GmEGj0WjtZQ7Mk0Q5/9DR1bnkoJs7kA778bl4=; h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=skjFl12LA4JxPQ9blgcChu0HgzagItI7BnpsuFND+zYE/jXupVQDYSp31KwyItZoI X2JnKrYMtnwsw/SlcezWDi2aNWNven8bldIy7MmPwpU5MoqPgoUyKq7fFi7LrdIhnB EOAiLkTVET6QHqhQgvwksPcbFodxLarKg5eCHI2s7jSfVbIDGpzE6x4KA4UxSw0GyL kKQhM2Hs7k23ZJy1cYGIdgPIbuptTPKDg2hlmqCBAH3ra7SnbsdiXULxR7WFDMny6g 3trd+w9t7oOhbc1Rlyoh1Dk46hChCvvRyAcy9DBYCUktexiBo+asAfteTmIk2a1+rk JKEEP/P0G75dQ== From: Christian Brauner Date: Thu, 24 Apr 2025 14:24:36 +0200 Subject: [PATCH RFC 3/4] pidfs: get rid of __pidfd_prepare() Precedence: bulk X-Mailing-List: 
linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250424-work-pidfs-net-v1-3-0dc97227d854@kernel.org> References: <20250424-work-pidfs-net-v1-0-0dc97227d854@kernel.org> In-Reply-To: <20250424-work-pidfs-net-v1-0-0dc97227d854@kernel.org> To: Oleg Nesterov , Kuniyuki Iwashima , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org, David Rheinsberg , Jan Kara , Alexander Mikhalitsyn , Luca Boccassi , Lennart Poettering , Daan De Meyer , Mike Yuan , Christian Brauner X-Mailer: b4 0.15-dev-c25d1 X-Developer-Signature: v=1; a=openpgp-sha256; l=6898; i=brauner@kernel.org; h=from:subject:message-id; bh=q0Zul5GmEGj0WjtZQ7Mk0Q5/9DR1bnkoJs7kA778bl4=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWRw6S69s0Nw6wHGzAv+OssFa01v8Ri5Lz4i3aJxXs8h+ YZto/KPjlIWBjEuBlkxRRaHdpNwueU8FZuNMjVg5rAygQxh4OIUgIlEH2VkmHiGedINp8VSIbn/ e7mf8KeFbZqZpi/cVfT2ZodqxpJvuYwMF1sWMfc8YDLSY/I6Myk7/a03r3XgXmumidNXpN/6/IK TEwA= X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624 Fold it into pidfd_prepare() and rename PIDFD_CLONE to PIDFD_STALE to indicate that the passed pid might not have task linkage and no explicit check for that should be performed. 
Signed-off-by: Christian Brauner --- fs/pidfs.c | 12 +++---- include/uapi/linux/pidfd.h | 2 +- kernel/fork.c | 78 ++++++++++++++----------------------------= ---- 3 files changed, 31 insertions(+), 61 deletions(-) diff --git a/fs/pidfs.c b/fs/pidfs.c index 8e6c11774c60..3199ec02aaec 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -768,7 +768,7 @@ static inline bool pidfs_pid_valid(struct pid *pid, con= st struct path *path, { enum pid_type type; =20 - if (flags & PIDFD_CLONE) + if (flags & PIDFD_STALE) return true; =20 /* @@ -777,7 +777,7 @@ static inline bool pidfs_pid_valid(struct pid *pid, con= st struct path *path, * pidfd has been allocated perform another check that the pid * is still alive. If it is exit information is available even * if the task gets reaped before the pidfd is returned to - * userspace. The only exception is PIDFD_CLONE where no task + * userspace. The only exception is PIDFD_STALE where no task * linkage has been established for @pid yet and the kernel is * in the middle of process creation so there's nothing for * pidfs to miss. @@ -874,11 +874,11 @@ struct file *pidfs_alloc_file(struct pid *pid, unsign= ed int flags) int ret; =20 /* - * Ensure that PIDFD_CLONE can be passed as a flag without + * Ensure that PIDFD_STALE can be passed as a flag without * overloading other uapi pidfd flags. */ - BUILD_BUG_ON(PIDFD_CLONE =3D=3D PIDFD_THREAD); - BUILD_BUG_ON(PIDFD_CLONE =3D=3D PIDFD_NONBLOCK); + BUILD_BUG_ON(PIDFD_STALE =3D=3D PIDFD_THREAD); + BUILD_BUG_ON(PIDFD_STALE =3D=3D PIDFD_NONBLOCK); =20 ret =3D path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path); if (ret < 0) @@ -887,7 +887,7 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned= int flags) if (!pidfs_pid_valid(pid, &path, flags)) return ERR_PTR(-ESRCH); =20 - flags &=3D ~PIDFD_CLONE; + flags &=3D ~PIDFD_STALE; pidfd_file =3D dentry_open(&path, flags, current_cred()); /* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. 
*/ if (!IS_ERR(pidfd_file)) diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h index 2970ef44655a..8c1511edd0e9 100644 --- a/include/uapi/linux/pidfd.h +++ b/include/uapi/linux/pidfd.h @@ -12,7 +12,7 @@ #define PIDFD_THREAD O_EXCL #ifdef __KERNEL__ #include -#define PIDFD_CLONE CLONE_PIDFD +#define PIDFD_STALE CLONE_PIDFD #endif =20 /* Flags for pidfd_send_signal(). */ diff --git a/kernel/fork.c b/kernel/fork.c index f7403e1fb0d4..365687e1698f 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2035,50 +2035,6 @@ static inline void rcu_copy_process(struct task_stru= ct *p) #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */ } =20 -/** - * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd - * @pid: the struct pid for which to create a pidfd - * @flags: flags of the new @pidfd - * @ret: Where to return the file for the pidfd. - * - * Allocate a new file that stashes @pid and reserve a new pidfd number in= the - * caller's file descriptor table. The pidfd is reserved but not installed= yet. - * - * The helper doesn't perform checks on @pid which makes it useful for pid= fds - * created via CLONE_PIDFD where @pid has no task attached when the pidfd = and - * pidfd file are prepared. - * - * If this function returns successfully the caller is responsible to eith= er - * call fd_install() passing the returned pidfd and pidfd file as argument= s in - * order to install the pidfd into its file descriptor table or they must = use - * put_unused_fd() and fput() on the returned pidfd and pidfd file - * respectively. - * - * This function is useful when a pidfd must already be reserved but there - * might still be points of failure afterwards and the caller wants to ens= ure - * that no pidfd is leaked into its file descriptor table. - * - * Return: On success, a reserved pidfd is returned from the function and = a new - * pidfd file is returned in the last argument to the function. 
On - * error, a negative error code is returned from the function and = the - * last argument remains unchanged. - */ -static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct fil= e **ret) -{ - struct file *pidfd_file; - - CLASS(get_unused_fd, pidfd)(O_CLOEXEC); - if (pidfd < 0) - return pidfd; - - pidfd_file =3D pidfs_alloc_file(pid, flags | O_RDWR); - if (IS_ERR(pidfd_file)) - return PTR_ERR(pidfd_file); - - *ret =3D pidfd_file; - return take_fd(pidfd); -} - /** * pidfd_prepare - allocate a new pidfd_file and reserve a pidfd * @pid: the struct pid for which to create a pidfd @@ -2108,14 +2064,19 @@ static int __pidfd_prepare(struct pid *pid, unsigne= d int flags, struct file **re */ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) { - /* - * While holding the pidfd waitqueue lock removing the task - * linkage for the thread-group leader pid (PIDTYPE_TGID) isn't - * possible. Thus, if there's still task linkage for PIDTYPE_PID - * not having thread-group leader linkage for the pid means it - * wasn't a thread-group leader in the first place. - */ - scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) { + struct file *pidfd_file; + + if (!(flags & PIDFD_STALE)) { + /* + * While holding the pidfd waitqueue lock removing the + * task linkage for the thread-group leader pid + * (PIDTYPE_TGID) isn't possible. Thus, if there's still + * task linkage for PIDTYPE_PID not having thread-group + * leader linkage for the pid means it wasn't a + * thread-group leader in the first place. + */ + guard(spinlock_irq)(&pid->wait_pidfd.lock); + /* Task has already been reaped. 
*/ if (!pid_has_task(pid, PIDTYPE_PID)) return -ESRCH; @@ -2128,7 +2089,16 @@ int pidfd_prepare(struct pid *pid, unsigned int flag= s, struct file **ret) return -ENOENT; } =20 - return __pidfd_prepare(pid, flags, ret); + CLASS(get_unused_fd, pidfd)(O_CLOEXEC); + if (pidfd < 0) + return pidfd; + + pidfd_file =3D pidfs_alloc_file(pid, flags | O_RDWR); + if (IS_ERR(pidfd_file)) + return PTR_ERR(pidfd_file); + + *ret =3D pidfd_file; + return take_fd(pidfd); } =20 static void __delayed_free_task(struct rcu_head *rhp) @@ -2477,7 +2447,7 @@ __latent_entropy struct task_struct *copy_process( * Note that no task has been attached to @pid yet indicate * that via CLONE_PIDFD. */ - retval =3D __pidfd_prepare(pid, flags | PIDFD_CLONE, &pidfile); + retval =3D pidfd_prepare(pid, flags | PIDFD_STALE, &pidfile); if (retval < 0) goto bad_fork_free_pid; pidfd =3D retval; --=20 2.47.2 From nobody Sat Feb 7 22:54:48 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C8D5025E46F; Thu, 24 Apr 2025 12:25:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745497530; cv=none; b=ZYmvhvVv4egXe88wElLYy6Ydy4jLNfXG4NV0eaig0N4zvnwwToypyhZApmWcQD4vrC/Fod5yUdVL8ISMCELw4txDiTBSg5Oo7+96Fxg1HTAa6daTiHSANszxgfqr7xTGJRGeEktsAL/Pz0AYGzsv8aAlXcDd9svbt60OrPA78Sc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745497530; c=relaxed/simple; bh=HVAKMMSrEdpK2171Gtf4MhYESy3uUkDfAySuo0BZDy8=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=Gqe5eh83wdH4ruXKil0Hzby0Xd2rLztJ1nFiJM5eOPL3eCNcBHodjveq0xhn4LDC2aziUTaTGQN/oCtaEoUGiH5ta22Tlg+zVBQfkmrnTeyYpaohJ7m9rB+eQhZs0IdP8zgoJnGu1hNeA8sRrKFLE5wtcp3uDHe2BsiQKtbZL5o= 
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=LI5ab5lb; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="LI5ab5lb" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 9FD24C4CEE3; Thu, 24 Apr 2025 12:25:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1745497530; bh=HVAKMMSrEdpK2171Gtf4MhYESy3uUkDfAySuo0BZDy8=; h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=LI5ab5lb/vb/jaIx1/MQ7TKpN1Bxbly6RRQy56ugtFdzutM1G6lJrXxQXACBlkugJ +m5u7qsea3wKeLpps3Z8PI2qWsojwpskOINaUb0c/O2k+qw0jh4b4oKHzMQ5DONzKE Il1Gpsh0RmB65iMMdkfDJMYqdRKXWnt1JMinlmumgNRE8KKBPvn8zNiglrDV5xmDux y1WAdrnJQPM0OL9paRZ+3uwOInvsZTeweMtunt1/C8SDnzQxQ8sgtvPGUD4b7DJOMu Lplr7vexD/XZClw11TLJ7aydHcCxye3vUhfrCZFhzG7/En4h8jZFaAfrhulOMe0QZa HVqyDnoeokYxw== From: Christian Brauner Date: Thu, 24 Apr 2025 14:24:37 +0200 Subject: [PATCH RFC 4/4] net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250424-work-pidfs-net-v1-4-0dc97227d854@kernel.org> References: <20250424-work-pidfs-net-v1-0-0dc97227d854@kernel.org> In-Reply-To: <20250424-work-pidfs-net-v1-0-0dc97227d854@kernel.org> To: Oleg Nesterov , Kuniyuki Iwashima , "David S. 
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org, David Rheinsberg , Jan Kara , Alexander Mikhalitsyn , Luca Boccassi , Lennart Poettering , Daan De Meyer , Mike Yuan , Christian Brauner X-Mailer: b4 0.15-dev-c25d1 X-Developer-Signature: v=1; a=openpgp-sha256; l=829; i=brauner@kernel.org; h=from:subject:message-id; bh=HVAKMMSrEdpK2171Gtf4MhYESy3uUkDfAySuo0BZDy8=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWRw6S61OyXCk+U8V/oAy85P9/jTds39eWNSkPOL5IPf3 u7/pHRMpqOUhUGMi0FWTJHFod0kXG45T8Vmo0wNmDmsTCBDGLg4BWAiH1cx/NNtPfGexzFjp8CF H5c2JkidUPDuUivP/XW87l/LtnsF13sYGd6kbpmx/FyMeuVDm/dNQhdFVqUePitZl/w+fU34wl8 LdnMAAA== X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624 Now that all preconditions are met, allow handing out pidfds for reaped sk->sk_peer_pids. Signed-off-by: Christian Brauner --- net/core/sock.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/core/sock.c b/net/core/sock.c index b969d2210656..017b02b69e8b 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -148,6 +148,8 @@ =20 #include =20 +#include + #include "dev.h" =20 static DEFINE_MUTEX(proto_list_mutex); @@ -1891,7 +1893,7 @@ int sk_getsockopt(struct sock *sk, int level, int opt= name, if (!peer_pid) return -ENODATA; =20 - pidfd =3D pidfd_prepare(peer_pid, 0, &pidfd_file); + pidfd =3D pidfd_prepare(peer_pid, PIDFD_STALE, &pidfd_file); put_pid(peer_pid); if (pidfd < 0) { /* --=20 2.47.2
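
For readers who want to try the series from userspace, the following is a minimal sketch of how a caller might fetch the peer pidfd of a connected AF_UNIX socket. The helper name peer_pidfd() is made up for this example, and the fallback value 77 for SO_PEERPIDFD is an assumption taken from the asm-generic socket header (some architectures use a different number), only used when the installed libc headers lack the definition.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_PEERPIDFD
/* Assumption: asm-generic value; a few architectures differ. */
#define SO_PEERPIDFD 77
#endif

/*
 * Hypothetical helper: return a pidfd for the peer of the connected
 * AF_UNIX socket @sock, or a negative errno value on failure.
 */
static int peer_pidfd(int sock)
{
	int pidfd = -1;
	socklen_t len = sizeof(pidfd);

	if (getsockopt(sock, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) < 0)
		return -errno;

	/* The caller owns the returned fd and must close() it. */
	return pidfd;
}
```

Before this series a caller has to expect -EINVAL when the peer's thread-group leader was already reaped. With the series applied the same call is expected to return a pidfd in that case, and the exit status can then be queried with the PIDFD_GET_INFO ioctl and the PIDFD_INFO_EXIT mask bit, as in the selftest quoted in patch 2.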