[PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting

Christian Brauner posted 2 patches 8 months, 1 week ago
kernel/exit.c |  5 +++++
kernel/fork.c | 31 +++++++++++++------------------
kernel/pid.c  |  5 -----
3 files changed, 18 insertions(+), 23 deletions(-)
[PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
Posted by Christian Brauner 8 months, 1 week ago
In a prior patch series we tried to cleanly differentiate between:

(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
actually references a struct pid that isn't used as a thread-group
leader.

as this was causing issues for non-threaded workloads.

But there's cases where the current simple logic is wrong. Specifically,
if the pid was a leader pid and the check races with __unhash_process().
Stabilize this by using the pidfd waitqueue lock.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (2):
      exit: move wake_up_all() pidfd waiters into __unhash_process()
      pidfs: ensure consistent ENOENT/ESRCH reporting

 kernel/exit.c |  5 +++++
 kernel/fork.c | 31 +++++++++++++------------------
 kernel/pid.c  |  5 -----
 3 files changed, 18 insertions(+), 23 deletions(-)
---
base-commit: 1e940fff94374d04b6c34f896ed9fbad3d2fb706
change-id: 20250411-work-pidfs-enoent-747579160c8c
Re: [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
Posted by Nathan Chancellor 8 months ago
Hi Christian,

On Fri, Apr 11, 2025 at 03:22:43PM +0200, Christian Brauner wrote:
> In a prior patch series we tried to cleanly differentiate between:
> 
> (1) The task has already been reaped.
> (2) The caller requested a pidfd for a thread-group leader but the pid
> actually references a struct pid that isn't used as a thread-group
> leader.
> 
> as this was causing issues for non-threaded workloads.
> 
> But there's cases where the current simple logic is wrong. Specifically,
> if the pid was a leader pid and the check races with __unhash_process().
> Stabilize this by using the pidfd waitqueue lock.

After the recent work in vfs-6.16.pidfs (I tested at
a9d7de0f68b79e5e481967fc605698915a37ac13), I am seeing issues with using
'machinectl shell' to connect to a systemd-nspawn container on one of my
machines running Fedora 41 (the container is using Rawhide).

  $ machinectl shell -q nathan@$DEV_IMG $SHELL -l
  Failed to get shell PTY: Connection timed out

My initial bisect attempt landed on the merge of the first series
(1e940fff9437), which does not make much sense because 4fc3f73c16d was
allegedly good in my test, but I did not investigate that too hard since
I have lost enough time on this as it is heh. It never reproduces at
6.15-rc1 and it consistently reproduces at a9d7de0f68b so I figured I
would report it here since you mention this series is a fix for the
first one. If there is any other information I can provide or patches I
can test (either as fixes or for debugging), I am more than happy to do
so.

Cheers,
Nathan
Re: [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
Posted by Christian Brauner 8 months ago
On Tue, Apr 15, 2025 at 03:34:54PM -0700, Nathan Chancellor wrote:
> Hi Christian,
> 
> On Fri, Apr 11, 2025 at 03:22:43PM +0200, Christian Brauner wrote:
> > In a prior patch series we tried to cleanly differentiate between:
> > 
> > (1) The task has already been reaped.
> > (2) The caller requested a pidfd for a thread-group leader but the pid
> > actually references a struct pid that isn't used as a thread-group
> > leader.
> > 
> > as this was causing issues for non-threaded workloads.
> > 
> > But there's cases where the current simple logic is wrong. Specifically,
> > if the pid was a leader pid and the check races with __unhash_process().
> > Stabilize this by using the pidfd waitqueue lock.
> 
> After the recent work in vfs-6.16.pidfs (I tested at
> a9d7de0f68b79e5e481967fc605698915a37ac13), I am seeing issues with using
> 'machinectl shell' to connect to a systemd-nspawn container on one of my
> machines running Fedora 41 (the container is using Rawhide).
> 
>   $ machinectl shell -q nathan@$DEV_IMG $SHELL -l
>   Failed to get shell PTY: Connection timed out
> 
> My initial bisect attempt landed on the merge of the first series
> (1e940fff9437), which does not make much sense because 4fc3f73c16d was
> allegedly good in my test, but I did not investigate that too hard since
> I have lost enough time on this as it is heh. It never reproduces at
> 6.15-rc1 and it consistently reproduces at a9d7de0f68b so I figured I
> would report it here since you mention this series is a fix for the
> first one. If there is any other information I can provide or patches I
> can test (either as fixes or for debugging), I am more than happy to do
> so.

Does the following patch make a difference for you?:

diff --git a/kernel/fork.c b/kernel/fork.c
index f7403e1fb0d4..dd30f7e09917 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2118,7 +2118,7 @@ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
        scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
                /* Task has already been reaped. */
                if (!pid_has_task(pid, PIDTYPE_PID))
-                       return -ESRCH;
+                       return -EINVAL;
                /*
                 * If this struct pid isn't used as a thread-group
                 * leader but the caller requested to create a

If it did it would be weird if the first merge is indeed marked as good.
What if you used a non-rawhide version of systemd? Because this might
also be a regression on their side.
Re: [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
Posted by Christian Brauner 8 months ago
On Wed, Apr 16, 2025 at 03:55:48PM +0200, Christian Brauner wrote:
> On Tue, Apr 15, 2025 at 03:34:54PM -0700, Nathan Chancellor wrote:
> > Hi Christian,
> > 
> > On Fri, Apr 11, 2025 at 03:22:43PM +0200, Christian Brauner wrote:
> > > In a prior patch series we tried to cleanly differentiate between:
> > > 
> > > (1) The task has already been reaped.
> > > (2) The caller requested a pidfd for a thread-group leader but the pid
> > > actually references a struct pid that isn't used as a thread-group
> > > leader.
> > > 
> > > as this was causing issues for non-threaded workloads.
> > > 
> > > But there's cases where the current simple logic is wrong. Specifically,
> > > if the pid was a leader pid and the check races with __unhash_process().
> > > Stabilize this by using the pidfd waitqueue lock.
> > 
> > After the recent work in vfs-6.16.pidfs (I tested at
> > a9d7de0f68b79e5e481967fc605698915a37ac13), I am seeing issues with using
> > 'machinectl shell' to connect to a systemd-nspawn container on one of my
> > machines running Fedora 41 (the container is using Rawhide).
> > 
> >   $ machinectl shell -q nathan@$DEV_IMG $SHELL -l
> >   Failed to get shell PTY: Connection timed out
> > 
> > My initial bisect attempt landed on the merge of the first series
> > (1e940fff9437), which does not make much sense because 4fc3f73c16d was
> > allegedly good in my test, but I did not investigate that too hard since
> > I have lost enough time on this as it is heh. It never reproduces at
> > 6.15-rc1 and it consistently reproduces at a9d7de0f68b so I figured I
> > would report it here since you mention this series is a fix for the
> > first one. If there is any other information I can provide or patches I
> > can test (either as fixes or for debugging), I am more than happy to do
> > so.

I can't reproduce this issue at all with vfs-6.16.pidfs unfortunately.

> 
> Does the following patch make a difference for you?:
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f7403e1fb0d4..dd30f7e09917 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2118,7 +2118,7 @@ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>         scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
>                 /* Task has already been reaped. */
>                 if (!pid_has_task(pid, PIDTYPE_PID))
> -                       return -ESRCH;
> +                       return -EINVAL;
>                 /*
>                  * If this struct pid isn't used as a thread-group
>                  * leader but the caller requested to create a
> 
> If it did it would be weird if the first merge is indeed marked as good.
> What if you used a non-rawhide version of systemd? Because this might
> also be a regression on their side.
Re: [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
Posted by Christian Brauner 8 months ago
On Wed, Apr 16, 2025 at 09:47:34PM +0200, Christian Brauner wrote:
> On Wed, Apr 16, 2025 at 03:55:48PM +0200, Christian Brauner wrote:
> > On Tue, Apr 15, 2025 at 03:34:54PM -0700, Nathan Chancellor wrote:
> > > Hi Christian,
> > > 
> > > On Fri, Apr 11, 2025 at 03:22:43PM +0200, Christian Brauner wrote:
> > > > In a prior patch series we tried to cleanly differentiate between:
> > > > 
> > > > (1) The task has already been reaped.
> > > > (2) The caller requested a pidfd for a thread-group leader but the pid
> > > > actually references a struct pid that isn't used as a thread-group
> > > > leader.
> > > > 
> > > > as this was causing issues for non-threaded workloads.
> > > > 
> > > > But there's cases where the current simple logic is wrong. Specifically,
> > > > if the pid was a leader pid and the check races with __unhash_process().
> > > > Stabilize this by using the pidfd waitqueue lock.
> > > 
> > > After the recent work in vfs-6.16.pidfs (I tested at
> > > a9d7de0f68b79e5e481967fc605698915a37ac13), I am seeing issues with using
> > > 'machinectl shell' to connect to a systemd-nspawn container on one of my
> > > machines running Fedora 41 (the container is using Rawhide).
> > > 
> > >   $ machinectl shell -q nathan@$DEV_IMG $SHELL -l
> > >   Failed to get shell PTY: Connection timed out
> > > 
> > > My initial bisect attempt landed on the merge of the first series
> > > (1e940fff9437), which does not make much sense because 4fc3f73c16d was
> > > allegedly good in my test, but I did not investigate that too hard since
> > > I have lost enough time on this as it is heh. It never reproduces at
> > > 6.15-rc1 and it consistently reproduces at a9d7de0f68b so I figured I
> > > would report it here since you mention this series is a fix for the
> > > first one. If there is any other information I can provide or patches I
> > > can test (either as fixes or for debugging), I am more than happy to do
> > > so.
> 
> I can't reproduce this issue at all with vfs-6.16.pidfs unfortunately.
> 
> > 
> > Does the following patch make a difference for you?:
> > 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index f7403e1fb0d4..dd30f7e09917 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -2118,7 +2118,7 @@ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> >         scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
> >                 /* Task has already been reaped. */
> >                 if (!pid_has_task(pid, PIDTYPE_PID))
> > -                       return -ESRCH;
> > +                       return -EINVAL;
> >                 /*
> >                  * If this struct pid isn't used as a thread-group
> >                  * leader but the caller requested to create a
> > 
> > If it did it would be weird if the first merge is indeed marked as good.
> > What if you used a non-rawhide version of systemd? Because this might
> > also be a regression on their side.

Ok, I think I understand how this happens. dbus-broker assumes that
only EINVAL is reported by SO_PEERPIDFD when a process is reaped:

https://github.com/bus1/dbus-broker/blob/5d34d91b138fc802a016aa68c093eb81ea31139c/src/util/sockopt.c#L241

It would be great if it would also allow for ESRCH which is the correct
error code anyway. So hopefully we'll get that fixed in userspace. For
now we paper over this in the kernel in the SO_PEERPIDFD code.

Can you please test vfs-6.16.pidfs again. It has the patch I'm
appending.
Re: [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
Posted by Nathan Chancellor 8 months ago
On Wed, Apr 16, 2025 at 10:21:02PM +0200, Christian Brauner wrote:
> Ok, I think I understand how this happens. dbus-broker assumes that
> only EINVAL is reported by SO_PEERPIDFD when a process is reaped:
> 
> https://github.com/bus1/dbus-broker/blob/5d34d91b138fc802a016aa68c093eb81ea31139c/src/util/sockopt.c#L241
> 
> It would be great if it would also allow for ESRCH which is the correct
> error code anyway. So hopefully we'll get that fixed in userspace. For
> now we paper over this in the kernel in the SO_PEERPIDFD code.
> 
> Can you please test vfs-6.16.pidfs again. It has the patch I'm
> appending.

Yes, that appears to work fine for me, thanks for the quick fix!

If it would be helpful:

Tested-by: Nathan Chancellor <nathan@kernel.org>

Cheers,
Nathan