Once the response is set for all outstanding requests,
wake_up_all() the ->access_waitq waiters so that they
can finish their user-wait. Otherwise fsnotify_destroy_group()
can wait forever for ->user_waits to reach 0 (which it
never will).
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
fs/notify/fanotify/fanotify_user.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 87f861e9004f..95a3b843cbbf 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1046,8 +1046,8 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 	}
 	spin_unlock(&group->notification_lock);

-	/* Response for all permission events it set, wakeup waiters */
-	wake_up(&group->fanotify_data.access_waitq);
+	/* Response for all permission events is set, wakeup waiters */
+	wake_up_all(&group->fanotify_data.access_waitq);

 	/* matches the fanotify_init->fsnotify_alloc_group */
 	fsnotify_destroy_group(group);
--
2.49.0.1101.gccaa498523-goog
On Tue 20-05-25 21:35:12, Sergey Senozhatsky wrote:
> Once the response is set for all outstanding requests,
> wake_up_all() the ->access_waitq waiters so that they
> can finish their user-wait. Otherwise fsnotify_destroy_group()
> can wait forever for ->user_waits to reach 0 (which it
> never will).
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>

We don't use exclusive waits with access_waitq so wake_up() and
wake_up_all() should do the same thing?

								Honza

> ---
>  fs/notify/fanotify/fanotify_user.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 87f861e9004f..95a3b843cbbf 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1046,8 +1046,8 @@ static int fanotify_release(struct inode *ignored, struct file *file)
>  	}
>  	spin_unlock(&group->notification_lock);
>
> -	/* Response for all permission events it set, wakeup waiters */
> -	wake_up(&group->fanotify_data.access_waitq);
> +	/* Response for all permission events is set, wakeup waiters */
> +	wake_up_all(&group->fanotify_data.access_waitq);
>
>  	/* matches the fanotify_init->fsnotify_alloc_group */
>  	fsnotify_destroy_group(group);
> --
> 2.49.0.1101.gccaa498523-goog
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On (25/05/21 12:18), Jan Kara wrote:
> On Tue 20-05-25 21:35:12, Sergey Senozhatsky wrote:
> > Once the response is set for all outstanding requests,
> > wake_up_all() the ->access_waitq waiters so that they
> > can finish their user-wait. Otherwise fsnotify_destroy_group()
> > can wait forever for ->user_waits to reach 0 (which it
> > never will).
> >
> > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
>
> We don't use exclusive waits with access_waitq so wake_up() and
> wake_up_all() should do the same thing?

Oh, non-exclusive waiters, I see. I totally missed that, thanks.

So... the problem is somewhere else then. I'm currently looking
at some crashes (across all LTS kernels) where the group owner just
gets stuck and then the hung-task watchdog kicks in and panics the
system. Basically just a single backtrace in the kernel logs:

 schedule+0x534/0x2540
 fsnotify_destroy_group+0xa7/0x150
 fanotify_release+0x147/0x160
 ____fput+0xe4/0x2a0
 task_work_run+0x71/0xb0
 do_exit+0x1ea/0x800
 do_group_exit+0x81/0x90
 get_signal+0x32d/0x4e0

My assumption was that it's this wait:
 wait_event(group->notification_waitq, !atomic_read(&group->user_waits));

But I guess I was wrong.
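
For context, the wait named in the backtrace above lives in
fsnotify_destroy_group() (fs/notify/group.c). Below is a trimmed,
paraphrased sketch of the shutdown ordering; it is not a verbatim
quote, and details drift between kernel versions:

/*
 * Paraphrased sketch of fsnotify_destroy_group(), trimmed to the
 * ordering relevant to this thread.
 */
void fsnotify_destroy_group(struct fsnotify_group *group)
{
	/* Set group->shutdown so event handlers stop queueing. */
	fsnotify_group_stop_queueing(group);

	/* Detach all marks from their objects, queue them for freeing. */
	fsnotify_clear_marks_by_group(group, FSNOTIFY_OBJ_TYPE_ANY);

	/*
	 * The wait in the backtrace: every handler that sleeps for a
	 * userspace response first bumps ->user_waits in
	 * fsnotify_prepare_user_wait(), and destruction must wait for
	 * each of them to drop that reference. A single leaked
	 * increment keeps this wait_event() asleep forever.
	 */
	wait_event(group->notification_waitq,
		   !atomic_read(&group->user_waits));

	/* ... flush the notification queue and drop the group ... */
}
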
On Fri 23-05-25 16:18:19, Sergey Senozhatsky wrote:
> On (25/05/21 12:18), Jan Kara wrote:
> > On Tue 20-05-25 21:35:12, Sergey Senozhatsky wrote:
> > > Once the response is set for all outstanding requests,
> > > wake_up_all() the ->access_waitq waiters so that they
> > > can finish their user-wait. Otherwise fsnotify_destroy_group()
> > > can wait forever for ->user_waits to reach 0 (which it
> > > never will).
> > >
> > > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> >
> > We don't use exclusive waits with access_waitq so wake_up() and
> > wake_up_all() should do the same thing?
>
> Oh, non-exclusive waiters, I see. I totally missed that, thanks.
>
> So... the problem is somewhere else then. I'm currently looking
> at some crashes (across all LTS kernels) where the group owner just
> gets stuck and then the hung-task watchdog kicks in and panics the
> system. Basically just a single backtrace in the kernel logs:
>
>  schedule+0x534/0x2540
>  fsnotify_destroy_group+0xa7/0x150
>  fanotify_release+0x147/0x160
>  ____fput+0xe4/0x2a0
>  task_work_run+0x71/0xb0
>  do_exit+0x1ea/0x800
>  do_group_exit+0x81/0x90
>  get_signal+0x32d/0x4e0
>
> My assumption was that it's this wait:
>  wait_event(group->notification_waitq, !atomic_read(&group->user_waits));

Well, you're likely correct we are sleeping in this wait. But likely
there's some process that's indeed waiting for a response to a fanotify
event from userspace. Do you have a reproducer? Can you dump all blocked
tasks when this happens?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On (25/05/26 14:52), Jan Kara wrote:
> > > We don't use exclusive waits with access_waitq so wake_up() and
> > > wake_up_all() should do the same thing?
> >
> > Oh, non-exclusive waiters, I see. I totally missed that, thanks.
> >
> > So... the problem is somewhere else then. I'm currently looking
> > at some crashes (across all LTS kernels) where group owner just
> > gets stuck and then hung-task watchdog kicks in and panics the
> > system. Basically just a single backtrace in the kernel logs:
> >
> > schedule+0x534/0x2540
> > fsnotify_destroy_group+0xa7/0x150
> > fanotify_release+0x147/0x160
> > ____fput+0xe4/0x2a0
> > task_work_run+0x71/0xb0
> > do_exit+0x1ea/0x800
> > do_group_exit+0x81/0x90
> > get_signal+0x32d/0x4e0
> >
> > My assumption was that it's this wait:
> > wait_event(group->notification_waitq, !atomic_read(&group->user_waits));
>
> Well, you're likely correct we are sleeping in this wait. But likely
> there's some process that's indeed waiting for response to fanotify event
> from userspace. Do you have a reproducer? Can you dump all blocked tasks
> when this happens?
Unfortunately, no. This happens on consumer devices, which are
not available for any sort of debugging, due to various privacy
protection reasons. We only get anonymized kernel ramoops/dmesg
on crashes.
So my only option is to add something to the kernel, then roll out
the patched kernel to the fleet and wait for new crash reports. The
problem is, all that I can think of sort of fixes the crash as far as
the hung-task watchdog is concerned. Let me think more about it.
Another silly question: what decrements group->user_waits in case of
that race-condition?
---
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 9dac7f6e72d2b..38b977fe37a71 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -945,8 +945,10 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
 	if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS)) {
 		fsid = fanotify_get_fsid(iter_info);
 		/* Racing with mark destruction or creation? */
-		if (!fsid.val[0] && !fsid.val[1])
-			return 0;
+		if (!fsid.val[0] && !fsid.val[1]) {
+			ret = 0;
+			goto finish;
+		}
 	}

 	event = fanotify_alloc_event(group, mask, data, data_type, dir,
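
The point of the goto above: in kernels of this vintage,
fanotify_handle_event() ends with a finish label that pairs
fsnotify_prepare_user_wait() with fsnotify_finish_user_wait(), which is
what drops the ->user_waits reference the prepare step took. Roughly
(paraphrased from fs/notify/fanotify/fanotify.c, not verbatim), the
tail of the function looks like:

finish:
	if (fanotify_is_perm_event(mask))
		fsnotify_finish_user_wait(iter_info);

	return ret;

so an early "return 0" after a successful prepare would skip the
decrement and leave group->user_waits elevated.
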
On (25/05/26 23:12), Sergey Senozhatsky wrote:
[..]
> > > schedule+0x534/0x2540
> > > fsnotify_destroy_group+0xa7/0x150
> > > fanotify_release+0x147/0x160
> > > ____fput+0xe4/0x2a0
> > > task_work_run+0x71/0xb0
> > > do_exit+0x1ea/0x800
> > > do_group_exit+0x81/0x90
> > > get_signal+0x32d/0x4e0
[..]
> @@ -945,8 +945,10 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>  	if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS)) {
>  		fsid = fanotify_get_fsid(iter_info);
>  		/* Racing with mark destruction or creation? */
> -		if (!fsid.val[0] && !fsid.val[1])
> -			return 0;
> +		if (!fsid.val[0] && !fsid.val[1]) {
> +			ret = 0;
> +			goto finish;
> +		}
>  	}
Surprisingly enough, this did not help.
Jan, one more silly question:
fsnotify_get_mark_safe() and fsnotify_put_mark_wake() can be called on
NULL mark. Is it possible that between fsnotify_prepare_user_wait(iter_info)
and fsnotify_finish_user_wait(iter_info) iter_info->marks[type] changes in
such a way that creates an imbalance? That is, fsnotify_finish_user_wait() sees
more NULL marks and hence does not roll back all the group->user_waits
increments that fsnotify_prepare_user_wait() did?
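
For readers following along, here is the pair in question, paraphrased
from fs/notify/mark.c of that era (a sketch; the exact shape varies by
kernel version). Both sides tolerate a NULL mark slot, which is what
the question above is probing:

static bool fsnotify_get_mark_safe(struct fsnotify_mark *mark)
{
	if (!mark)
		return true;		/* NULL slot: nothing to pin */

	if (refcount_inc_not_zero(&mark->refcnt)) {
		atomic_inc(&mark->group->user_waits);
		return true;
	}
	return false;			/* mark already being destroyed */
}

static void fsnotify_put_mark_wake(struct fsnotify_mark *mark)
{
	if (mark) {
		struct fsnotify_group *group = mark->group;

		fsnotify_put_mark(mark);
		/* Group destruction sleeps until ->user_waits drops to 0. */
		if (atomic_dec_and_test(&group->user_waits) && group->shutdown)
			wake_up(&group->notification_waitq);
	}
}

Since iter_info lives on the caller's stack and its marks[] slots are
not touched between prepare and finish, each slot is either NULL on
both sides or pinned on both sides, so increments and decrements
balance.
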
On (25/06/20 13:53), Sergey Senozhatsky wrote:
> On (25/05/26 23:12), Sergey Senozhatsky wrote:
[..]
> Surprisingly enough, this did not help.
>
> Jan, one more silly question:
>
> fsnotify_get_mark_safe() and fsnotify_put_mark_wake() can be called on
> NULL mark. Is it possible that between fsnotify_prepare_user_wait(iter_info)
> and fsnotify_finish_user_wait(iter_info) iter_info->marks[type] changes in
> such a way that creates an imbalance? That is, fsnotify_finish_user_wait() sees
> more NULL marks and hence does not roll back all the group->user_waits
> increments that fsnotify_prepare_user_wait() did?

No, that doesn't seem to be possible. Sorry for the noise.

Another silly idea was: fsnotify_put_mark_wake() is called in a loop
and it tests group->shutdown locklessly, as far as I can tell, so maybe
there is a speculative load and we use a stale/"cached" group->shutdown
value w/o ever waking up ->notification_waitq. Am running out of ideas.
On Fri 20-06-25 14:48:47, Sergey Senozhatsky wrote:
> On (25/06/20 13:53), Sergey Senozhatsky wrote:
> > On (25/05/26 23:12), Sergey Senozhatsky wrote:
> [..]
> > Surprisingly enough, this did not help.
> >
> > Jan, one more silly question:
> >
> > fsnotify_get_mark_safe() and fsnotify_put_mark_wake() can be called on
> > NULL mark. Is it possible that between fsnotify_prepare_user_wait(iter_info)
> > and fsnotify_finish_user_wait(iter_info) iter_info->marks[type] changes in
> > such a way that creates an imbalance? That is, fsnotify_finish_user_wait() sees
> > more NULL marks and hence does not roll back all the group->user_waits
> > increments that fsnotify_prepare_user_wait() did?
>
> No, that doesn't seem to be possible. Sorry for the noise.

Yeah, iter_info is local and should not change outside of the call itself.

> Another silly idea was: fsnotify_put_mark_wake() is called in a loop
> and it tests group->shutdown locklessly, as far as I can tell, so maybe
> there is a speculative load and we use a stale/"cached" group->shutdown
> value w/o ever waking up ->notification_waitq. Am running out of ideas.

Well, but atomic_dec_and_test() in fsnotify_put_mark_wake() should be a
full memory barrier so such reordering should not be possible? At least
not on the CPU; you can check the disassembly of fsnotify_put_mark_wake()
on your kernel to see whether the fetch of group->shutdown indeed happens
after the atomic_dec_and_test() (but it should, because && is a
sequencing point and thus a compiler barrier).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On (25/06/23 12:52), Jan Kara wrote:
> > My another silly idea was, fsnotify_put_mark_wake() is called in a loop
> > and it tests group->shutdown locklessly, as far as I can tell, so maybe
> > there is a speculative load and we use stale/"cached" group->shutdown
> > value w/o ever waking up ->notification_waitq. Am running out of ideas.
>
> Well, but atomic_dec_and_test() in fsnotify_put_mark_wake() should be a
> full memory barrier so such reordering should not be possible?
You are right, as always. Generated code looks fine:
...
61f: f0 41 ff 4e 6c lock decl 0x6c(%r14)
624: 75 1f jne 645 <fsnotify_finish_user_wait+0x55>
626: 41 80 7e 44 01 cmpb $0x1,0x44(%r14)
62b: 75 18 jne 645 <fsnotify_finish_user_wait+0x55>
...
->shutdown fetch is always done after atomic-dec.
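
To spell out the pairing being verified here (both lines are already
quoted earlier in the thread):

	/* Waker, fsnotify_put_mark_wake(): the dec is a full barrier. */
	if (atomic_dec_and_test(&group->user_waits) && group->shutdown)
		wake_up(&group->notification_waitq);

	/* Waiter, fsnotify_destroy_group(). */
	wait_event(group->notification_waitq,
		   !atomic_read(&group->user_waits));

wait_event() puts the task on the waitqueue before re-checking the
condition, and the waker's decrement is ordered before its wake_up(),
so a wakeup cannot be lost between the final decrement and the
waiter's last check of ->user_waits.
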
On Mon 26-05-25 23:12:20, Sergey Senozhatsky wrote:
> On (25/05/26 14:52), Jan Kara wrote:
> > > > We don't use exclusive waits with access_waitq so wake_up() and
> > > > wake_up_all() should do the same thing?
> > >
> > > Oh, non-exclusive waiters, I see. I totally missed that, thanks.
> > >
> > > So... the problem is somewhere else then. I'm currently looking
> > > at some crashes (across all LTS kernels) where group owner just
> > > gets stuck and then hung-task watchdog kicks in and panics the
> > > system. Basically just a single backtrace in the kernel logs:
> > >
> > > schedule+0x534/0x2540
> > > fsnotify_destroy_group+0xa7/0x150
> > > fanotify_release+0x147/0x160
> > > ____fput+0xe4/0x2a0
> > > task_work_run+0x71/0xb0
> > > do_exit+0x1ea/0x800
> > > do_group_exit+0x81/0x90
> > > get_signal+0x32d/0x4e0
> > >
> > > My assumption was that it's this wait:
> > > wait_event(group->notification_waitq, !atomic_read(&group->user_waits));
> >
> > Well, you're likely correct we are sleeping in this wait. But likely
> > there's some process that's indeed waiting for response to fanotify event
> > from userspace. Do you have a reproducer? Can you dump all blocked tasks
> > when this happens?
>
> Unfortunately, no. This happens on consumer devices, which are
> not available for any sort of debugging, due to various privacy
> protection reasons. We only get anonymized kernel ramoops/dmesg
> on crashes.
>
> So my only option is to add something to the kernel, then roll out
> the patched kernel to the fleet and wait for new crash reports. The
> problem is, all that I can think of sort of fixes the crash as far as
> the hung-task watchdog is concerned. Let me think more about it.
>
> Another silly question: what decrements group->user_waits in case of
> that race-condition?
>
> ---
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 9dac7f6e72d2b..38b977fe37a71 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -945,8 +945,10 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>  	if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS)) {
>  		fsid = fanotify_get_fsid(iter_info);
>  		/* Racing with mark destruction or creation? */
> -		if (!fsid.val[0] && !fsid.val[1])
> -			return 0;
> +		if (!fsid.val[0] && !fsid.val[1]) {
> +			ret = 0;
> +			goto finish;
> +		}
>  	}
This code is not present in the current upstream kernel. This seems to have
been inadvertently fixed by commit 30ad1938326b ("fanotify: allow "weak" fsid
when watching a single filesystem") which you likely don't have in your
kernel.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On (25/05/26 18:47), Jan Kara wrote:
> > Another silly question: what decrements group->user_waits in case of
> > that race-condition?
> >
> > ---
> >
> > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > index 9dac7f6e72d2b..38b977fe37a71 100644
> > --- a/fs/notify/fanotify/fanotify.c
> > +++ b/fs/notify/fanotify/fanotify.c
> > @@ -945,8 +945,10 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> > if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS)) {
> > fsid = fanotify_get_fsid(iter_info);
> > /* Racing with mark destruction or creation? */
> > - if (!fsid.val[0] && !fsid.val[1])
> > - return 0;
> > + if (!fsid.val[0] && !fsid.val[1]) {
> > + ret = 0;
> > + goto finish;
> > + }
> > }
>
> This code is not present in the current upstream kernel. This seems to have
> been inadvertently fixed by commit 30ad1938326b ("fanotify: allow "weak" fsid
> when watching a single filesystem") which you likely don't have in your
> kernel.
Oh, sweet. Thanks for the pointers, Jan.