Once the response is set for all outstanding requests,
wake_up_all() the ->access_waitq waiters so that they
can finish their user-wait. Otherwise fsnotify_destroy_group()
can wait forever for ->user_waits to reach 0 (which it
never will).
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
fs/notify/fanotify/fanotify_user.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 87f861e9004f..95a3b843cbbf 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1046,8 +1046,8 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 	}
 	spin_unlock(&group->notification_lock);

-	/* Response for all permission events it set, wakeup waiters */
-	wake_up(&group->fanotify_data.access_waitq);
+	/* Response for all permission events is set, wakeup waiters */
+	wake_up_all(&group->fanotify_data.access_waitq);

 	/* matches the fanotify_init->fsnotify_alloc_group */
 	fsnotify_destroy_group(group);
--
2.49.0.1101.gccaa498523-goog
On Tue 20-05-25 21:35:12, Sergey Senozhatsky wrote:
> Once the response is set for all outstanding requests,
> wake_up_all() the ->access_waitq waiters so that they
> can finish their user-wait. Otherwise fsnotify_destroy_group()
> can wait forever for ->user_waits to reach 0 (which it
> never will).
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>

We don't use exclusive waits with access_waitq so wake_up() and
wake_up_all() should do the same thing?

								Honza

> ---
>  fs/notify/fanotify/fanotify_user.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 87f861e9004f..95a3b843cbbf 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1046,8 +1046,8 @@ static int fanotify_release(struct inode *ignored, struct file *file)
>  	}
>  	spin_unlock(&group->notification_lock);
>
> -	/* Response for all permission events it set, wakeup waiters */
> -	wake_up(&group->fanotify_data.access_waitq);
> +	/* Response for all permission events is set, wakeup waiters */
> +	wake_up_all(&group->fanotify_data.access_waitq);
>
>  	/* matches the fanotify_init->fsnotify_alloc_group */
>  	fsnotify_destroy_group(group);
> --
> 2.49.0.1101.gccaa498523-goog
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On (25/05/21 12:18), Jan Kara wrote:
> On Tue 20-05-25 21:35:12, Sergey Senozhatsky wrote:
> > Once the response is set for all outstanding requests,
> > wake_up_all() the ->access_waitq waiters so that they
> > can finish their user-wait. Otherwise fsnotify_destroy_group()
> > can wait forever for ->user_waits to reach 0 (which it
> > never will).
> >
> > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
>
> We don't use exclusive waits with access_waitq so wake_up() and
> wake_up_all() should do the same thing?

Oh, non-exclusive waiters, I see. I totally missed that, thanks.

So... the problem is somewhere else then. I'm currently looking
at some crashes (across all LTS kernels) where the group owner just
gets stuck and then the hung-task watchdog kicks in and panics the
system. Basically just a single backtrace in the kernel logs:

 schedule+0x534/0x2540
 fsnotify_destroy_group+0xa7/0x150
 fanotify_release+0x147/0x160
 ____fput+0xe4/0x2a0
 task_work_run+0x71/0xb0
 do_exit+0x1ea/0x800
 do_group_exit+0x81/0x90
 get_signal+0x32d/0x4e0

My assumption was that it's this wait:
 wait_event(group->notification_waitq, !atomic_read(&group->user_waits));

But I guess I was wrong.
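
For context, the wait named in the backtrace above lives in
fsnotify_destroy_group() (fs/notify/group.c). Below is a trimmed,
paraphrased sketch of the shutdown ordering; it is not a verbatim
quote, and details drift between kernel versions:

/*
 * Paraphrased sketch of fsnotify_destroy_group(), trimmed to the
 * ordering relevant to this thread.
 */
void fsnotify_destroy_group(struct fsnotify_group *group)
{
	/* Set group->shutdown so event handlers stop queueing. */
	fsnotify_group_stop_queueing(group);

	/* Detach all marks from their objects, queue them for freeing. */
	fsnotify_clear_marks_by_group(group, FSNOTIFY_OBJ_TYPE_ANY);

	/*
	 * The wait in the backtrace: every handler that sleeps for a
	 * userspace response first bumps ->user_waits in
	 * fsnotify_prepare_user_wait(), and destruction must wait for
	 * each of them to drop that reference. A single leaked
	 * increment keeps this wait_event() asleep forever.
	 */
	wait_event(group->notification_waitq,
		   !atomic_read(&group->user_waits));

	/* ... flush the notification queue and drop the group ... */
}
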
On Fri 23-05-25 16:18:19, Sergey Senozhatsky wrote:
> On (25/05/21 12:18), Jan Kara wrote:
> > On Tue 20-05-25 21:35:12, Sergey Senozhatsky wrote:
> > > Once the response is set for all outstanding requests,
> > > wake_up_all() the ->access_waitq waiters so that they
> > > can finish their user-wait. Otherwise fsnotify_destroy_group()
> > > can wait forever for ->user_waits to reach 0 (which it
> > > never will).
> > >
> > > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> >
> > We don't use exclusive waits with access_waitq so wake_up() and
> > wake_up_all() should do the same thing?
>
> Oh, non-exclusive waiters, I see. I totally missed that, thanks.
>
> So... the problem is somewhere else then. I'm currently looking
> at some crashes (across all LTS kernels) where the group owner just
> gets stuck and then the hung-task watchdog kicks in and panics the
> system. Basically just a single backtrace in the kernel logs:
>
>  schedule+0x534/0x2540
>  fsnotify_destroy_group+0xa7/0x150
>  fanotify_release+0x147/0x160
>  ____fput+0xe4/0x2a0
>  task_work_run+0x71/0xb0
>  do_exit+0x1ea/0x800
>  do_group_exit+0x81/0x90
>  get_signal+0x32d/0x4e0
>
> My assumption was that it's this wait:
>  wait_event(group->notification_waitq, !atomic_read(&group->user_waits));

Well, you're likely correct we are sleeping in this wait. But likely
there's some process that's indeed waiting for a response to a fanotify
event from userspace. Do you have a reproducer? Can you dump all blocked
tasks when this happens?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On (25/05/26 14:52), Jan Kara wrote:
> > > We don't use exclusive waits with access_waitq so wake_up() and
> > > wake_up_all() should do the same thing?
> >
> > Oh, non-exclusive waiters, I see. I totally missed that, thanks.
> >
> > So... the problem is somewhere else then. I'm currently looking
> > at some crashes (across all LTS kernels) where group owner just
> > gets stuck and then hung-task watchdog kicks in and panics the
> > system. Basically just a single backtrace in the kernel logs:
> >
> > schedule+0x534/0x2540
> > fsnotify_destroy_group+0xa7/0x150
> > fanotify_release+0x147/0x160
> > ____fput+0xe4/0x2a0
> > task_work_run+0x71/0xb0
> > do_exit+0x1ea/0x800
> > do_group_exit+0x81/0x90
> > get_signal+0x32d/0x4e0
> >
> > My assumption was that it's this wait:
> > wait_event(group->notification_waitq, !atomic_read(&group->user_waits));
>
> Well, you're likely correct we are sleeping in this wait. But likely
> there's some process that's indeed waiting for response to fanotify event
> from userspace. Do you have a reproducer? Can you dump all blocked tasks
> when this happens?
Unfortunately, no. This happens on consumer devices, which are
not available for any sort of debugging, due to various privacy
protection reasons. We only get anonymized kernel ramoops/dmesg
on crashes.
So my only option is to add something to the kernel, then roll out
the patched kernel to the fleet and wait for new crash reports. The
problem is, all that I can think of sort of fixes the crash as far as
the hung-task watchdog is concerned. Let me think more about it.
Another silly question: what decrements group->user_waits in case of
that race-condition?
---
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 9dac7f6e72d2b..38b977fe37a71 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -945,8 +945,10 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
 	if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS)) {
 		fsid = fanotify_get_fsid(iter_info);
 		/* Racing with mark destruction or creation? */
-		if (!fsid.val[0] && !fsid.val[1])
-			return 0;
+		if (!fsid.val[0] && !fsid.val[1]) {
+			ret = 0;
+			goto finish;
+		}
 	}

 	event = fanotify_alloc_event(group, mask, data, data_type, dir,
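
The point of the goto above: in kernels of this vintage,
fanotify_handle_event() ends with a finish label that pairs
fsnotify_prepare_user_wait() with fsnotify_finish_user_wait(), which is
what drops the ->user_waits reference the prepare step took. Roughly
(paraphrased from fs/notify/fanotify/fanotify.c, not verbatim), the
tail of the function looks like:

finish:
	if (fanotify_is_perm_event(mask))
		fsnotify_finish_user_wait(iter_info);

	return ret;

so an early "return 0" after a successful prepare would skip the
decrement and leave group->user_waits elevated.
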
On (25/05/26 23:12), Sergey Senozhatsky wrote:
[..]
> > > schedule+0x534/0x2540
> > > fsnotify_destroy_group+0xa7/0x150
> > > fanotify_release+0x147/0x160
> > > ____fput+0xe4/0x2a0
> > > task_work_run+0x71/0xb0
> > > do_exit+0x1ea/0x800
> > > do_group_exit+0x81/0x90
> > > get_signal+0x32d/0x4e0
[..]
> @@ -945,8 +945,10 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>  	if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS)) {
>  		fsid = fanotify_get_fsid(iter_info);
>  		/* Racing with mark destruction or creation? */
> -		if (!fsid.val[0] && !fsid.val[1])
> -			return 0;
> +		if (!fsid.val[0] && !fsid.val[1]) {
> +			ret = 0;
> +			goto finish;
> +		}
>  	}
Surprisingly enough, this did not help.
Jan, one more silly question:
fsnotify_get_mark_safe() and fsnotify_put_mark_wake() can be called on
NULL mark. Is it possible that between fsnotify_prepare_user_wait(iter_info)
and fsnotify_finish_user_wait(iter_info) iter_info->marks[type] changes in
such a way that creates an imbalance? That is, fsnotify_finish_user_wait() sees
more NULL marks and hence does not roll back all the group->user_waits
increments that fsnotify_prepare_user_wait() did?
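
For readers following along, here is the pair in question, paraphrased
from fs/notify/mark.c of that era (a sketch; the exact shape varies by
kernel version). Both sides tolerate a NULL mark slot, which is what
the question above is probing:

static bool fsnotify_get_mark_safe(struct fsnotify_mark *mark)
{
	if (!mark)
		return true;		/* NULL slot: nothing to pin */

	if (refcount_inc_not_zero(&mark->refcnt)) {
		atomic_inc(&mark->group->user_waits);
		return true;
	}
	return false;			/* mark already being destroyed */
}

static void fsnotify_put_mark_wake(struct fsnotify_mark *mark)
{
	if (mark) {
		struct fsnotify_group *group = mark->group;

		fsnotify_put_mark(mark);
		/* Group destruction sleeps until ->user_waits drops to 0. */
		if (atomic_dec_and_test(&group->user_waits) && group->shutdown)
			wake_up(&group->notification_waitq);
	}
}

Since iter_info lives on the caller's stack and its marks[] slots are
not touched between prepare and finish, each slot is either NULL on
both sides or pinned on both sides, so increments and decrements
balance.
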
On (25/06/20 13:53), Sergey Senozhatsky wrote:
> On (25/05/26 23:12), Sergey Senozhatsky wrote:
[..]
> Surprisingly enough, this did not help.
>
> Jan, one more silly question:
>
> fsnotify_get_mark_safe() and fsnotify_put_mark_wake() can be called on
> NULL mark. Is it possible that between fsnotify_prepare_user_wait(iter_info)
> and fsnotify_finish_user_wait(iter_info) iter_info->marks[type] changes in
> such a way that creates an imbalance? That is, fsnotify_finish_user_wait() sees
> more NULL marks and hence does not roll back all the group->user_waits
> increments that fsnotify_prepare_user_wait() did?

No, that doesn't seem to be possible. Sorry for the noise.

Another silly idea was: fsnotify_put_mark_wake() is called in a loop
and it tests group->shutdown locklessly, as far as I can tell, so maybe
there is a speculative load and we use a stale/"cached" group->shutdown
value w/o ever waking up ->notification_waitq. Am running out of ideas.
On Fri 20-06-25 14:48:47, Sergey Senozhatsky wrote:
> On (25/06/20 13:53), Sergey Senozhatsky wrote:
> > On (25/05/26 23:12), Sergey Senozhatsky wrote:
> [..]
> > Surprisingly enough, this did not help.
> >
> > Jan, one more silly question:
> >
> > fsnotify_get_mark_safe() and fsnotify_put_mark_wake() can be called on
> > NULL mark. Is it possible that between fsnotify_prepare_user_wait(iter_info)
> > and fsnotify_finish_user_wait(iter_info) iter_info->marks[type] changes in
> > such a way that creates an imbalance? That is, fsnotify_finish_user_wait() sees
> > more NULL marks and hence does not roll back all the group->user_waits
> > increments that fsnotify_prepare_user_wait() did?
>
> No, that doesn't seem to be possible. Sorry for the noise.

Yeah, iter_info is local and should not change outside of the call itself.

> Another silly idea was: fsnotify_put_mark_wake() is called in a loop
> and it tests group->shutdown locklessly, as far as I can tell, so maybe
> there is a speculative load and we use a stale/"cached" group->shutdown
> value w/o ever waking up ->notification_waitq. Am running out of ideas.

Well, but atomic_dec_and_test() in fsnotify_put_mark_wake() should be a
full memory barrier so such reordering should not be possible? At least
not on the CPU; you can check the disassembly of fsnotify_put_mark_wake()
on your kernel to see whether the fetch of group->shutdown indeed happens
after the atomic_dec_and_test() (but it should, because && is a
sequencing point and thus a compiler barrier).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On (25/06/23 12:52), Jan Kara wrote:
> > My another silly idea was, fsnotify_put_mark_wake() is called in a loop
> > and it tests group->shutdown locklessly, as far as I can tell, so maybe
> > there is a speculative load and we use stale/"cached" group->shutdown
> > value w/o ever waking up ->notification_waitq. Am running out of ideas.
>
> Well, but atomic_dec_and_test() in fsnotify_put_mark_wake() should be a
> full memory barrier so such reordering should not be possible?
You are right, as always. Generated code looks fine:
...
61f: f0 41 ff 4e 6c lock decl 0x6c(%r14)
624: 75 1f jne 645 <fsnotify_finish_user_wait+0x55>
626: 41 80 7e 44 01 cmpb $0x1,0x44(%r14)
62b: 75 18 jne 645 <fsnotify_finish_user_wait+0x55>
...
->shutdown fetch is always done after atomic-dec.
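
To spell out the pairing being verified here (both lines are already
quoted earlier in the thread):

	/* Waker, fsnotify_put_mark_wake(): the dec is a full barrier. */
	if (atomic_dec_and_test(&group->user_waits) && group->shutdown)
		wake_up(&group->notification_waitq);

	/* Waiter, fsnotify_destroy_group(). */
	wait_event(group->notification_waitq,
		   !atomic_read(&group->user_waits));

wait_event() puts the task on the waitqueue before re-checking the
condition, and the waker's decrement is ordered before its wake_up(),
so a wakeup cannot be lost between the final decrement and the
waiter's last check of ->user_waits.
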
On Mon 26-05-25 23:12:20, Sergey Senozhatsky wrote:
> On (25/05/26 14:52), Jan Kara wrote:
> > > > We don't use exclusive waits with access_waitq so wake_up() and
> > > > wake_up_all() should do the same thing?
> > >
> > > Oh, non-exclusive waiters, I see. I totally missed that, thanks.
> > >
> > > So... the problem is somewhere else then. I'm currently looking
> > > at some crashes (across all LTS kernels) where group owner just
> > > gets stuck and then hung-task watchdog kicks in and panics the
> > > system. Basically just a single backtrace in the kernel logs:
> > >
> > > schedule+0x534/0x2540
> > > fsnotify_destroy_group+0xa7/0x150
> > > fanotify_release+0x147/0x160
> > > ____fput+0xe4/0x2a0
> > > task_work_run+0x71/0xb0
> > > do_exit+0x1ea/0x800
> > > do_group_exit+0x81/0x90
> > > get_signal+0x32d/0x4e0
> > >
> > > My assumption was that it's this wait:
> > > wait_event(group->notification_waitq, !atomic_read(&group->user_waits));
> >
> > Well, you're likely correct we are sleeping in this wait. But likely
> > there's some process that's indeed waiting for response to fanotify event
> > from userspace. Do you have a reproducer? Can you dump all blocked tasks
> > when this happens?
>
> Unfortunately, no. This happens on consumer devices, which are
> not available for any sort of debugging, due to various privacy
> protection reasons. We only get anonymized kernel ramoops/dmesg
> on crashes.
>
> So my only option is to add something to the kernel, then roll out
> the patched kernel to the fleet and wait for new crash reports. The
> problem is, all that I can think of sort of fixes the crash as far as
> the hung-task watchdog is concerned. Let me think more about it.
>
> Another silly question: what decrements group->user_waits in case of
> that race-condition?
>
> ---
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 9dac7f6e72d2b..38b977fe37a71 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -945,8 +945,10 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>  	if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS)) {
>  		fsid = fanotify_get_fsid(iter_info);
>  		/* Racing with mark destruction or creation? */
> -		if (!fsid.val[0] && !fsid.val[1])
> -			return 0;
> +		if (!fsid.val[0] && !fsid.val[1]) {
> +			ret = 0;
> +			goto finish;
> +		}
>  	}
This code is not present in the current upstream kernel. This seems to have
been inadvertently fixed by commit 30ad1938326b ("fanotify: allow "weak" fsid
when watching a single filesystem") which you likely don't have in your
kernel.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On (25/05/26 18:47), Jan Kara wrote:
> > Another silly question: what decrements group->user_waits in case of
> > that race-condition?
> >
> > ---
> >
> > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > index 9dac7f6e72d2b..38b977fe37a71 100644
> > --- a/fs/notify/fanotify/fanotify.c
> > +++ b/fs/notify/fanotify/fanotify.c
> > @@ -945,8 +945,10 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> > if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS)) {
> > fsid = fanotify_get_fsid(iter_info);
> > /* Racing with mark destruction or creation? */
> > - if (!fsid.val[0] && !fsid.val[1])
> > - return 0;
> > + if (!fsid.val[0] && !fsid.val[1]) {
> > + ret = 0;
> > + goto finish;
> > + }
> > }
>
> This code is not present in the current upstream kernel. This seems to have
> been inadvertently fixed by commit 30ad1938326b ("fanotify: allow "weak" fsid
> when watching a single filesystem") which you likely don't have in your
> kernel.
Oh, sweet. Thanks for the pointers, Jan.