Hi Jan, Amir, Al,

Here's my v4 patch series that aims to eliminate soft lockups when updating
dentry flags in fsnotify. I've incorporated Jan's suggestion of simply
allowing the flag to be lazily cleared in the fsnotify_parent() function,
via Amir's patch. This allowed me to drop patch #2 from my previous series
(fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem). I
replaced it with "fsnotify: require inode lock held during child flag
update", patch #5 in this series. I also added "dnotify: move
fsnotify_recalc_mask() outside spinlock" to address the sleep-during-atomic
issues with dnotify.

Jan expressed concerns about lock ordering of the inode rwsem with the
fsnotify group mutex. I built this with lockdep enabled (see below for the
lock debugging .config section -- I'm not too familiar with lockdep so I
wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
I could find in LTP, with no lockdep splats to be found. I don't know that
this can completely satisfy the concerns about lock ordering: I'm reading
through the code to better understand the concern about "the removal of
oneshot mark during modify event generation". But I'm encouraged by the
LTP+lockdep results.

I also went ahead and did my negative dentry oriented testing. Of course
the fsnotify_parent() issue is fully resolved, and when I tested several
processes all using inotifywait on the same directory full of negative
dentries, I was able to use ftrace to confirm that
fsnotify_update_children_dentry_flags() was called exactly once for all
processes. No softlockups occurred!

I originally wrote this series to make the last patch (#5) optional: if for
some reason we didn't think it was necessary to hold the inode rwsem, then
we could omit it -- the main penalty being the race condition described in
the patch description. I tested without the last patch and LTP passed also
with lockdep enabled, but of course when multiple tasks did an inotifywait
on the same directory (with many negative dentries) only the first waited
for the flag updates, the rest of the tasks immediately returned despite
the flags not being ready.

I agree with Amir that as long as the lock ordering is fine, we should keep
patch #5. And if that's the case, I can reorder the series a bit to make it
a bit more logical, and eliminate logic in
fsnotify_update_children_dentry_flags() for handling d_move/cursor races,
which I promptly delete later in the series.

1. fsnotify: clear PARENT_WATCHED flags lazily
2. fsnotify: Use d_find_any_alias to get dentry associated with inode
3. dnotify: move fsnotify_recalc_mask() outside spinlock
4. fsnotify: require inode lock held during child flag update
5. fsnotify: allow sleepable child flag update

Thanks for continuing to read this series, I hope we're making progress
toward a simpler way to fix these scaling issues!

Stephen

Amir Goldstein (1):
  fsnotify: clear PARENT_WATCHED flags lazily

Stephen Brennan (4):
  fsnotify: Use d_find_any_alias to get dentry associated with inode
  dnotify: move fsnotify_recalc_mask() outside spinlock
  fsnotify: allow sleepable child flag update
  fsnotify: require inode lock held during child flag update

 fs/notify/dnotify/dnotify.c      |  28 ++++++---
 fs/notify/fsnotify.c             | 101 ++++++++++++++++++++++---------
 fs/notify/fsnotify.h             |   3 +-
 fs/notify/mark.c                 |  40 +++++++++++-
 include/linux/fsnotify_backend.h |   8 ++-
 5 files changed, 136 insertions(+), 44 deletions(-)

--
2.34.1
Stephen Brennan <stephen.s.brennan@oracle.com> writes:
> Hi Jan, Amir, Al,
>
> Here's my v4 patch series that aims to eliminate soft lockups when updating
> dentry flags in fsnotify. I've incorporated Jan's suggestion of simply
> allowing the flag to be lazily cleared in the fsnotify_parent() function,
> via Amir's patch. This allowed me to drop patch #2 from my previous series
> (fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem). I
> replaced it with "fsnotify: require inode lock held during child flag
> update", patch #5 in this series. I also added "dnotify: move
> fsnotify_recalc_mask() outside spinlock" to address the sleep-during-atomic
> issues with dnotify.
>
> Jan expressed concerns about lock ordering of the inode rwsem with the
> fsnotify group mutex. I built this with lockdep enabled (see below for the
> lock debugging .config section -- I'm not too familiar with lockdep so I
> wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
> I could find in LTP, with no lockdep splats to be found. I don't know that
> this can completely satisfy the concerns about lock ordering: I'm reading
> through the code to better understand the concern about "the removal of
> oneshot mark during modify event generation". But I'm encouraged by the
> LTP+lockdep results.

Of course, I forgot to append the .config section:

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
# CONFIG_PROVE_RAW_LOCK_NESTING is not set
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
CONFIG_CSD_LOCK_WAIT_DEBUG=y
# end of Lock Debugging (spinlocks, mutexes, etc...)

>
> I also went ahead and did my negative dentry oriented testing. Of course
> the fsnotify_parent() issue is fully resolved, and when I tested several
> processes all using inotifywait on the same directory full of negative
> dentries, I was able to use ftrace to confirm that
> fsnotify_update_children_dentry_flags() was called exactly once for all
> processes. No softlockups occurred!
>
> I originally wrote this series to make the last patch (#5) optional: if for
> some reason we didn't think it was necessary to hold the inode rwsem, then
> we could omit it -- the main penalty being the race condition described in
> the patch description. I tested without the last patch and LTP passed also
> with lockdep enabled, but of course when multiple tasks did an inotifywait
> on the same directory (with many negative dentries) only the first waited
> for the flag updates, the rest of the tasks immediately returned despite
> the flags not being ready.
>
> I agree with Amir that as long as the lock ordering is fine, we should keep
> patch #5. And if that's the case, I can reorder the series a bit to make it
> a bit more logical, and eliminate logic in
> fsnotify_update_children_dentry_flags() for handling d_move/cursor races,
> which I promptly delete later in the series.
>
> 1. fsnotify: clear PARENT_WATCHED flags lazily
> 2. fsnotify: Use d_find_any_alias to get dentry associated with inode
> 3. dnotify: move fsnotify_recalc_mask() outside spinlock
> 4. fsnotify: require inode lock held during child flag update
> 5. fsnotify: allow sleepable child flag update
>
> Thanks for continuing to read this series, I hope we're making progress
> toward a simpler way to fix these scaling issues!
>
> Stephen
>
> Amir Goldstein (1):
>   fsnotify: clear PARENT_WATCHED flags lazily
>
> Stephen Brennan (4):
>   fsnotify: Use d_find_any_alias to get dentry associated with inode
>   dnotify: move fsnotify_recalc_mask() outside spinlock
>   fsnotify: allow sleepable child flag update
>   fsnotify: require inode lock held during child flag update
>
>  fs/notify/dnotify/dnotify.c      |  28 ++++++---
>  fs/notify/fsnotify.c             | 101 ++++++++++++++++++++++---------
>  fs/notify/fsnotify.h             |   3 +-
>  fs/notify/mark.c                 |  40 +++++++++++-
>  include/linux/fsnotify_backend.h |   8 ++-
>  5 files changed, 136 insertions(+), 44 deletions(-)
>
> --
> 2.34.1
Hi Stephen!
On Fri 11-11-22 14:06:09, Stephen Brennan wrote:
> Here's my v4 patch series that aims to eliminate soft lockups when updating
> dentry flags in fsnotify. I've incorporated Jan's suggestion of simply
> allowing the flag to be lazily cleared in the fsnotify_parent() function,
> via Amir's patch. This allowed me to drop patch #2 from my previous series
> (fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem). I
> replaced it with "fsnotify: require inode lock held during child flag
> update", patch #5 in this series. I also added "dnotify: move
> fsnotify_recalc_mask() outside spinlock" to address the sleep-during-atomic
> issues with dnotify.
Yes, the series is now much simpler. Thanks!
> Jan expressed concerns about lock ordering of the inode rwsem with the
> fsnotify group mutex. I built this with lockdep enabled (see below for the
> lock debugging .config section -- I'm not too familiar with lockdep so I
> wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
> I could find in LTP, with no lockdep splats to be found. I don't know that
> this can completely satisfy the concerns about lock ordering: I'm reading
> through the code to better understand the concern about "the removal of
> oneshot mark during modify event generation". But I'm encouraged by the
> LTP+lockdep results.
So I had a look and I think your patches could cause deadlock at least for
nfsd. The problem is with things like inotify IN_ONESHOT marks. They get
autodeleted as soon as they trigger. Thus e.g. fsnotify_mkdir() can trigger
an IN_ONESHOT mark and go on to remove it by calling fsnotify_destroy_mark()
from inotify_handle_inode_event(). And nfsd calls e.g. fsnotify_mkdir()
while holding dir->i_rwsem. So we have lock ordering like:

nfsd_mkdir()
  inode_lock(dir);
    ...
  __nfsd_mkdir(dir, ...)
    fsnotify_mkdir(dir, dentry);
      ...
      inotify_handle_inode_event()
        ...
        fsnotify_destroy_mark()
          fsnotify_group_lock(group)

So we have dir->i_rwsem > group->mark_mutex. But we also have callchains
like:

inotify_add_watch()
  inotify_update_watch()
    fsnotify_group_lock(group)
    inotify_update_existing_watch()
      ...
      fsnotify_recalc_mask()
        inode_lock(dir); -> added by your series

which creates ordering group->mark_mutex > dir->i_rwsem.
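
The inversion between the two call chains above can be modeled in a few lines of C. This is a hypothetical userspace sketch, not kernel code and not how lockdep is implemented: the enum values, `record_order()` and `replay_call_chains()` are made-up names that only stand in for dir->i_rwsem and group->mark_mutex, and the table tracks just pairwise "taken while holding" relations.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes standing in for the two locks discussed. */
enum lock_class { DIR_I_RWSEM, GROUP_MARK_MUTEX, NR_LOCK_CLASSES };

/* seen[a][b] is true once some path took class b while holding class a. */
static bool seen[NR_LOCK_CLASSES][NR_LOCK_CLASSES];

/*
 * Record that `inner` was acquired while `outer` was held.
 * Returns true if the opposite order is already on record, meaning the
 * two paths taken together can deadlock (ABBA). Passing the same class
 * twice models the recursive A->A case and also returns true.
 */
static bool record_order(enum lock_class outer, enum lock_class inner)
{
	assert(outer != NR_LOCK_CLASSES && inner != NR_LOCK_CLASSES);
	seen[outer][inner] = true;
	return seen[inner][outer];
}

/* Replays both call chains; returns the number of inversions found. */
static int replay_call_chains(void)
{
	int inversions = 0;

	/* nfsd_mkdir(): inode_lock(dir), later fsnotify_group_lock(group). */
	inversions += record_order(DIR_I_RWSEM, GROUP_MARK_MUTEX);

	/* inotify_add_watch(): fsnotify_group_lock(group), later inode_lock(dir). */
	inversions += record_order(GROUP_MARK_MUTEX, DIR_I_RWSEM);

	return inversions;
}
```

Neither path is wrong in isolation; the first `record_order()` call reports nothing, and only the second one, which sees the reverse pair already recorded, flags the problem. That may also explain why a test run that never exercises both paths would stay quiet.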
It is even worse with dnotify which (even with your patches) ends up
calling fsnotify_recalc_mask() from dnotify_handle_event() so we have a
possibility of direct A->A deadlock. But I'd leave dnotify aside, I think
that can be massaged to not need to call fsnotify_recalc_mask()
(__fsnotify_recalc_mask() would be enough there).
Still I'm not 100% sure about a proper way out of this. The simplicity of
alias->d_subdirs iteration with i_rwsem held is compelling. We could mandate
that fsnotify hooks cannot be called with inode->i_rwsem held (and fixup
nfsd) but IMO that is pushing the complexity from the fsnotify core into
its users which is undesirable. Maybe we could grab inode->i_rwsem in those
places adding / removing notification marks before we grab
group->mark_mutex, just verify (with lockdep) that fsnotify_recalc_mask()
has the inode->i_rwsem held and be done with it? That pushes a bit of
complexity into the fsnotify backends but it is not too bad.
fsnotify_recalc_mask() gets only called by dnotify, inotify, and fanotify.
Amir?
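
The proposal in the paragraph above (take inode->i_rwsem before group->mark_mutex in the paths that add or remove marks, and have fsnotify_recalc_mask() only verify that the inode lock is held) could look roughly like the following. This is a single-threaded userspace model under stated assumptions: every `model_*` name is hypothetical, the "locks" are plain flags rather than real mutexes, and returning false stands in for the lockdep complaint the kernel would raise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock; only tracks whether it is held. */
struct model_lock {
	bool held;
};

struct model_inode {
	struct model_lock i_rwsem;	/* models inode->i_rwsem */
	unsigned int mask;		/* models the inode's fsnotify mask */
};

struct model_group {
	struct model_lock mark_mutex;	/* models group->mark_mutex */
};

static void model_lock_acquire(struct model_lock *lock)
{
	assert(!lock->held);	/* taking it twice would be the A->A case */
	lock->held = true;
}

static void model_lock_release(struct model_lock *lock)
{
	assert(lock->held);
	lock->held = false;
}

/*
 * Models fsnotify_recalc_mask(): it takes no lock itself and only checks
 * that the caller already holds the inode lock. Returns false where the
 * kernel would warn.
 */
static bool model_recalc_mask(struct model_inode *inode, unsigned int new_mask)
{
	if (!inode->i_rwsem.held)
		return false;
	inode->mask = new_mask;
	return true;
}

/* Models a backend adding a watch: inode lock first, group mutex second. */
static bool model_add_watch(struct model_group *group,
			    struct model_inode *inode, unsigned int events)
{
	bool ok;

	model_lock_acquire(&inode->i_rwsem);
	model_lock_acquire(&group->mark_mutex);
	ok = model_recalc_mask(inode, inode->mask | events);
	model_lock_release(&group->mark_mutex);
	model_lock_release(&inode->i_rwsem);
	return ok;
}
```

With this shape the only ordering the backend creates is i_rwsem before mark_mutex, which matches the ordering nfsd already imposes through fsnotify_mkdir().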
> I originally wrote this series to make the last patch (#5) optional: if for
> some reason we didn't think it was necessary to hold the inode rwsem, then
> we could omit it -- the main penalty being the race condition described in
> the patch description. I tested without the last patch and LTP passed also
> with lockdep enabled, but of course when multiple tasks did an inotifywait
> on the same directory (with many negative dentries) only the first waited
> for the flag updates, the rest of the tasks immediately returned despite
> the flags not being ready.
>
> I agree with Amir that as long as the lock ordering is fine, we should keep
> patch #5. And if that's the case, I can reorder the series a bit to make it
> a bit more logical, and eliminate logic in
> fsnotify_update_children_dentry_flags() for handling d_move/cursor races,
> which I promptly delete later in the series.
>
> 1. fsnotify: clear PARENT_WATCHED flags lazily
> 2. fsnotify: Use d_find_any_alias to get dentry associated with inode
> 3. dnotify: move fsnotify_recalc_mask() outside spinlock
> 4. fsnotify: require inode lock held during child flag update
> 5. fsnotify: allow sleepable child flag update
>
> Thanks for continuing to read this series, I hope we're making progress
> toward a simpler way to fix these scaling issues!
Yeah, so I'd be for making sure i_rwsem is held where we need it first and
only after that add reschedule handling into
fsnotify_update_children_dentry_flags(). That makes the series more
logical.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Tue, Nov 22, 2022 at 1:50 PM Jan Kara <jack@suse.cz> wrote:
>
> Hi Stephen!
>
> On Fri 11-11-22 14:06:09, Stephen Brennan wrote:
> > Here's my v4 patch series that aims to eliminate soft lockups when updating
> > dentry flags in fsnotify. I've incorporated Jan's suggestion of simply
> > allowing the flag to be lazily cleared in the fsnotify_parent() function,
> > via Amir's patch. This allowed me to drop patch #2 from my previous series
> > (fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem). I
> > replaced it with "fsnotify: require inode lock held during child flag
> > update", patch #5 in this series. I also added "dnotify: move
> > fsnotify_recalc_mask() outside spinlock" to address the sleep-during-atomic
> > issues with dnotify.
>
> Yes, the series is now much simpler. Thanks!
>
> > Jan expressed concerns about lock ordering of the inode rwsem with the
> > fsnotify group mutex. I built this with lockdep enabled (see below for the
> > lock debugging .config section -- I'm not too familiar with lockdep so I
> > wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
> > I could find in LTP, with no lockdep splats to be found. I don't know that
> > this can completely satisfy the concerns about lock ordering: I'm reading
> > through the code to better understand the concern about "the removal of
> > oneshot mark during modify event generation". But I'm encouraged by the
> > LTP+lockdep results.
>
> So I had a look and I think your patches could cause deadlock at least for
> nfsd. The problem is with things like inotify IN_ONESHOT marks. They get
> autodeleted as soon as they trigger. Thus e.g. fsnotify_mkdir() can trigger
> an IN_ONESHOT mark and go on to remove it by calling fsnotify_destroy_mark()
> from inotify_handle_inode_event(). And nfsd calls e.g. fsnotify_mkdir()
> while holding dir->i_rwsem. So we have lock ordering like:
>
> nfsd_mkdir()
>   inode_lock(dir);
>     ...
>   __nfsd_mkdir(dir, ...)
>     fsnotify_mkdir(dir, dentry);
>       ...
>       inotify_handle_inode_event()
>         ...
>         fsnotify_destroy_mark()
>           fsnotify_group_lock(group)
>
> So we have dir->i_rwsem > group->mark_mutex. But we also have callchains
> like:
>
> inotify_add_watch()
>   inotify_update_watch()
>     fsnotify_group_lock(group)
>     inotify_update_existing_watch()
>       ...
>       fsnotify_recalc_mask()
>         inode_lock(dir); -> added by your series
>
> which creates ordering group->mark_mutex > dir->i_rwsem.
>
> It is even worse with dnotify which (even with your patches) ends up
> calling fsnotify_recalc_mask() from dnotify_handle_event() so we have a
> possibility of direct A->A deadlock. But I'd leave dnotify aside, I think
> that can be massaged to not need to call fsnotify_recalc_mask()
> (__fsnotify_recalc_mask() would be enough there).
>
> Still I'm not 100% sure about a proper way out of this. The simplicity of
> alias->d_subdirs iteration with i_rwsem held is compelling.

Agreed.

> We could mandate
> that fsnotify hooks cannot be called with inode->i_rwsem held (and fixup
> nfsd) but IMO that is pushing the complexity from the fsnotify core into
> its users which is undesirable.

I think inode in this context is the parent inode, so all fsnotify hooks
in namei.c are holding inode->i_rwsem by design.

> Maybe we could grab inode->i_rwsem in those
> places adding / removing notification marks before we grab
> group->mark_mutex, just verify (with lockdep) that fsnotify_recalc_mask()
> has the inode->i_rwsem held and be done with it? That pushes a bit of
> complexity into the fsnotify backends but it is not too bad.
> fsnotify_recalc_mask() gets only called by dnotify, inotify, and fanotify.
> Amir?
>

Absolutely agree - I think it makes sense and will simplify things a lot.
Obviously, we need to assert inode_is_locked() in fsnotify_recalc_mask()
only for (conn->type == FSNOTIFY_OBJ_TYPE_INODE).

Thanks,
Amir.