[PATCH v4 0/7]] VFS: Prepare to lift lookup out of exclusive lock for directory ops

Posted by NeilBrown 1 month, 2 weeks ago
Following are 7 VFS patches which modify or introduce APIs that will
allow modifying filesystems so that they will work with a proposed
change to move d_alloc_parallel() out from the parent i_rwsem lock.

If these can land in a non-rebasing tree, I can work with individual
filesystem maintainers to start using these APIs.

I haven't included d_alloc_noblock_return() as it is only needed for one
fs (ovl) and it is not yet clear that it is the best approach.

I also haven't included the change to d_alloc_name() as that is only
needed so that I can deprecate d_alloc() and there is no rush for that.

Patch 2/7 is exactly the patch Al proposed in the conversation for v3.
I have taken the liberty of adding a Signed-off-by from Al to match the
Co-developed-by.  I hope that was not inappropriate.

I have been testing this series over NFS mounts from XFS so patches 2
and 3 don't seem to be causing any problems.  The changes in 4/5/6/7
won't be tested by this, and some cannot be tested until filesystems
start using new interfaces.

Thanks,
NeilBrown


 [PATCH v4 1/7] VFS: fix various typos in documentation for
 [PATCH v4 2/7] VFS: use wait_var_event for waiting in
 [PATCH v4 3/7] VFS: enhance d_splice_alias() to handle in-lookup
 [PATCH v4 4/7] VFS: introduce d_alloc_noblock()
 [PATCH v4 5/7] VFS: add d_duplicate()
 [PATCH v4 6/7] VFS: Add LOOKUP_SHARED flag.
 [PATCH v4 7/7] VFS/xfs/ntfs: drop parent lock across
Re: [PATCH v4 0/7]] VFS: Prepare to lift lookup out of exclusive lock for directory ops
Posted by Jeff Layton 1 month, 2 weeks ago
On Thu, 2026-04-30 at 12:03 +1000, NeilBrown wrote:
> Following are 7 VFS patches which modify or introduce APIs that will
> allow modifying filesystems so that they will work with a proposed
> change to move d_alloc_parallel() out from the parent i_rwsem lock.
> 
> If these can land in a non-rebasing tree, I can work with individual
> filesystem maintainers to start using these APIs.
> 
> I haven't included d_alloc_noblock_return() as it is only needed for one
> fs (ovl) and it is not yet clear that it is the best approach.
> 
> I also haven't included the change to d_alloc_name() as that is only
> needed so that I can deprecate d_alloc() and there is no rush for that.
> 
> Patch 2/7 is exactly the patch Al proposed in the conversation for v3.
> I have taken the liberty of adding a Signed-off-by from Al to match the
> Co-developed-by.  I hope that was not inappropriate.
> 
> I have been testing this series over NFS mounts from XFS so patches 2
> and 3 don't seem to be causing any problems.  The changes in 4/5/6/7
> won't be tested by this, and some cannot be tested until filesystems
> start using new interfaces.
> 
> Thanks,
> NeilBrown
> 
> 
>  [PATCH v4 1/7] VFS: fix various typos in documentation for
>  [PATCH v4 2/7] VFS: use wait_var_event for waiting in
>  [PATCH v4 3/7] VFS: enhance d_splice_alias() to handle in-lookup
>  [PATCH v4 4/7] VFS: introduce d_alloc_noblock()
>  [PATCH v4 5/7] VFS: add d_duplicate()
>  [PATCH v4 6/7] VFS: Add LOOKUP_SHARED flag.
>  [PATCH v4 7/7] VFS/xfs/ntfs: drop parent lock across

I pointed Claude at the version of this in your tree and it spotted a
regression that I think looks legitimate:

  2. Lock imbalance on early return: The parent lock is dropped unconditionally before
  d_alloc_parallel()/d_alloc(), but three early return paths exit without reacquiring it:
    - IS_ERR(found) from d_alloc_parallel()
    - !d_in_lookup(found) from d_alloc_parallel()
    - !found from d_alloc()

  The callers (lookup_slow(), lookup_slow_killable()) unconditionally call inode_unlock_shared()
  after ->lookup() returns. If d_add_ci() returns without the lock held, the caller unlocks an
  unheld rwsem, corrupting its state.
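
The imbalance described above can be modelled in user space. The following
is only a minimal sketch: struct model_rwsem, buggy_callee() and
caller_corrupts() are hypothetical names, not kernel APIs, and the "rwsem"
is just a counter that flags the release of a lock nobody holds. It
assumes the shared-lock case only.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the parent's i_rwsem: counts shared holders. */
struct model_rwsem {
	int readers;	/* current number of shared holders */
	bool corrupted;	/* set when an unheld lock is released */
};

static void model_lock_shared(struct model_rwsem *sem)
{
	sem->readers++;
}

static void model_unlock_shared(struct model_rwsem *sem)
{
	if (sem->readers == 0) {
		/* Releasing a lock that is not held: the reported failure. */
		sem->corrupted = true;
		return;
	}
	sem->readers--;
}

/*
 * Models the reported shape of d_add_ci(): the lock is dropped before the
 * allocation, and the failure path returns without taking it again.
 */
static int buggy_callee(struct model_rwsem *sem, bool alloc_fails)
{
	model_unlock_shared(sem);
	if (alloc_fails)
		return -1;	/* early return: lock NOT re-taken */
	model_lock_shared(sem);
	return 0;
}

/*
 * Models a caller such as lookup_slow(): lock, call ->lookup(), then
 * unlock unconditionally.  Returns true if the model lock was corrupted.
 */
static bool caller_corrupts(bool alloc_fails)
{
	struct model_rwsem sem = { 0, false };

	model_lock_shared(&sem);
	(void)buggy_callee(&sem, alloc_fails);
	model_unlock_shared(&sem);
	return sem.corrupted;
}
```

In this model caller_corrupts(false) reports no corruption, while
caller_corrupts(true) does, because the caller's unconditional unlock runs
after the early return has already left the lock released.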

Cheers,
-- 
Jeff Layton <jlayton@kernel.org>
Re: [PATCH v4 0/7]] VFS: Prepare to lift lookup out of exclusive lock for directory ops
Posted by NeilBrown 1 month, 2 weeks ago
On Thu, 30 Apr 2026, Jeff Layton wrote:
> I pointed Claude at the version of this in your tree and it spotted a
> regression that I think looks legitimate:
> 
>   2. Lock imbalance on early return: The parent lock is dropped unconditionally before
>   d_alloc_parallel()/d_alloc(), but three early return paths exit without reacquiring it:
>     - IS_ERR(found) from d_alloc_parallel()
>     - !d_in_lookup(found) from d_alloc_parallel()
>     - !found from d_alloc()
>
>   The callers (lookup_slow(), lookup_slow_killable()) unconditionally call inode_unlock_shared()
>   after ->lookup() returns. If d_add_ci() returns without the lock held, the caller unlocks an
>   unheld rwsem, corrupting its state.

Thanks for that - yes that was careless.

The unlock/relock is only needed around d_alloc_parallel(), so I've put
it there, which makes the problem go away.

I've updated the GitHub repo.

New patch below.

Thanks,
NeilBrown

From: NeilBrown <neil@brown.name>
Subject: [PATCH] VFS/xfs/ntfs: drop parent lock across d_alloc_parallel() in
 d_add_ci()

A proposed change will invert the lock ordering between
d_alloc_parallel() and inode_lock() on the parent.
When that happens it will not be safe to call d_alloc_parallel() while
holding the parent lock - even shared.

We don't need to keep the parent lock held when d_add_ci() is run - the
VFS doesn't need it as dentry is exclusively held due to
DCACHE_PAR_LOOKUP and the filesystem has finished its work.

So drop and reclaim the lock (shared or exclusive as determined by
LOOKUP_SHARED) to avoid future deadlock.

Signed-off-by: NeilBrown <neil@brown.name>
---
 Documentation/filesystems/porting.rst |  7 +++++++
 fs/dcache.c                           | 21 +++++++++++++++++++--
 fs/ntfs/namei.c                       |  2 +-
 fs/xfs/xfs_iops.c                     |  2 +-
 include/linux/dcache.h                |  3 ++-
 5 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 5cc6ae19845c..146720fc9f6f 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1391,3 +1391,10 @@ either form of manual loop.
 **mandatory**
 
 d_alloc_parallel() no longer requires a waitqueue_head.
+
+---
+
+**mandatory**
+
+d_add_ci() must now be passed the flags argument that was given to ->lookup().
+
diff --git a/fs/dcache.c b/fs/dcache.c
index 1943607f7547..665ce74eaadc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2275,6 +2275,7 @@ EXPORT_SYMBOL(d_obtain_root);
  * @dentry: the negative dentry that was passed to the parent's lookup func
  * @inode:  the inode case-insensitive lookup has found
  * @name:   the case-exact name to be associated with the returned dentry
+ * @lookup_flags: flags passed to ->lookup
  *
  * This is to avoid filling the dcache with case-insensitive names to the
  * same inode, only the actual correct case is stored in the dcache for
@@ -2287,7 +2288,7 @@ EXPORT_SYMBOL(d_obtain_root);
  * the exact case, and return the spliced entry.
  */
 struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
-			struct qstr *name)
+			struct qstr *name, unsigned int lookup_flags)
 {
 	struct dentry *found, *res;
 
@@ -2301,7 +2302,23 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 		return found;
 	}
 	if (d_in_lookup(dentry)) {
+		/*
+		 * We are holding parent lock and so don't want to wait
+		 * for a d_in_lookup() dentry.  We can safely drop the
+		 * parent lock and reclaim it as we have exclusive
+		 * access to dentry as it is d_in_lookup() (so
+		 * ->d_parent is stable) and we are near the end of
+		 * ->lookup() and will shortly drop the lock anyway.
+		 */
+		if (lookup_flags & LOOKUP_SHARED)
+			inode_unlock_shared(d_inode(dentry->d_parent));
+		else
+			inode_unlock(d_inode(dentry->d_parent));
 		found = d_alloc_parallel(dentry->d_parent, name);
+		if (lookup_flags & LOOKUP_SHARED)
+			inode_lock_shared(d_inode(dentry->d_parent));
+		else
+			inode_lock_nested(d_inode(dentry->d_parent), I_MUTEX_PARENT);
 		if (IS_ERR(found) || !d_in_lookup(found)) {
 			iput(inode);
 			return found;
@@ -2311,7 +2328,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 		if (!found) {
 			iput(inode);
 			return ERR_PTR(-ENOMEM);
-		} 
+		}
 	}
 	res = d_splice_alias(inode, found);
 	if (res) {
diff --git a/fs/ntfs/namei.c b/fs/ntfs/namei.c
index 10894de519c3..e2f3430c2e6d 100644
--- a/fs/ntfs/namei.c
+++ b/fs/ntfs/namei.c
@@ -310,7 +310,7 @@ static struct dentry *ntfs_lookup(struct inode *dir_ino, struct dentry *dent,
 		}
 		nls_name.hash = full_name_hash(dent, nls_name.name, nls_name.len);
 
-		dent = d_add_ci(dent, dent_inode, &nls_name);
+		dent = d_add_ci(dent, dent_inode, &nls_name, flags);
 		kfree(nls_name.name);
 		return dent;
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 325c2200c501..db0beb3831a9 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -369,7 +369,7 @@ xfs_vn_ci_lookup(
 	/* else case-insensitive match... */
 	dname.name = ci_name.name;
 	dname.len = ci_name.len;
-	dentry = d_add_ci(dentry, VFS_I(ip), &dname);
+	dentry = d_add_ci(dentry, VFS_I(ip), &dname, flags);
 	kfree(ci_name.name);
 	return dentry;
 }
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index b4663a1a0636..9553bffbb098 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -263,7 +263,8 @@ struct dentry *d_duplicate(struct dentry *dentry);
 /* weird procfs mess; *NOT* exported */
 extern struct dentry * d_splice_alias_ops(struct inode *, struct dentry *,
 					  const struct dentry_operations *);
-extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
+extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *,
+				unsigned int);
 extern bool d_same_name(const struct dentry *dentry, const struct dentry *parent,
 			const struct qstr *name);
 extern struct dentry *d_find_any_alias(struct inode *inode);
-- 
2.50.0.107.gf914562f5916.dirty
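
The locking shape of the revised d_add_ci() can be illustrated with a
small user-space model. This is a sketch under stated assumptions, not the
kernel implementation: MODEL_LOOKUP_SHARED, struct model_rwsem and the
model_*() helpers are hypothetical names, the lock is a toy that only
records its holders, and the allocation is reduced to a success/failure
flag. It shows the two properties the patch relies on: the parent lock is
not held while the allocation runs, and it is held again, in the mode
selected by the flag, on every return path.

```c
#include <stdbool.h>

/* Hypothetical stand-in for LOOKUP_SHARED; the real value is not assumed. */
#define MODEL_LOOKUP_SHARED 0x1u

/* Toy model of the parent's i_rwsem: it only records who holds it. */
struct model_rwsem {
	int readers;	/* number of shared holders */
	bool writer;	/* true while held exclusively */
};

static void model_lock(struct model_rwsem *sem, unsigned int flags)
{
	if (flags & MODEL_LOOKUP_SHARED)
		sem->readers++;
	else
		sem->writer = true;
}

static void model_unlock(struct model_rwsem *sem, unsigned int flags)
{
	if (flags & MODEL_LOOKUP_SHARED)
		sem->readers--;
	else
		sem->writer = false;
}

static bool model_held(const struct model_rwsem *sem, unsigned int flags)
{
	return (flags & MODEL_LOOKUP_SHARED) ? sem->readers > 0 : sem->writer;
}

/*
 * Models the revised shape: unlock and relock sit immediately around the
 * allocation, so the error check happens with the lock already re-taken.
 */
static int model_add_ci(struct model_rwsem *sem, unsigned int flags,
			bool alloc_fails, bool *held_during_alloc)
{
	int err;

	model_unlock(sem, flags);
	*held_during_alloc = model_held(sem, flags);
	err = alloc_fails ? -1 : 0;	/* stand-in for d_alloc_parallel() */
	model_lock(sem, flags);

	return err;	/* success and failure both return with the lock held */
}

/* Models the caller: lock, call the callee, unlock, check the invariants. */
static bool model_roundtrip_ok(unsigned int flags, bool alloc_fails)
{
	struct model_rwsem sem = { 0, false };
	bool held_during_alloc = true;
	bool held_on_return;

	model_lock(&sem, flags);
	(void)model_add_ci(&sem, flags, alloc_fails, &held_during_alloc);
	held_on_return = model_held(&sem, flags);
	model_unlock(&sem, flags);

	return !held_during_alloc && held_on_return && !model_held(&sem, flags);
}
```

Calling model_roundtrip_ok() with either flag value, and with the
allocation succeeding or failing, returns true in this model: the caller's
unconditional unlock always releases a lock that is actually held.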