[PATCH] ceph: only d_add() negative dentries when they are unhashed
Posted by Max Kellermann 5 days, 21 hours ago
Ceph can call d_add(dentry, NULL) on a negative dentry that is already
present in the primary dcache hash.

In the current VFS that is not safe.  d_add() goes through __d_add()
to __d_rehash(), which unconditionally reinserts dentry->d_hash into
the hlist_bl bucket.  If the dentry is already hashed, reinserting the
same node can corrupt the bucket, including creating a self-loop.
Once that happens, __d_lookup() can spin forever in the hlist_bl walk,
typically looping only on the d_name.hash mismatch check and
eventually triggering RCU stall reports like this one:

 rcu: INFO: rcu_sched self-detected stall on CPU
 rcu:         87-....: (2100 ticks this GP) idle=3a4c/1/0x4000000000000000 softirq=25003319/25003319 fqs=829
 rcu:         (t=2101 jiffies g=79058445 q=698988 ncpus=192)
 CPU: 87 UID: 2952868916 PID: 3933303 Comm: php-cgi8.3 Not tainted 6.18.17-i1-amd #950 NONE
 Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.6 09/22/2023
 RIP: 0010:__d_lookup+0x46/0xb0
 Code: c1 e8 07 48 8d 04 c2 48 8b 00 49 89 fc 49 89 f5 48 89 c3 48 83 e3 fe 48 83 f8 01 77 0f eb 2d 0f 1f 44 00 00 48 8b 1b 48 85 db <74> 20 39 6b 18 75 f3 48 8d 7b 78 e8 ba 85 d0 00 4c 39 63 10 74 1f
 RSP: 0018:ff745a70c8253898 EFLAGS: 00000282
 RAX: ff26e470054cb208 RBX: ff26e470054cb208 RCX: 000000006e958966
 RDX: ff26e48267340000 RSI: ff745a70c82539b0 RDI: ff26e458f74655c0
 RBP: 000000006e958966 R08: 0000000000000180 R09: 9cd08d909b919a89
 R10: ff26e458f74655c0 R11: 0000000000000000 R12: ff26e458f74655c0
 R13: ff745a70c82539b0 R14: d0d0d0d0d0d0d0d0 R15: 2f2f2f2f2f2f2f2f
 FS:  00007f5770896980(0000) GS:ff26e482c5d88000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f5764de50c0 CR3: 000000a72abb5001 CR4: 0000000000771ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  lookup_fast+0x9f/0x100
  walk_component+0x1f/0x150
  link_path_walk+0x20e/0x3d0
  path_lookupat+0x68/0x180
  filename_lookup+0xdc/0x1e0
  vfs_statx+0x6c/0x140
  vfs_fstatat+0x67/0xa0
  __do_sys_newfstatat+0x24/0x60
  do_syscall_64+0x6a/0x230
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

This is reachable with reused cached negative dentries.  A Ceph lookup
or atomic_open can be handed a negative dentry that is already hashed,
and fs/ceph/dir.c then hits one of two paths that incorrectly assume
"negative" also means "unhashed":

  - ceph_finish_lookup():
      MDS reply is -ENOENT with no trace
      -> d_add(dentry, NULL)

  - ceph_lookup():
      local ENOENT fast path for a complete directory with shared caps
      -> d_add(dentry, NULL)

Both paths can therefore re-add an already-hashed negative dentry.

Ceph already uses the correct pattern elsewhere: ceph_fill_trace() only
calls d_add(dn, NULL) for a negative null-dentry reply when d_unhashed(dn)
is true.

Fix both fs/ceph/dir.c sites the same way: only call d_add() for a
negative dentry when it is actually unhashed.  If the negative dentry
is already hashed, leave it in place and reuse it as-is.

This preserves the existing behavior for unhashed dentries while
avoiding d_hash list corruption for reused hashed negatives.

Fixes: 2817b000b02c ("ceph: directory operations")
Cc: stable@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/ceph/dir.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index bac9cfb6b982..27ce9e55e947 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -769,7 +769,8 @@ struct dentry *ceph_finish_lookup(struct ceph_mds_request *req,
 				d_drop(dentry);
 				err = -ENOENT;
 			} else {
-				d_add(dentry, NULL);
+				if (d_unhashed(dentry))
+					d_add(dentry, NULL);
 			}
 		}
 	}
@@ -840,7 +841,8 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
 			spin_unlock(&ci->i_ceph_lock);
 			doutc(cl, " dir %llx.%llx complete, -ENOENT\n",
 			      ceph_vinop(dir));
-			d_add(dentry, NULL);
+			if (d_unhashed(dentry))
+				d_add(dentry, NULL);
 			di->lease_shared_gen = atomic_read(&ci->i_shared_gen);
 			return NULL;
 		}
-- 
2.47.3
Re: [PATCH] ceph: only d_add() negative dentries when they are unhashed
Posted by Viacheslav Dubeyko 5 days, 19 hours ago
On Fri, 2026-03-27 at 17:23 +0100, Max Kellermann wrote:
> Ceph can call d_add(dentry, NULL) on a negative dentry that is already
> present in the primary dcache hash.

<skipped>
Makes sense.

Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

Let me run xfstests for the patch to double check that everything works well.

Thanks,
Slava.
RE: [PATCH] ceph: only d_add() negative dentries when they are unhashed
Posted by Viacheslav Dubeyko 2 days, 21 hours ago
On Fri, 2026-03-27 at 18:44 +0000, Viacheslav Dubeyko wrote:
> On Fri, 2026-03-27 at 17:23 +0100, Max Kellermann wrote:
> > Ceph can call d_add(dentry, NULL) on a negative dentry that is already
> > present in the primary dcache hash.

<skipped>
> 
> Makes sense.
> 
> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> 
> Let me run xfstests for the patch to double check that everything works well.
> 

I had multiple xfstests issues during the last run. Most probably it was a
glitch on my side or an inconsistent build. I need to repeat the xfstests run
with the patch.

Thanks,
Slava.
RE: [PATCH] ceph: only d_add() negative dentries when they are unhashed
Posted by Viacheslav Dubeyko 1 day, 21 hours ago
On Mon, 2026-03-30 at 17:04 +0000, Viacheslav Dubeyko wrote:
> On Fri, 2026-03-27 at 18:44 +0000, Viacheslav Dubeyko wrote:
> > On Fri, 2026-03-27 at 17:23 +0100, Max Kellermann wrote:
> > > 

<skipped>

> > 
> > Let me run xfstests for the patch to double check that everything works well.
> > 
> 
> I had multiple xfstests issues during last run. Most probably, it was some
> glitch on my side or inconsistent build. I need to repeat the xfstests run with
> the patch.
> 

Sorry, it looks like 7.0-rc5 is in some inconsistent state. Let me switch to
7.0-rc1, because I was able to run xfstests successfully on 7.0-rc1 before.

Thanks,
Slava.