[RFC PATCH] vfs: limit directory child dentry retention

Ian Kent posted 1 patch 1 day, 22 hours ago
If there's a very large number of children present in a directory dentry
then the benefit of retaining stale child dentries for re-use can be
lost. Hashed lookup becomes less effective as hash chains grow, the time
taken to umount a file system can increase a lot, and child dentry
traversals can result in "lock held too long" log messages.

But when a directory dentry has a very large number of children, the
parent dentry reference count is dominated by the contribution of its
children. So it makes sense not to retain dentries if the parent
reference count is large.

Setting some large high-water mark (e.g. 500000) over which dentries
are discarded instead of retained on final dput() would help a lot
by preventing dentry caching from contributing to the problem.

Signed-off-by: Ian Kent <raven@themaw.net>
---
 Documentation/admin-guide/sysctl/fs.rst |  7 +++++++
 fs/dcache.c                             | 28 +++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 9b7f65c3efd8..7649254f2d0d 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -75,6 +75,13 @@ negative dentries which do not map to any files. Instead,
 they help speeding up rejection of non-existing files provided
 by the users.
 
+dir-stale-max
+-------------
+
+Limits the number of stale child dentries retained in a
+directory before the benefit of caching a dentry is negated by
+the cost of traversing hash buckets during lookups or enumerating
+the directory's children. Set to 500000 by default; 0 disables it.
 
 file-max & file-nr
 ------------------
diff --git a/fs/dcache.c b/fs/dcache.c
index 7ba1801d8132..298b4c3b1493 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -86,6 +86,14 @@ __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
 
+static unsigned long dsm_zero = 0;
+static unsigned long dsm_max = ULONG_MAX/2;
+
+/* High-water mark for the number of stale entries in a directory
+ * (loosely measured by the parent dentry reference count).
+ */
+static unsigned long dir_stale_max __read_mostly = 500000;
+
 static struct kmem_cache *__dentry_cache __ro_after_init;
 #define dentry_cache runtime_const_ptr(__dentry_cache)
 
@@ -216,6 +224,15 @@ static const struct ctl_table fs_dcache_sysctls[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "dir-stale-max",
+		.data		= &dir_stale_max,
+		.maxlen		= sizeof(dir_stale_max),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= &dsm_zero,
+		.extra2		= &dsm_max,
+	},
 };
 
 static const struct ctl_table vm_dcache_sysctls[] = {
@@ -768,6 +785,17 @@ static inline bool retain_dentry(struct dentry *dentry, bool locked)
 	if (unlikely(d_flags & DCACHE_DONTCACHE))
 		return false;
 
+	if (dir_stale_max) {
+		unsigned long p_count;
+
+		// If the parent reference count is higher than some large value
+		// it's dominated by the contribution of its children, so there's
+		// no benefit in caching the dentry over re-allocating it.
+		p_count = READ_ONCE(dentry->d_parent->d_lockref.count);
+		if (unlikely(p_count > dir_stale_max))
+			return false;
+	}
+
 	// At this point it looks like we ought to keep it.  We also might
 	// need to do something - put it on LRU if it wasn't there already
 	// and mark it referenced if it was on LRU, but not marked yet.
-- 
2.53.0
Re: [RFC PATCH] vfs: limit directory child dentry retention
Posted by Christian Brauner 1 day, 14 hours ago
On Tue, Mar 31, 2026 at 09:29:09AM +0800, Ian Kent wrote:
> If there's a very large number of children present in a directory dentry
> then the benifit from retaining stale child dentries for re-use can
> become ineffective. Even hashed lookup can become ineffective as hash
> chains grow, time taken to umount a file system can increase a lot, as
> well as child dentry traversals resulting in lock held too long log
> messages.

Fwiw, there's also e6957c99dca5 ("vfs: Add a sysctl for automated deletion of dentry")

This patch introduces the concept conditionally, where the associated
dentry is deleted only when the user explicitly opts for it during file
removal. A new sysctl fs.automated_deletion_of_dentry is added for this
purpose. Its default value is set to 0.

I have no massive objections to your approach. It feels a bit hacky tbh
as it seems to degrade performance for new workloads in favor of old
workloads. The LRU should sort this out though.

> But when a directory dentry has a very large number of children the
> parent dentry reference count is dominated by the contribution of its
> children. So it makes sense to not retain dentries if the parent
> reference count is large.
> 
> Setting some large high water mark (eg. 500000) over which dentries
> are discarded instead of retained on final dput() would help a lot
> by preventing dentry caching contributing to the problem.
> 
> Signed-off-by: Ian Kent <raven@themaw.net>
> ---
>  Documentation/admin-guide/sysctl/fs.rst |  7 +++++++
>  fs/dcache.c                             | 28 +++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
> index 9b7f65c3efd8..7649254f2d0d 100644
> --- a/Documentation/admin-guide/sysctl/fs.rst
> +++ b/Documentation/admin-guide/sysctl/fs.rst
> @@ -75,6 +75,13 @@ negative dentries which do not map to any files. Instead,
>  they help speeding up rejection of non-existing files provided
>  by the users.
>  
> +dir-stale-max
> +-------------
> +
> +Used to limit the number of stale child dentries retained in a
> +directory before the benifit of caching the dentry is negated by
> +the cost of traversing hash buckets during lookups or enumerating
> +the directory children. Initially set to 500000.
>  
>  file-max & file-nr
>  ------------------
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 7ba1801d8132..298b4c3b1493 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -86,6 +86,14 @@ __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
>  
>  EXPORT_SYMBOL(rename_lock);
>  
> +static long dsm_zero = 0;
> +static long dsm_max = ULONG_MAX/2;
> +
> +/* Highwater mark for number of stale entries in a directory (loosely
> + * measured by parent dentry reference count).
> + */
> +static unsigned long dir_stale_max __read_mostly = 500000;
> +
>  static struct kmem_cache *__dentry_cache __ro_after_init;
>  #define dentry_cache runtime_const_ptr(__dentry_cache)
>  
> @@ -216,6 +224,15 @@ static const struct ctl_table fs_dcache_sysctls[] = {
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= SYSCTL_ONE,
>  	},
> +	{
> +		.procname	= "dir-stale-max",
> +		.data		= &dir_stale_max,
> +		.maxlen		= sizeof(dir_stale_max),
> +		.mode		= 0644,
> +		.proc_handler	= proc_doulongvec_minmax,
> +		.extra1		= &dsm_zero,
> +		.extra2		= &dsm_max,
> +	},
>  };
>  
>  static const struct ctl_table vm_dcache_sysctls[] = {
> @@ -768,6 +785,17 @@ static inline bool retain_dentry(struct dentry *dentry, bool locked)
>  	if (unlikely(d_flags & DCACHE_DONTCACHE))
>  		return false;
>  
> +	if (dir_stale_max) {
> +		unsigned long p_count;
> +
> +		// If the parent reference count is higher than some large value
> +		// its dominated by the contribution of its children so there's
> +		// no benefit caching the dentry over re-allocating it.
> +		p_count = READ_ONCE(dentry->d_parent->d_lockref.count);
> +		if (unlikely(p_count > dir_stale_max))
> +			return false;
> +	}
> +
>  	// At this point it looks like we ought to keep it.  We also might
>  	// need to do something - put it on LRU if it wasn't there already
>  	// and mark it referenced if it was on LRU, but not marked yet.
> -- 
> 2.53.0
>
Re: [RFC PATCH] vfs: limit directory child dentry retention
Posted by Ian Kent 21 hours ago
On 31/3/26 17:39, Christian Brauner wrote:
> On Tue, Mar 31, 2026 at 09:29:09AM +0800, Ian Kent wrote:
>> If there's a very large number of children present in a directory dentry
>> then the benifit from retaining stale child dentries for re-use can
>> become ineffective. Even hashed lookup can become ineffective as hash
>> chains grow, time taken to umount a file system can increase a lot, as
>> well as child dentry traversals resulting in lock held too long log
>> messages.
> Fwiw, there's also e6957c99dca5 ("vfs: Add a sysctl for automated deletion of dentry")

I'm pretty sure I saw that earlier on but had forgotten about it when I
reviewed the bug this time around. It is essentially 681ce8623567 ("vfs:
Delete the associated dentry when deleting a file") with opt-in of course.


>
> This patch introduces the concept conditionally, where the associated
> dentry is deleted only when the user explicitly opts for it during file
> removal. A new sysctl fs.automated_deletion_of_dentry is added for this
> purpose. Its default value is set to 0.

I meant to update Documentation/admin-guide/sysctl/fs.rst to also say
that setting dir-stale-max to 0 disables it. I also get the impression
you might feel better about this if the default was 0 as well.


The thing that I don't much like with the d_delete() approach is that
it fails to cater for files that have been closed and are otherwise
unused, whose dentries make their way to the LRU, eventually resulting
in the bad behaviour being discussed.


>
> I have no massive objections to your approach. It feels a bit hacky tbh
> as it seems to degrade performance for new workloads in favor old
> workloads. The LRU should sort this out though.

My aim was to improve performance so I'm a bit puzzled by the comment.

The problem is the sheer number of dentry objects and the consequences
of that. Hash table chains growing will affect performance, umounting
the mount will take ages, and there are cases of child dentry traversals
in the VFS. Once you get a large number of stale dentries that
necessarily need to stay linked into the structures to get the benefit
of caching, you're exposed to this problem.

The LRU mechanism is so far unable to cope with this.


Ian

>
>> But when a directory dentry has a very large number of children the
>> parent dentry reference count is dominated by the contribution of its
>> children. So it makes sense to not retain dentries if the parent
>> reference count is large.
>>
>> Setting some large high water mark (eg. 500000) over which dentries
>> are discarded instead of retained on final dput() would help a lot
>> by preventing dentry caching contributing to the problem.
>>
>> Signed-off-by: Ian Kent <raven@themaw.net>
>> ---
>>   Documentation/admin-guide/sysctl/fs.rst |  7 +++++++
>>   fs/dcache.c                             | 28 +++++++++++++++++++++++++
>>   2 files changed, 35 insertions(+)
>>
>> diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
>> index 9b7f65c3efd8..7649254f2d0d 100644
>> --- a/Documentation/admin-guide/sysctl/fs.rst
>> +++ b/Documentation/admin-guide/sysctl/fs.rst
>> @@ -75,6 +75,13 @@ negative dentries which do not map to any files. Instead,
>>   they help speeding up rejection of non-existing files provided
>>   by the users.
>>   
>> +dir-stale-max
>> +-------------
>> +
>> +Used to limit the number of stale child dentries retained in a
>> +directory before the benifit of caching the dentry is negated by
>> +the cost of traversing hash buckets during lookups or enumerating
>> +the directory children. Initially set to 500000.
>>   
>>   file-max & file-nr
>>   ------------------
>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index 7ba1801d8132..298b4c3b1493 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -86,6 +86,14 @@ __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
>>   
>>   EXPORT_SYMBOL(rename_lock);
>>   
>> +static long dsm_zero = 0;
>> +static long dsm_max = ULONG_MAX/2;
>> +
>> +/* Highwater mark for number of stale entries in a directory (loosely
>> + * measured by parent dentry reference count).
>> + */
>> +static unsigned long dir_stale_max __read_mostly = 500000;
>> +
>>   static struct kmem_cache *__dentry_cache __ro_after_init;
>>   #define dentry_cache runtime_const_ptr(__dentry_cache)
>>   
>> @@ -216,6 +224,15 @@ static const struct ctl_table fs_dcache_sysctls[] = {
>>   		.extra1		= SYSCTL_ZERO,
>>   		.extra2		= SYSCTL_ONE,
>>   	},
>> +	{
>> +		.procname	= "dir-stale-max",
>> +		.data		= &dir_stale_max,
>> +		.maxlen		= sizeof(dir_stale_max),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_doulongvec_minmax,
>> +		.extra1		= &dsm_zero,
>> +		.extra2		= &dsm_max,
>> +	},
>>   };
>>   
>>   static const struct ctl_table vm_dcache_sysctls[] = {
>> @@ -768,6 +785,17 @@ static inline bool retain_dentry(struct dentry *dentry, bool locked)
>>   	if (unlikely(d_flags & DCACHE_DONTCACHE))
>>   		return false;
>>   
>> +	if (dir_stale_max) {
>> +		unsigned long p_count;
>> +
>> +		// If the parent reference count is higher than some large value
>> +		// its dominated by the contribution of its children so there's
>> +		// no benefit caching the dentry over re-allocating it.
>> +		p_count = READ_ONCE(dentry->d_parent->d_lockref.count);
>> +		if (unlikely(p_count > dir_stale_max))
>> +			return false;
>> +	}
>> +
>>   	// At this point it looks like we ought to keep it.  We also might
>>   	// need to do something - put it on LRU if it wasn't there already
>>   	// and mark it referenced if it was on LRU, but not marked yet.
>> -- 
>> 2.53.0
>>
Re: [RFC PATCH] vfs: limit directory child dentry retention
Posted by Gao Xiang 1 day, 13 hours ago
Hi,

On 2026/3/31 17:39, Christian Brauner wrote:
> On Tue, Mar 31, 2026 at 09:29:09AM +0800, Ian Kent wrote:
>> If there's a very large number of children present in a directory dentry
>> then the benifit from retaining stale child dentries for re-use can
>> become ineffective. Even hashed lookup can become ineffective as hash
>> chains grow, time taken to umount a file system can increase a lot, as
>> well as child dentry traversals resulting in lock held too long log
>> messages.
> 
> Fwiw, there's also e6957c99dca5 ("vfs: Add a sysctl for automated deletion of dentry")
> 
> This patch introduces the concept conditionally, where the associated
> dentry is deleted only when the user explicitly opts for it during file
> removal. A new sysctl fs.automated_deletion_of_dentry is added for this
> purpose. Its default value is set to 0.
> 
> I have no massive objections to your approach. It feels a bit hacky tbh
> as it seems to degrade performance for new workloads in favor old
> workloads. The LRU should sort this out though.

JFYI, another issue we once observed on user workloads is that
`d_lockref.count` can exceed `int` on very, very large directories in
reality (also combined with cached negative dentries).

It can be a real overflow; this commit can help but it doesn't strictly
resolve this, anyway.

Thanks,
Gao Xiang
Re: [RFC PATCH] vfs: limit directory child dentry retention
Posted by Ian Kent 22 hours ago
On 31/3/26 17:54, Gao Xiang wrote:
> Hi,
>
> On 2026/3/31 17:39, Christian Brauner wrote:
>> On Tue, Mar 31, 2026 at 09:29:09AM +0800, Ian Kent wrote:
>>> If there's a very large number of children present in a directory 
>>> dentry
>>> then the benifit from retaining stale child dentries for re-use can
>>> become ineffective. Even hashed lookup can become ineffective as hash
>>> chains grow, time taken to umount a file system can increase a lot, as
>>> well as child dentry traversals resulting in lock held too long log
>>> messages.
>>
>> Fwiw, there's also e6957c99dca5 ("vfs: Add a sysctl for automated 
>> deletion of dentry")
>>
>> This patch introduces the concept conditionally, where the associated
>> dentry is deleted only when the user explicitly opts for it during file
>> removal. A new sysctl fs.automated_deletion_of_dentry is added for this
>> purpose. Its default value is set to 0.
>>
>> I have no massive objections to your approach. It feels a bit hacky tbh
>> as it seems to degrade performance for new workloads in favor old
>> workloads. The LRU should sort this out though.
>
> JFYI, another issue we once observed on user workloads is that
>
> `d_lockref.count` can exceed `int` on very very large
> directories in reality (also combined with cached
> negative dentries).

Ouch!

So more than 2 billion?

I suspect in that case you have much bigger problems than 7 or 8
million dentries on the LRU list and linked into the directory.


>
> It can be a real overflow, this commit can help but it
> doesn't strictly resolve this, anyway.
>
> Thanks,
> Gao Xiang
Re: [RFC PATCH] vfs: limit directory child dentry retention
Posted by Gao Xiang 21 hours ago

On 2026/4/1 09:38, Ian Kent wrote:
> On 31/3/26 17:54, Gao Xiang wrote:
>> Hi,
>>
>> On 2026/3/31 17:39, Christian Brauner wrote:
>>> On Tue, Mar 31, 2026 at 09:29:09AM +0800, Ian Kent wrote:
>>>> If there's a very large number of children present in a directory dentry
>>>> then the benifit from retaining stale child dentries for re-use can
>>>> become ineffective. Even hashed lookup can become ineffective as hash
>>>> chains grow, time taken to umount a file system can increase a lot, as
>>>> well as child dentry traversals resulting in lock held too long log
>>>> messages.
>>>
>>> Fwiw, there's also e6957c99dca5 ("vfs: Add a sysctl for automated deletion of dentry")
>>>
>>> This patch introduces the concept conditionally, where the associated
>>> dentry is deleted only when the user explicitly opts for it during file
>>> removal. A new sysctl fs.automated_deletion_of_dentry is added for this
>>> purpose. Its default value is set to 0.
>>>
>>> I have no massive objections to your approach. It feels a bit hacky tbh
>>> as it seems to degrade performance for new workloads in favor old
>>> workloads. The LRU should sort this out though.
>>
>> JFYI, another issue we once observed on user workloads is that
>>
>> `d_lockref.count` can exceed `int` on very very large
>> directories in reality (also combined with cached
>> negative dentries).
> 
> Ouch!
> 
> So more than 2 Billion?

We received some reports.

> 
> I suspect in that case you have much bigger problems than 7 or 8
> 
> million dentries on the LRU list and linked into the directory.

That shrinker seemed not to be triggered at all since the memory was
abundant on those bare-metal machines; I don't see how it cannot happen
with enough memory and negative lookups triggered on a directory, for
example.

However, it was a report from quite a few years ago, but I remember it
was a real user report (`d_lockref.count` overflowed).

Thanks,
Gao Xiang

> 
> 
>>
>> It can be a real overflow, this commit can help but it
>> doesn't strictly resolve this, anyway.
>>
>> Thanks,
>> Gao Xiang
Re: [RFC PATCH] vfs: limit directory child dentry retention
Posted by Ian Kent 21 hours ago
On 1/4/26 09:47, Gao Xiang wrote:
>
>
> On 2026/4/1 09:38, Ian Kent wrote:
>> On 31/3/26 17:54, Gao Xiang wrote:
>>> Hi,
>>>
>>> On 2026/3/31 17:39, Christian Brauner wrote:
>>>> On Tue, Mar 31, 2026 at 09:29:09AM +0800, Ian Kent wrote:
>>>>> If there's a very large number of children present in a directory 
>>>>> dentry
>>>>> then the benifit from retaining stale child dentries for re-use can
>>>>> become ineffective. Even hashed lookup can become ineffective as hash
>>>>> chains grow, time taken to umount a file system can increase a 
>>>>> lot, as
>>>>> well as child dentry traversals resulting in lock held too long log
>>>>> messages.
>>>>
>>>> Fwiw, there's also e6957c99dca5 ("vfs: Add a sysctl for automated 
>>>> deletion of dentry")
>>>>
>>>> This patch introduces the concept conditionally, where the associated
>>>> dentry is deleted only when the user explicitly opts for it during 
>>>> file
>>>> removal. A new sysctl fs.automated_deletion_of_dentry is added for 
>>>> this
>>>> purpose. Its default value is set to 0.
>>>>
>>>> I have no massive objections to your approach. It feels a bit hacky 
>>>> tbh
>>>> as it seems to degrade performance for new workloads in favor old
>>>> workloads. The LRU should sort this out though.
>>>
>>> JFYI, another issue we once observed on user workloads is that
>>>
>>> `d_lockref.count` can exceed `int` on very very large
>>> directories in reality (also combined with cached
>>> negative dentries).

Yeah, that's a problem for sure.

I hadn't considered such a large number of dentries so I wasn't trying
to resolve this case, and I guess the change here would only postpone
the need to re-think the dcache design, which I suspect is what would
be needed.

Ian

>>
>> Ouch!
>>
>> So more than 2 Billion?
>
> We received some report.
>
>>
>> I suspect in that case you have much bigger problems than 7 or 8
>>
>> million dentries on the LRU list and linked into the directory.
>
> That shrinker seemed not to be triggered at all
> since the memory was abundant on those bare
> metals; I don't see how it cannot happen with
> enough memory and trigger negative lookups on
> a directory for example.
>
> However, it was a report quite few years ago, but
> I remembered it was a real user report
> (`d_lockref.count` overflowed).
>
> Thanks,
> Gao Xiang
>
>>
>>
>>>
>>> It can be a real overflow, this commit can help but it
>>> doesn't strictly resolve this, anyway.
>>>
>>> Thanks,
>>> Gao Xiang
>
Re: [RFC PATCH] vfs: limit directory child dentry retention
Posted by Linus Torvalds 21 hours ago
On Tue, 31 Mar 2026 at 19:21, Ian Kent <raven@themaw.net> wrote:
>
> On 1/4/26 09:47, Gao Xiang wrote:
> >>>
> >>> `d_lockref.count` can exceed `int` on very very large
> >>> directories in reality (also combined with cached
> >>> negative dentries).
>
> I hadn't considered such a large number of dentries so I wasn't
> trying to resolve this case and I guess the change here would
> only postpone the need to re-think dcache design which I suspect
> is what would be needed.

I think it should be trivial to limit the lockref count. We did that
for the page count, and it wasn't all that hard: see try_get_page().

It doesn't even require complicated atomic sequences, because you
don't have to be very precise. If things get close to being too large,
you just fail it. And you don't fail every kind of operation, you only
fail the ones that are accessible to users as a way to artificially
inflate the numbers.

In the case of page counts, it was things like splicing the same page
over and over again, so the only operation that actually needed that
"stop at big numbers" was generic_pipe_buf_get().

I'm not sure how you make up a large number of dentries in directories
if we just have that limit on negative dentries (which seems
reasonable).

So I think this is very analogous to that page count thing.

            Linus
Re: [RFC PATCH] vfs: limit directory child dentry retention
Posted by Mateusz Guzik 1 day, 8 hours ago
On Tue, Mar 31, 2026 at 05:54:01PM +0800, Gao Xiang wrote:
> JFYI, another issue we once observed on user workloads is that
> 
> `d_lockref.count` can exceed `int` on very very large
> directories in reality (also combined with cached
> negative dentries).
> 
> It can be a real overflow, this commit can help but it
> doesn't strictly resolve this, anyway.


Another way to contribute to the problem is to mass open the same file,
which results in one ref per fd.

Or to put it differently, sooner or later the dentry refcount will have
to switch to 64 bits on 64 bit systems.

There are 2 issues with it that I see:
1. no space

struct dentry is 192 bytes in size without any holes so growing is an
eyebrow-raiser.

Space can be freed by either lowering the size of shortname_store 40 ->
32 bytes or converting d_hash linkage to be singly-linked. The latter
means hash removals turn O(n) from O(1), but that very traversal is
already there to find the dentry during lookup. Thus if it constitutes a
problem, things are already bad in the sizing or hashing department.

2. lockref itself

If one was to implement a lockref variant with an 8 byte count and a 4
byte spinlock, one would need to use 16 byte atomics and that's
atrocious af performance-wise.

Perhaps it would be feasible to hack the lock as a bit in the count, but
I don't think that's warranted.

The good news here is that lockref is already a performance problem
because of cmpxchg loops on both sides of ref/unref and AFAICS there is
a perfectly sensible way to move away from it.

Mandatory remark that numerous common syscalls can avoid the ref trip
in the common case, but getting there requires a lot of rototoiling in
LSM code.

So the fastest thing would be lock xadd on both sides of course, but
going that far from the get-go is asking for trouble because of baked-in
assumptions about no 0->1 transitions when d_lock is held.

Instead, a scheme which is already way faster than the current thing would
"lock cmpxchg" to grab the ref and "lock xadd" to release it, with a
dedicated bit spent to temporarily block lockless operation on the ref
side (any place which wants to keep the ref at 0 would have to issue an
atomic to freeze it, and then the current guarantee is provided).

There is no significant difficulty here as far as complexity goes, but
there is a lot of prerequisite churn to go through -- lockref use is
open-coded all over and the count access is inconsistently either doing
a raw load or going through d_count().

I had a WIP patch to do it, but other churn-ey changes in dcache mean it
needs to be redone from scratch.

Maybe I'll get around to doing it.
Re: [RFC PATCH] vfs: limit directory child dentry retention
Posted by Gao Xiang 1 day, 8 hours ago
Hi,

On 2026/3/31 22:59, Mateusz Guzik wrote:
> On Tue, Mar 31, 2026 at 05:54:01PM +0800, Gao Xiang wrote:
>> JFYI, another issue we once observed on user workloads is that
>>
>> `d_lockref.count` can exceed `int` on very very large
>> directories in reality (also combined with cached
>> negative dentries).
>>
>> It can be a real overflow, this commit can help but it
>> doesn't strictly resolve this, anyway.
> 
> 
> Another way to contribute to the problem is to mass open the same file,
> which results in one ref per fd.
> 
> Or to put it differently, sooner or later the dentry refcount will have
> to switch to 64 bits on 64 bit systems.

Yes, but my own basic question on this is: do we
really need a 64-bit refcount for each dentry?

  - do we need to cache so many child dentries at the
    same time? some real use case?

  - do we need to cache so many negative dentries for
    a single directory?

  - do we need to really care about mass opening the same file?

or just find a way to error out blindly when the
refcount is nearly overflowed? together with this
retain_dentry() change to keep cached dentries
below a low watermark.

just my 2 cents, since I don't currently work on
vfs stuff.

Thanks,
Gao Xiang