[RFC PATCH] ovl: keep merged and impure readdir caches separate

Nirmoy Das posted 1 patch 1 month ago
fs/overlayfs/readdir.c | 38 ++++++++++++++++++++++++++------------
1 file changed, 26 insertions(+), 12 deletions(-)
[RFC PATCH] ovl: keep merged and impure readdir caches separate
Posted by Nirmoy Das 1 month ago
Overlayfs uses one inode cache slot for two readdir cache users with
different lifetime rules. Merged directory iteration pins the cache from
open directory files with cache->refcount. Impure real-directory iteration
uses the inode cache as an unrefcounted lookup table.

Those caches cannot be reused interchangeably. If merged iteration finds
an impure cache in the inode slot, it can pin and seek through a cache
that was not built for merged iteration. If impure iteration finds a merged
cache, it can walk an object whose lifetime is controlled by open directory
files. Either direction can leave ovl_iterate() using stale cache entries.

Add ovl_dir_cache_drop() to detach the inode cache before freeing it. Keep
refcounted merged caches alive until ovl_cache_put(), stop publishing new
merged caches through the inode slot, and let impure iteration reuse only
unrefcounted caches.

Fixes: 4edb83bb1041 ("ovl: constant d_ino for non-merge dirs")
Reported-by: syzbot+a16fb0cce329a320661c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a16fb0cce329a320661c
Assisted-by: Codex:GPT-5 [lore] [checkpatch]
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
---
 fs/overlayfs/readdir.c | 38 ++++++++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 1dcc75b3a90f9..326d8ad173881 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -292,6 +292,27 @@ void ovl_dir_cache_free(struct inode *inode)
 	}
 }
 
+static void ovl_dir_cache_drop(struct inode *inode)
+{
+	struct ovl_dir_cache *cache = ovl_dir_cache(inode);
+
+	if (!cache)
+		return;
+
+	ovl_set_dir_cache(inode, NULL);
+
+	/*
+	 * Merged dir caches are refcounted by open directory files.  If the
+	 * inode cache is replaced while such a file still references it, keep
+	 * the old cache alive until ovl_cache_put().
+	 */
+	if (cache->refcount)
+		return;
+
+	ovl_cache_free(&cache->entries);
+	kfree(cache);
+}
+
 static void ovl_cache_put(struct ovl_dir_file *od, struct inode *inode)
 {
 	struct ovl_dir_cache *cache = od->cache;
@@ -485,13 +506,7 @@ static struct ovl_dir_cache *ovl_cache_get(struct dentry *dentry)
 	struct ovl_dir_cache *cache;
 	struct inode *inode = d_inode(dentry);
 
-	cache = ovl_dir_cache(inode);
-	if (cache && ovl_inode_version_get(inode) == cache->version) {
-		WARN_ON(!cache->refcount);
-		cache->refcount++;
-		return cache;
-	}
-	ovl_set_dir_cache(d_inode(dentry), NULL);
+	ovl_dir_cache_drop(inode);
 
 	cache = kzalloc_obj(struct ovl_dir_cache);
 	if (!cache)
@@ -509,7 +524,6 @@ static struct ovl_dir_cache *ovl_cache_get(struct dentry *dentry)
 	}
 
 	cache->version = ovl_inode_version_get(inode);
-	ovl_set_dir_cache(inode, cache);
 
 	return cache;
 }
@@ -699,12 +713,12 @@ static struct ovl_dir_cache *ovl_cache_get_impure(const struct path *path)
 	struct ovl_dir_cache *cache;
 
 	cache = ovl_dir_cache(inode);
-	if (cache && ovl_inode_version_get(inode) == cache->version)
+	if (cache && !cache->refcount &&
+	    ovl_inode_version_get(inode) == cache->version)
 		return cache;
 
-	/* Impure cache is not refcounted, free it here */
-	ovl_dir_cache_free(inode);
-	ovl_set_dir_cache(inode, NULL);
+	/* Drop stale or incompatible inode cache before building impure cache */
+	ovl_dir_cache_drop(inode);
 
 	cache = kzalloc_obj(struct ovl_dir_cache);
 	if (!cache)

base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
-- 
2.43.0
Re: [RFC PATCH] ovl: keep merged and impure readdir caches separate
Posted by Amir Goldstein 1 month ago
Hi Nirmoy,

Thanks for the patch.

On Mon, May 11, 2026 at 8:21 AM Nirmoy Das <nirmoyd@nvidia.com> wrote:
>
> Overlayfs uses one inode cache slot for two readdir cache users with
> different lifetime rules. Merged directory iteration pins the cache from
> open directory files with cache->refcount. Impure real-directory iteration
> uses the inode cache as an unrefcounted lookup table.
>
> Those caches cannot be reused interchangeably. If merged iteration finds
> an impure cache in the inode slot, it can pin and seek through a cache
> that was not built for merged iteration. If impure iteration finds a merged
> cache, it can walk an object whose lifetime is controlled by open directory
> files. Either direction can leave ovl_iterate() using stale cache entries.
>
> Add ovl_dir_cache_drop() to detach the inode cache before freeing it. Keep
> refcounted merged caches alive until ovl_cache_put(), stop publishing new
> merged caches through the inode slot,

This is unacceptable - invalidating merged dir cache to paper over
another root bug, which was not really fixed.

> and let impure iteration reuse only
> unrefcounted caches.

All those guards are nice, but how does a directory inode change from
being merged to impure or vice versa?
This should never happen.

It took a lot of arguing with Sonnet about wrong leads to find the real bug.

Real bug was introduced by:
b79e05aaa1667 ("ovl: no direct iteration for dir with origin xattr")

It changed the test from:

od->is_real = !OVL_TYPE_MERGE(type);
od->is_upper = OVL_TYPE_UPPER(type);

to:

od->is_real = !ovl_test_flag(OVL_WHITEOUTS, d_inode(dir));
od->is_upper = OVL_TYPE_UPPER(type);

But there is a race window in copy up of a directory where
upper is set and published before the OVL_WHITEOUTS flag
is set.

an opendir() observing this state inside the race window will
wrongly start to iterate_real and another opendir later will
observe the flag and start iterate_merged - boom!

Attached is what I think is a correct fix.
WDYT?

Thanks,
Amir.
Re: [RFC PATCH] ovl: keep merged and impure readdir caches separate
Posted by Nirmoy Das 1 month ago
Hi Amir,

On 11.05.26 23:54, Amir Goldstein wrote:
> Hi Nirmoy,
>
> Thanks for the patch.
>
> On Mon, May 11, 2026 at 8:21 AM Nirmoy Das <nirmoyd@nvidia.com> wrote:
>> Overlayfs uses one inode cache slot for two readdir cache users with
>> different lifetime rules. Merged directory iteration pins the cache from
>> open directory files with cache->refcount. Impure real-directory iteration
>> uses the inode cache as an unrefcounted lookup table.
>>
>> Those caches cannot be reused interchangeably. If merged iteration finds
>> an impure cache in the inode slot, it can pin and seek through a cache
>> that was not built for merged iteration. If impure iteration finds a merged
>> cache, it can walk an object whose lifetime is controlled by open directory
>> files. Either direction can leave ovl_iterate() using stale cache entries.
>>
>> Add ovl_dir_cache_drop() to detach the inode cache before freeing it. Keep
>> refcounted merged caches alive until ovl_cache_put(), stop publishing new
>> merged caches through the inode slot,
> This is unacceptable - invalidating merged dir cache to paper over
> another root bug, which was not really fixed.
>
>> and let impure iteration reuse only
>> unrefcounted caches.
> All those guards are nice, but how does a directory inode change from
> being merged to impure or vice versa?
> This should never happen.
>
> It took a lot of arguing with Sonnet about wrong leads to find the real bug.
>
> Real bug was introduced by:
> b79e05aaa1667 ("ovl: no direct iteration for dir with origin xattr")
>
> It changed the test from:
>
> od->is_real = !OVL_TYPE_MERGE(type);
> od->is_upper = OVL_TYPE_UPPER(type);
>
> to:
>
> od->is_real = !ovl_test_flag(OVL_WHITEOUTS, d_inode(dir));
> od->is_upper = OVL_TYPE_UPPER(type);
>
> But there is a race window in copy up of a directory where
> upper is set and published before the OVL_WHITEOUTS flag
> is set.
>
> an opendir() observing this state inside the race window will
> wrongly start to iterate_real and another opendir later will
> observe the flag and start iterate_merged - boom!
>
> Attached is what I think is a correct fix.
> WDYT?

This looks correct with my limited understanding of the overlay code. I 
still see the issue when running with

the syzkaller-derived C reproducer in a loop inside an arm64 virtme/qemu 
VM with a KASAN/debug kernel.


Let me try to debug it more and get back.

Regards,

Nirmoy

>
> Thanks,
> Amir.
Re: [RFC PATCH] ovl: keep merged and impure readdir caches separate
Posted by Amir Goldstein 1 month ago
On Tue, May 12, 2026 at 8:28 PM Nirmoy Das <nirmoyd@nvidia.com> wrote:
>
> Hi Amir,
>
> On 11.05.26 23:54, Amir Goldstein wrote:
> > Hi Nirmoy,
> >
> > Thanks for the patch.
> >
> > On Mon, May 11, 2026 at 8:21 AM Nirmoy Das <nirmoyd@nvidia.com> wrote:
> >> Overlayfs uses one inode cache slot for two readdir cache users with
> >> different lifetime rules. Merged directory iteration pins the cache from
> >> open directory files with cache->refcount. Impure real-directory iteration
> >> uses the inode cache as an unrefcounted lookup table.
> >>
> >> Those caches cannot be reused interchangeably. If merged iteration finds
> >> an impure cache in the inode slot, it can pin and seek through a cache
> >> that was not built for merged iteration. If impure iteration finds a merged
> >> cache, it can walk an object whose lifetime is controlled by open directory
> >> files. Either direction can leave ovl_iterate() using stale cache entries.
> >>
> >> Add ovl_dir_cache_drop() to detach the inode cache before freeing it. Keep
> >> refcounted merged caches alive until ovl_cache_put(), stop publishing new
> >> merged caches through the inode slot,
> > This is unacceptable - invalidating merged dir cache to paper over
> > another root bug, which was not really fixed.
> >
> >> and let impure iteration reuse only
> >> unrefcounted caches.
> > All those guards are nice, but how does a directory inode change from
> > being merged to impure or vice versa?
> > This should never happen.
> >
> > It took a lot of arguing with Sonnet about wrong leads to find the real bug.
> >
> > Real bug was introduced by:
> > b79e05aaa1667 ("ovl: no direct iteration for dir with origin xattr")
> >
> > It changed the test from:
> >
> > od->is_real = !OVL_TYPE_MERGE(type);
> > od->is_upper = OVL_TYPE_UPPER(type);
> >
> > to:
> >
> > od->is_real = !ovl_test_flag(OVL_WHITEOUTS, d_inode(dir));
> > od->is_upper = OVL_TYPE_UPPER(type);
> >
> > But there is a race window in copy up of a directory where
> > upper is set and published before the OVL_WHITEOUTS flag
> > is set.
> >
> > an opendir() observing this state inside the race window will
> > wrongly start to iterate_real and another opendir later will
> > observe the flag and start iterate_merged - boom!
> >
> > Attached is what I think is a correct fix.
> > WDYT?
>
> This looks correct with my limited understanding of the overlay code. I
> still see the issue when running with
>
> the syzkaller-derived C reproducer in a loop inside an arm64 virtme/qemu
> VM with a KASAN/debug kernel.
>
>
> Let me try to debug it more and get back.

Thanks! I appreciate it.
I wasn't sure how reliably the reproducer works.

If you do find a fix to the root cause, do feel free to use my patch as is
or modified and post with changes and proper commit message.

b.t.w I am not against adding WARN_ON() assertions in readdir
code to make sure we do not hit a problem like this again,
but as assertions, not as code that looks like it should really
work in this situation (i.e. no dropping of wrong cache just error).

FYI, I had contemplated an alternative fix, but went with the
smallest one:

implement ovl_i_path_type(struct inode *inode)
and make ovl_path_type(struct dentry *dentry) a wrapper

static inline bool ovl_dir_is_real(struct inode *dir)
{
        return !ovl_test_flag(OVL_WHITEOUTS, dir) &&
                   !OVL_TYPE_MERGE(ovl_i_path_type(dir));
}

This fix is a bit nicer to understand and the OVL_WHITEOUTS
flags becomes an optimization for OVL_TYPE_MERGE()
in addition to handling the remnant whiteouts case.

If this happens to survive the reproducer for some reason,
do feel free to post it instead of my patch.

Thanks,
Amir.
Re: [RFC PATCH] ovl: keep merged and impure readdir caches separate
Posted by Nirmoy Das 4 weeks, 1 day ago
Hi Amir,

After a lot of debug and traces I found another bug in
ovl_iterate_merged(): err is set from PTR_ERR(cache) before the
IS_ERR(cache) check, so on success err holds the truncated cache
pointer.

Claude generated this: 
  getdents64
  └── iterate_dir(outer_overlay_file)
      └── ovl_iterate_merged                                    [OUTER]
          │
          ├── cache = ovl_cache_get(dentry):                    [OUTER ovl_cache_get]
          │   │   cache_local = kzalloc(...)
          │   │   res = ovl_dir_read_merged(...)
          │   │   │
          │   │   └── iterate_dir(inner_overlay_file)
          │   │       └── ovl_iterate_merged                    [INNER]
          │   │           ├── cache_inner = ovl_cache_get(...)  valid 0xFFFF8881_CAC4D940
          │   │           ├── err = PTR_ERR(cache_inner)        = -893068992 (low32 of ptr)
          │   │           ├── IS_ERR(cache_inner) → FALSE
          │   │           └── return err;                       ← leaked stale int
          │   │
          │   └── return ERR_PTR(res);                          res = -893068992 (int)
          │                                                     sign-extended →
          │                                                     (void *)0xFFFFFFFF_CAC4D940
          │
          ├── err = PTR_ERR(cache);                             err = -893068992
          ├── if (IS_ERR(cache))   ← FALSE: IS_ERR only trips on top 4K
          │                                  (errno window);
          │                                  0xFFFFFFFF_CAC4D940 sits below
          │       return err;                                    ← skipped
          ├── od->cache = cache;   ★ corrupted pointer stored
          └── ovl_seek_cursor(od, ctx->pos)
                  list_for_each(p, &od->cache->entries)
                  p = *(&od->cache->entries) on bad pointer     ★ PAGE FAULT

Sent it as a separate patch: "[PATCH] ovl: keep err zero after
successful ovl_cache_get()".

Let me know what you think.

Regards,
Nirmoy
Re: [RFC PATCH] ovl: keep merged and impure readdir caches separate
Posted by Amir Goldstein 4 weeks, 1 day ago
On Thu, May 14, 2026 at 2:14 PM Nirmoy Das <nirmoyd@nvidia.com> wrote:
>
> Hi Amir,
>
> After a lot of debug and traces I found another bug in

Thanks for persisting on this problem!

> ovl_iterate_merged(): err is set from PTR_ERR(cache) before the
> IS_ERR(cache) check, so on success err holds the truncated cache
> pointer.
>
> Claude generated this:
>   getdents64
>   └── iterate_dir(outer_overlay_file)
>       └── ovl_iterate_merged                                    [OUTER]
>           │
>           ├── cache = ovl_cache_get(dentry):                    [OUTER ovl_cache_get]
>           │   │   cache_local = kzalloc(...)
>           │   │   res = ovl_dir_read_merged(...)
>           │   │   │
>           │   │   └── iterate_dir(inner_overlay_file)
>           │   │       └── ovl_iterate_merged                    [INNER]
>           │   │           ├── cache_inner = ovl_cache_get(...)  valid 0xFFFF8881_CAC4D940
>           │   │           ├── err = PTR_ERR(cache_inner)        = -893068992 (low32 of ptr)
>           │   │           ├── IS_ERR(cache_inner) → FALSE
>           │   │           └── return err;                       ← leaked stale int
>           │   │
>           │   └── return ERR_PTR(res);                          res = -893068992 (int)
>           │                                                     sign-extended →
>           │                                                     (void *)0xFFFFFFFF_CAC4D940

Wow! This is demonic.
I think we need to fortify ERR_PTR() to not accept illegal errno values.
I'll try to write a patch.

>           │
>           ├── err = PTR_ERR(cache);                             err = -893068992
>           ├── if (IS_ERR(cache))   ← FALSE: IS_ERR only trips on top 4K
>           │                                  (errno window);
>           │                                  0xFFFFFFFF_CAC4D940 sits below
>           │       return err;                                    ← skipped
>           ├── od->cache = cache;   ★ corrupted pointer stored
>           └── ovl_seek_cursor(od, ctx->pos)
>                   list_for_each(p, &od->cache->entries)
>                   p = *(&od->cache->entries) on bad pointer     ★ PAGE FAULT
>

Oh boy!

> Sent it as a separate patch: "[PATCH] ovl: keep err zero after
> successful ovl_cache_get()".
>
> Let me know what you think.

Patch looks great - real root cause, gave you minor comments.

IIUC, as far as you know, the repro never triggered the theoretic
OVL_WHITEOUTS race?

It was always just this ERR_PTR(PTR_ERR()) black magic?

Thanks,
Amir.