During reclaim, file-mapped folio callbacks can be invoked through the
address_space_operations "aops" interface. In one specific case involving
btrfs, the release_folio() callback is invoked and the btrfs-specific private
data is freed. Afterward, continuing in the reclaim path, the folio is
freed from the page cache if its refcount has reached zero. Because there
is a window between the freeing of the private data and the refcount check,
it is possible for another task to increment the refcount in parallel, in
which case the folio remains in the page cache with NULL private data. That
other task then acquires the folio and must take precautions when accessing
the private field.
There is existing code that is aware of this. In some of these places, a
NULL check is performed on the private field and, if it is absent, the
private data gets recreated. Surrounding comments refer to the race of
freeing the private data. For example:
/*
 * We unlock the page after the io is completed and then re-lock it
 * above. release_folio() could have come in between that and cleared
 * folio private, but left the page in the mapping. Set the page mapped
 * here to make sure it's properly set for the subpage stuff.
 */
ret = set_folio_extent_mapped(folio);
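For reference, the recreate path looks roughly like the following simplified
sketch of set_folio_extent_mapped() (based on fs/btrfs/extent_io.c; the
attach helper follows the current btrfs_attach_folio_state() naming and may
differ between kernel versions):

	int set_folio_extent_mapped(struct folio *folio)
	{
		struct btrfs_fs_info *fs_info;

		/* private state is still attached, nothing to recreate */
		if (folio_test_private(folio))
			return 0;

		fs_info = folio_to_fs_info(folio);

		/* subpage mode: reallocate the per-folio bitmap state */
		if (btrfs_is_subpage(fs_info, folio))
			return btrfs_attach_folio_state(fs_info, folio,
							BTRFS_SUBPAGE_DATA);

		/* regular mode: attach only the sentinel marker value */
		folio_attach_private(folio, (void *)EXTENT_FOLIO_PRIVATE);
		return 0;
	}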
It's worth noting in advance that the protections currently in place, and
also the points at which btrfs invokes filemap_invalidate_inode(), may be
sufficient. The purpose of this patch, though, is to ensure that the btrfs
private subpage metadata lives as long as its folio, which may help avoid
the loss of subpage metadata and improve maintainability (by preventing any
possible use-after-free in present/future code). Currently the private data
is freed in the btrfs-specific aops callback release_folio(), but this
proposed change instead defers the freeing until the aops free_folio()
callback.
The patch also might have the advantage of being easy to backport to the
LTS trees. On that note, it's worth mentioning that we encountered a kernel
panic as a result of this sequence on a 6.16-based arm64 host (configured
with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
window is shown below between points A and B:
[mm] page cache reclaim path             [fs] relocation in subpage mode
shrink_folio_list()
  folio_trylock() /* lock acquired */
  filemap_release_folio()
    mapping->a_ops->release_folio()
      btrfs_release_folio()
        __btrfs_release_folio()
          clear_folio_extent_mapped()
            btrfs_detach_folio_state()
              bfs = folio_detach_private(folio)
              btrfs_free_folio_state(bfs)
                kfree(bfs) /* point A */

                                         prealloc_file_extent_cluster()
                                           filemap_lock_folio()
                                             folio_try_get() /* inc refcount */
                                             folio_lock() /* wait for lock */

  __remove_mapping()
    if (!folio_ref_freeze(folio, refcount)) /* point B */
      goto cannot_free /* folio remains in cache */

  folio_unlock(folio) /* lock released */

                                             /* lock acquired */
                                         btrfs_subpage_clear_uptodate()
                                           bfs = folio->priv /* use-after-free */
This exact race during relocation should not occur in the latest upstream
code, but it's an example of a backport opportunity for this patch.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
fs/btrfs/extent_io.c | 6 ++++--
fs/btrfs/inode.c | 18 ++++++++++++++++++
2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3df399dc8856..d83d3f9ae3af 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -928,8 +928,10 @@ void clear_folio_extent_mapped(struct folio *folio)
 		return;
 
 	fs_info = folio_to_fs_info(folio);
-	if (btrfs_is_subpage(fs_info, folio))
-		return btrfs_detach_folio_state(fs_info, folio, BTRFS_SUBPAGE_DATA);
+	if (btrfs_is_subpage(fs_info, folio)) {
+		/* freeing of private subpage data is deferred to btrfs_free_folio */
+		return;
+	}
 
 	folio_detach_private(folio);
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b8abfe7439a3..7a832ee3b591 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7565,6 +7565,23 @@ static bool btrfs_release_folio(struct folio *folio, gfp_t gfp_flags)
 	return __btrfs_release_folio(folio, gfp_flags);
 }
 
+/* frees subpage private data if present */
+static void btrfs_free_folio(struct folio *folio)
+{
+	struct btrfs_folio_state *bfs;
+
+	if (!folio_test_private(folio))
+		return;
+
+	bfs = folio_detach_private(folio);
+	if (bfs == (void *)EXTENT_FOLIO_PRIVATE) {
+		/* extent map flag is detached in btrfs_release_folio */
+		return;
+	}
+
+	btrfs_free_folio_state(bfs);
+}
+
 #ifdef CONFIG_MIGRATION
 static int btrfs_migrate_folio(struct address_space *mapping,
 			       struct folio *dst, struct folio *src,
@@ -10651,6 +10668,7 @@ static const struct address_space_operations btrfs_aops = {
 	.invalidate_folio = btrfs_invalidate_folio,
 	.launder_folio = btrfs_launder_folio,
 	.release_folio = btrfs_release_folio,
+	.free_folio = btrfs_free_folio,
 	.migrate_folio = btrfs_migrate_folio,
 	.dirty_folio = filemap_dirty_folio,
 	.error_remove_folio = generic_error_remove_folio,
--
2.47.3
On 2026/1/30 09:38, JP Kobryn wrote:
[...]
> The patch also might have the advantage of being easy to backport to the
> LTS trees. On that note, it's worth mentioning that we encountered a kernel
> panic as a result of this sequence on a 6.16-based arm64 host (configured
> with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
> window is shown below between points A and B:
>
> [mm] page cache reclaim path             [fs] relocation in subpage mode
> shrink_folio_list()
>   folio_trylock() /* lock acquired */
>   filemap_release_folio()
>     mapping->a_ops->release_folio()
>       btrfs_release_folio()
>         __btrfs_release_folio()
>           clear_folio_extent_mapped()
>             btrfs_detach_folio_state()
>               bfs = folio_detach_private(folio)
>               btrfs_free_folio_state(bfs)
>                 kfree(bfs) /* point A */
>
>                                          prealloc_file_extent_cluster()
>                                            filemap_lock_folio()
Mind to explain which function is calling filemap_lock_folio()?
I guess it's filemap_invalidate_inode() -> filemap_fdatawrite_range() ->
filemap_writeback() -> btrfs_writepages() -> extent_write_cache_pages().
>                                              folio_try_get() /* inc refcount */
>                                              folio_lock() /* wait for lock */
Another question here is, since the folio is already released in the mm
path, the folio should not have dirty flag set.
That means inside extent_write_cache_pages(), the folio_test_dirty()
should return false, and we should just unlock the folio without
touching it anymore.
Mind to explain why we still continue the writeback of a non-dirty folio?
>
>   __remove_mapping()
>     if (!folio_ref_freeze(folio, refcount)) /* point B */
>       goto cannot_free /* folio remains in cache */
>
>   folio_unlock(folio) /* lock released */
>
>                                              /* lock acquired */
>                                          btrfs_subpage_clear_uptodate()
Mind to provide more context of where the btrfs_subpage_clear_uptodate()
call is from?
>                                            bfs = folio->priv /* use-after-free */
>
> This exact race during relocation should not occur in the latest upstream
> code, but it's an example of a backport opportunity for this patch.
And mind to explain what is missing in 6.16 kernel that causes the above
use-after-free?
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
> fs/btrfs/extent_io.c | 6 ++++--
> fs/btrfs/inode.c | 18 ++++++++++++++++++
> 2 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 3df399dc8856..d83d3f9ae3af 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -928,8 +928,10 @@ void clear_folio_extent_mapped(struct folio *folio)
>  		return;
>  
>  	fs_info = folio_to_fs_info(folio);
> -	if (btrfs_is_subpage(fs_info, folio))
> -		return btrfs_detach_folio_state(fs_info, folio, BTRFS_SUBPAGE_DATA);
> +	if (btrfs_is_subpage(fs_info, folio)) {
> +		/* freeing of private subpage data is deferred to btrfs_free_folio */
> +		return;
> +	}
Another question is, why only two fses (nfs for dir inode, and orangefs)
are utilizing the free_folio() callback.
Iomap is doing the same as btrfs and only calls ifs_free() in
release_folio() and invalidate_folio().
Thus it looks like the free_folio() callback is not the recommended way
to free the folio->private pointer.
Cc fsdevel list on whether the free_folio() callback should have new
callers.
>
>  	folio_detach_private(folio);
This means for regular folio cases, we still remove the private flag of
such folios.
It may be fine for most cases as we will not touch folio->private
anyway, but this still looks like inconsistent behavior, especially
since the free_folio() callback has handling for both cases.
Thanks,
Qu
On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>
>
> On 2026/1/30 09:38, JP Kobryn wrote:
> [...]
> > The patch also might have the advantage of being easy to backport to the
> > LTS trees. On that note, it's worth mentioning that we encountered a kernel
> > panic as a result of this sequence on a 6.16-based arm64 host (configured
> > with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
> > window is shown below between points A and B:
> >
> > [mm] page cache reclaim path             [fs] relocation in subpage mode
> > shrink_folio_list()
> >   folio_trylock() /* lock acquired */
> >   filemap_release_folio()
> >     mapping->a_ops->release_folio()
> >       btrfs_release_folio()
> >         __btrfs_release_folio()
> >           clear_folio_extent_mapped()
> >             btrfs_detach_folio_state()
> >               bfs = folio_detach_private(folio)
> >               btrfs_free_folio_state(bfs)
> >                 kfree(bfs) /* point A */
> >
> >                                          prealloc_file_extent_cluster()
> >                                            filemap_lock_folio()
>
> Mind to explain which function is calling filemap_lock_folio()?
>
> I guess it's filemap_invalidate_inode() -> filemap_fdatawrite_range() ->
> filemap_writeback() -> btrfs_writepages() -> extent_write_cache_pages().
>
I think you may have missed it in the diagram, and some of the function
names may have shifted a bit between kernels, but it is relocation.
On current btrfs/for-next, I think it would be:
relocate_file_extent_cluster()
  relocate_one_folio()
    filemap_lock_folio()
> >                                              folio_try_get() /* inc refcount */
> >                                              folio_lock() /* wait for lock */
>
>
> Another question here is, since the folio is already released in the mm
> path, the folio should not have dirty flag set.
>
> That means inside extent_write_cache_pages(), the folio_test_dirty() should
> return false, and we should just unlock the folio without touching it
> anymore.
>
> Mind to explain why we still continue the writeback of a non-dirty folio?
>
I think this question is answered by the above as well: we aren't in
writeback, we are in relocation.
Thanks,
Boris
On 2026/1/30 17:04, Boris Burkov wrote:
> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>>
>>
>> On 2026/1/30 09:38, JP Kobryn wrote:
>> [...]
>>> The patch also might have the advantage of being easy to backport to the
>>> LTS trees. On that note, it's worth mentioning that we encountered a kernel
>>> panic as a result of this sequence on a 6.16-based arm64 host (configured
>>> with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
>>> window is shown below between points A and B:
>>>
>>> [mm] page cache reclaim path             [fs] relocation in subpage mode
>>> shrink_folio_list()
>>>   folio_trylock() /* lock acquired */
>>>   filemap_release_folio()
>>>     mapping->a_ops->release_folio()
>>>       btrfs_release_folio()
>>>         __btrfs_release_folio()
>>>           clear_folio_extent_mapped()
>>>             btrfs_detach_folio_state()
>>>               bfs = folio_detach_private(folio)
>>>               btrfs_free_folio_state(bfs)
>>>                 kfree(bfs) /* point A */
>>>
>>>                                          prealloc_file_extent_cluster()
>>>                                            filemap_lock_folio()
>>
>> Mind to explain which function is calling filemap_lock_folio()?
>>
>> I guess it's filemap_invalidate_inode() -> filemap_fdatawrite_range() ->
>> filemap_writeback() -> btrfs_writepages() -> extent_write_cache_pages().
>>
>
> I think you may have missed it in the diagram, and some of the function
> names may have shifted a bit between kernels, but it is relocation.
>
> On current btrfs/for-next, I think it would be:
>
> relocate_file_extent_cluster()
>   relocate_one_folio()
>     filemap_lock_folio()
Thanks, indeed the filemap_lock_folio() inside
prealloc_file_extent_cluster() only exists in the v6.16 code base, which
does partial folio invalidation manually.
That code is no longer there, and has been replaced with a much healthier
solution.
>
>>>                                              folio_try_get() /* inc refcount */
>>>                                              folio_lock() /* wait for lock */
>>
>>
>> Another question here is, since the folio is already released in the mm
>> path, the folio should not have dirty flag set.
>>
>> That means inside extent_write_cache_pages(), the folio_test_dirty() should
>> return false, and we should just unlock the folio without touching it
>> anymore.
>>
>> Mind to explain why we still continue the writeback of a non-dirty folio?
>>
>
> I think this question is answered by the above as well: we aren't in
> writeback, we are in relocation.
I see the problem now. And thankfully commit 4e346baee95f ("btrfs:
reloc: unconditionally invalidate the page cache for each cluster")
fixes the behavior.
And yes, the old code can indeed hit the problem.
But still, commit 4e346baee95f ("btrfs: reloc: unconditionally
invalidate the page cache for each cluster") itself shouldn't be that
hard to backport.
Thanks,
Qu
On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
> Another question is, why only two fses (nfs for dir inode, and orangefs) are
> utilizing the free_folio() callback.

Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
not a fan of this interface existing, and would prefer to not introduce
new users. Like launder_folio, which btrfs has also mistakenly used.
On 1/29/26 9:14 PM, Matthew Wilcox wrote:
> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>> Another question is, why only two fses (nfs for dir inode, and orangefs) are
>> utilizing the free_folio() callback.
>
> Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
> not a fan of this interface existing, and would prefer to not introduce
> new users. Like launder_folio, which btrfs has also mistakenly used.
>

The part that felt concerning is how the private state is lost. If
release_folio() frees this state but the folio persists in the cache,
users of the folio afterward have to recreate the state. Is that the
expectation on how filesystems should handle this situation?

In the case of the existing btrfs code, when the state is recreated (in
subpage mode), the bitmap data and lock states are all zeroed.
On 2026/1/31 03:40, JP Kobryn wrote:
> On 1/29/26 9:14 PM, Matthew Wilcox wrote:
>> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>>> Another question is, why only two fses (nfs for dir inode, and
>>> orangefs) are
>>> utilizing the free_folio() callback.
>>
>> Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
>> not a fan of this interface existing, and would prefer to not introduce
>> new users. Like launder_folio, which btrfs has also mistakenly used.
>>
>
> The part that felt concerning is how the private state is lost. If
> release_folio() frees this state but the folio persists in the cache,
> users of the folio afterward have to recreate the state. Is that the
> expectation on how filesystems should handle this situation?

I believe that's the case.

Just like what we did in btrfs_do_readpage() and prepare_one_folio().

There is no difference between getting a new page and a page that is
released but not removed from the filemap.

>
> In the case of the existing btrfs code, when the state is recreated (in
> subpage mode), the bitmap data and lock states are all zeroed.

That's expected.

Thanks,
Qu
On 1/30/26 12:36 PM, Qu Wenruo wrote:
>
>
> On 2026/1/31 03:40, JP Kobryn wrote:
>> On 1/29/26 9:14 PM, Matthew Wilcox wrote:
>>> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>>>> Another question is, why only two fses (nfs for dir inode, and
>>>> orangefs) are
>>>> utilizing the free_folio() callback.
>>>
>>> Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
>>> not a fan of this interface existing, and would prefer to not introduce
>>> new users. Like launder_folio, which btrfs has also mistakenly used.
>>>
>>
>> The part that felt concerning is how the private state is lost. If
>> release_folio() frees this state but the folio persists in the cache,
>> users of the folio afterward have to recreate the state. Is that the
>> expectation on how filesystems should handle this situation?
>
> I believe that's the case.
>
> Just like what we did in btrfs_do_readpage() and prepare_one_folio().
>
> There is no difference between getting a new page and a page that is
> released but not removed from the filemap.
>
>>
>> In the case of the existing btrfs code, when the state is recreated (in
>> subpage mode), the bitmap data and lock states are all zeroed.
>
> That's expected.
>

Thanks all for the feedback. I get it now that we should treat it like
a fresh folio where applicable. With that said, I may have found a path
where unguarded access to the private field is happening. I'll send a
patch shortly and you can let me know your thoughts.
On Fri, Jan 30, 2026 at 09:10:11AM -0800, JP Kobryn wrote:
> On 1/29/26 9:14 PM, Matthew Wilcox wrote:
> > On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
> > > Another question is, why only two fses (nfs for dir inode, and orangefs) are
> > > utilizing the free_folio() callback.
> >
> > Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
> > not a fan of this interface existing, and would prefer to not introduce
> > new users. Like launder_folio, which btrfs has also mistakenly used.
>
> The part that felt concerning is how the private state is lost. If
> release_folio() frees this state but the folio persists in the cache,
> users of the folio afterward have to recreate the state. Is that the
> expectation on how filesystems should handle this situation?

Yes; that's how iomap and buffer_head both handle it. If an operation
happens that needs sub-folio tracking, allocate the per-folio state
then continue as before.

> In the case of the existing btrfs code, when the state is recreated (in
> subpage mode), the bitmap data and lock states are all zeroed.

If the folio is uptodate, iomap will initialise the uptodate bitmap to
all-set rather than all-clear. The dirty bitmap will similarly reflect
whether the folio dirty bit is set or clear. Obviously we lose some
precision there (the folio may have been only partially dirty, or some
of the blocks in it may already have been uptodate), but that's not
likely to cause any performance problems.

When ->release_folio is called, we're expecting to evict the folio from
the page cache ... we just failed to do so, so it's reasonable to treat
it as a freshly allocated folio.
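For reference, a simplified sketch of the iomap behaviour described above,
modeled on ifs_alloc() in fs/iomap/buffered-io.c (the gfp handling is
condensed here, and details vary by kernel version):

	static struct iomap_folio_state *ifs_alloc(struct inode *inode,
			struct folio *folio, unsigned int flags)
	{
		struct iomap_folio_state *ifs = folio->private;
		unsigned int nr_blocks = i_blocks_per_folio(inode, folio);

		if (ifs || nr_blocks <= 1)
			return ifs;

		/* one bitmap: first half tracks uptodate, second half dirty */
		ifs = kzalloc(struct_size(ifs, state, BITS_TO_LONGS(2 * nr_blocks)),
			      GFP_NOFS | __GFP_NOFAIL);
		spin_lock_init(&ifs->state_lock);

		/* seed the sub-folio state from the folio-level flags */
		if (folio_test_uptodate(folio))
			bitmap_set(ifs->state, 0, nr_blocks);
		if (folio_test_dirty(folio))
			bitmap_set(ifs->state, nr_blocks, nr_blocks);

		folio_attach_private(folio, ifs);
		return ifs;
	}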
On Fri, Jan 30, 2026 at 05:14:26AM +0000, Matthew Wilcox wrote:
> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
> > Another question is, why only two fses (nfs for dir inode, and orangefs) are
> > utilizing the free_folio() callback.
>
> Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
> not a fan of this interface existing, and would prefer to not introduce

From the VFS documentation:

"free_folio is called once the folio is no longer visible in the page
cache in order to allow the cleanup of any private data."

That sounds like the perfect solution to exactly the issue JP observed.
So if it is not intended to be used in this way, I think it would be
beneficial to more accurately document the proper (never?) use of
free_folio() to avoid this pitfall.

I will also point out that the usage in orangefs closely matches this
one in spirit: kfree-ing some longer lived object tracking ranges in
the folio.

As for this patch, do you have any suggestion for how to handle the
problem of freeing allocated memory stored in folio->private? To me,
it feels quite wrong to do it in release_folio() if that can end with
a folio that can still be found in the mapping.

Should we have all users that find and lock the folio detect whether
or not private is set up appropriately and re-allocate it if not? Or
should we look up and use the private field only from VFS-triggered
paths (rather than btrfs's internal relocation stuff here)?

> new users. Like launder_folio, which btrfs has also mistakenly used.

For context:
https://lore.kernel.org/linux-btrfs/070b1d025ef6eb292638bb97683cd5c35ffe42eb.1721775142.git.boris@bur.io/

IIRC, the issue was that we call invalidate_inode_pages2 on outstanding
dirty folios while cleaning up a failed transaction, which calls
launder_folio() and release_folio(). At this point, btrfs does not have
private set on the folio, so release_folio() is a no-op. So I used
launder_folio() to clean up otherwise leaked state that was 1:1 with
dirty folios, so it also felt like a "natural" fit.

I apologize for this undesirable usage, but I was genuinely trying to
do the right thing and take advantage of an abstract interface that
very cleanly modeled my problem. At the time, I interpreted the
existence of an interface that neatly fit my problem to be a credit to
the design of VFS. In the future, I will make sure to cc fsdevel before
adding any new usage of the VFS.

At a high level, would it be reasonable to mark the page private when
it is dirtied and the qgroup reservation is made, so that we do go into
release_folio() via invalidate_inode_pages2? I can look into that.

Ultimately, cleaning up qgroup reservation accounting while aborting a
transaction is hardly the most important thing in the world. It's just
something that looks messy while cleaning up the filesystem. So if we
want to get rid of launder_folio() even if I can't figure out how to
make it use release_folio(), all we really need is for someone to ask
nicely and I can probably come up with a hack. :)

Thanks,
Boris
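A purely hypothetical sketch of the "mark the folio private when it is
dirtied" idea above (not part of the posted patch; the .dirty_folio wiring
and the reuse of set_folio_extent_mapped() are assumptions):

	/* hypothetical replacement for .dirty_folio = filemap_dirty_folio */
	static bool btrfs_dirty_folio(struct address_space *mapping,
				      struct folio *folio)
	{
		/*
		 * Re-attach folio->private if release_folio() has already
		 * dropped it; if the allocation fails, the folio is simply
		 * dirtied without private state, as before.
		 */
		set_folio_extent_mapped(folio);

		return filemap_dirty_folio(mapping, folio);
	}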