During reclaim, file-mapped folio callbacks can be invoked through the
address_space_operations "aops" interface. In one specific case involving
btrfs, the release_folio() callback is invoked and the btrfs-specific private
data is freed. Afterward, continuing in the reclaim path, the folio is
freed from the page cache if its refcount has reached zero. Because there
is a window between the freeing of the private data and the refcount check,
it is possible for another task to increment the refcount in parallel, in
which case the folio remains in the page cache with NULL private data. That
other task then acquires the folio and must take precautions when accessing
the private field.
There is existing code that is aware of this. In some of these places, a
NULL check is performed on the private field and, if it is absent, the
private data gets recreated. Surrounding comments refer to the race of
freeing the private data. For example:
/*
 * We unlock the page after the io is completed and then re-lock it
 * above. release_folio() could have come in between that and cleared
 * folio private, but left the page in the mapping. Set the page mapped
 * here to make sure it's properly set for the subpage stuff.
 */
ret = set_folio_extent_mapped(folio);
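For reference, the recreate path looks roughly like the following simplified
sketch of set_folio_extent_mapped() (based on fs/btrfs/extent_io.c; the
attach helper follows the current btrfs_attach_folio_state() naming and may
differ between kernel versions):

	int set_folio_extent_mapped(struct folio *folio)
	{
		struct btrfs_fs_info *fs_info;

		/* private state is still attached, nothing to recreate */
		if (folio_test_private(folio))
			return 0;

		fs_info = folio_to_fs_info(folio);

		/* subpage mode: reallocate the per-folio bitmap state */
		if (btrfs_is_subpage(fs_info, folio))
			return btrfs_attach_folio_state(fs_info, folio,
							BTRFS_SUBPAGE_DATA);

		/* regular mode: attach only the sentinel marker value */
		folio_attach_private(folio, (void *)EXTENT_FOLIO_PRIVATE);
		return 0;
	}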
It's worth noting in advance that the protections currently in place, and
also the points at which btrfs invokes filemap_invalidate_inode(), may be
sufficient. The purpose of this patch, though, is to ensure that the btrfs
private subpage metadata lives as long as its folio, which may help avoid
the loss of subpage metadata and improve maintainability (by preventing any
possible use-after-free in present/future code). Currently the private data
is freed in the btrfs-specific aops callback release_folio(), but this
proposed change instead defers the freeing until the aops free_folio()
callback.
The patch also might have the advantage of being easy to backport to the
LTS trees. On that note, it's worth mentioning that we encountered a kernel
panic as a result of this sequence on a 6.16-based arm64 host (configured
with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
window is shown below between points A and B:
[mm] page cache reclaim path             [fs] relocation in subpage mode
shrink_folio_list()
  folio_trylock() /* lock acquired */
  filemap_release_folio()
    mapping->a_ops->release_folio()
      btrfs_release_folio()
        __btrfs_release_folio()
          clear_folio_extent_mapped()
            btrfs_detach_folio_state()
              bfs = folio_detach_private(folio)
              btrfs_free_folio_state(bfs)
                kfree(bfs) /* point A */

                                         prealloc_file_extent_cluster()
                                           filemap_lock_folio()
                                             folio_try_get() /* inc refcount */
                                             folio_lock() /* wait for lock */

  __remove_mapping()
    if (!folio_ref_freeze(folio, refcount)) /* point B */
      goto cannot_free /* folio remains in cache */

  folio_unlock(folio) /* lock released */

                                             /* lock acquired */
                                         btrfs_subpage_clear_uptodate()
                                           bfs = folio->priv /* use-after-free */
This exact race during relocation should not occur in the latest upstream
code, but it's an example of a backport opportunity for this patch.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
fs/btrfs/extent_io.c | 6 ++++--
fs/btrfs/inode.c | 18 ++++++++++++++++++
2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3df399dc8856..d83d3f9ae3af 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -928,8 +928,10 @@ void clear_folio_extent_mapped(struct folio *folio)
 		return;
 
 	fs_info = folio_to_fs_info(folio);
-	if (btrfs_is_subpage(fs_info, folio))
-		return btrfs_detach_folio_state(fs_info, folio, BTRFS_SUBPAGE_DATA);
+	if (btrfs_is_subpage(fs_info, folio)) {
+		/* freeing of private subpage data is deferred to btrfs_free_folio */
+		return;
+	}
 
 	folio_detach_private(folio);
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b8abfe7439a3..7a832ee3b591 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7565,6 +7565,23 @@ static bool btrfs_release_folio(struct folio *folio, gfp_t gfp_flags)
 	return __btrfs_release_folio(folio, gfp_flags);
 }
 
+/* frees subpage private data if present */
+static void btrfs_free_folio(struct folio *folio)
+{
+	struct btrfs_folio_state *bfs;
+
+	if (!folio_test_private(folio))
+		return;
+
+	bfs = folio_detach_private(folio);
+	if (bfs == (void *)EXTENT_FOLIO_PRIVATE) {
+		/* extent map flag is detached in btrfs_release_folio */
+		return;
+	}
+
+	btrfs_free_folio_state(bfs);
+}
+
 #ifdef CONFIG_MIGRATION
 static int btrfs_migrate_folio(struct address_space *mapping,
 			       struct folio *dst, struct folio *src,
@@ -10651,6 +10668,7 @@ static const struct address_space_operations btrfs_aops = {
 	.invalidate_folio = btrfs_invalidate_folio,
 	.launder_folio = btrfs_launder_folio,
 	.release_folio = btrfs_release_folio,
+	.free_folio = btrfs_free_folio,
 	.migrate_folio = btrfs_migrate_folio,
 	.dirty_folio = filemap_dirty_folio,
 	.error_remove_folio = generic_error_remove_folio,
--
2.47.3
On 2026/1/30 09:38, JP Kobryn wrote:
[...]
> The patch also might have the advantage of being easy to backport to the
> LTS trees. On that note, it's worth mentioning that we encountered a kernel
> panic as a result of this sequence on a 6.16-based arm64 host (configured
> with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
> window is shown below between points A and B:
>
> [mm] page cache reclaim path             [fs] relocation in subpage mode
> shrink_folio_list()
>   folio_trylock() /* lock acquired */
>   filemap_release_folio()
>     mapping->a_ops->release_folio()
>       btrfs_release_folio()
>         __btrfs_release_folio()
>           clear_folio_extent_mapped()
>             btrfs_detach_folio_state()
>               bfs = folio_detach_private(folio)
>               btrfs_free_folio_state(bfs)
>                 kfree(bfs) /* point A */
>
>                                          prealloc_file_extent_cluster()
>                                            filemap_lock_folio()
Mind to explain which function is calling filemap_lock_folio()?
I guess it's filemap_invalidate_inode() -> filemap_fdatawrite_range() ->
filemap_writeback() -> btrfs_writepages() -> extent_write_cache_pages().
>                                              folio_try_get() /* inc refcount */
>                                              folio_lock() /* wait for lock */
Another question here is, since the folio is already released in the mm
path, the folio should not have dirty flag set.
That means inside extent_write_cache_pages(), the folio_test_dirty()
should return false, and we should just unlock the folio without
touching it anymore.
Mind to explain why we still continue the writeback of a non-dirty folio?
>
>   __remove_mapping()
>     if (!folio_ref_freeze(folio, refcount)) /* point B */
>       goto cannot_free /* folio remains in cache */
>
>   folio_unlock(folio) /* lock released */
>
>                                              /* lock acquired */
>                                          btrfs_subpage_clear_uptodate()
Mind to provide more context of where the btrfs_subpage_clear_uptodate()
call is from?
>                                            bfs = folio->priv /* use-after-free */
>
> This exact race during relocation should not occur in the latest upstream
> code, but it's an example of a backport opportunity for this patch.
And mind to explain what is missing in 6.16 kernel that causes the above
use-after-free?
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
> fs/btrfs/extent_io.c | 6 ++++--
> fs/btrfs/inode.c | 18 ++++++++++++++++++
> 2 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 3df399dc8856..d83d3f9ae3af 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -928,8 +928,10 @@ void clear_folio_extent_mapped(struct folio *folio)
>  		return;
>  
>  	fs_info = folio_to_fs_info(folio);
> -	if (btrfs_is_subpage(fs_info, folio))
> -		return btrfs_detach_folio_state(fs_info, folio, BTRFS_SUBPAGE_DATA);
> +	if (btrfs_is_subpage(fs_info, folio)) {
> +		/* freeing of private subpage data is deferred to btrfs_free_folio */
> +		return;
> +	}
Another question is, why only two fses (nfs for dir inode, and orangefs)
are utilizing the free_folio() callback.
Iomap is doing the same as btrfs and only calls ifs_free() in
release_folio() and invalidate_folio().
Thus it looks like the free_folio() callback is not the recommended way
to free the folio->private pointer.
Cc fsdevel list on whether the free_folio() callback should have new
callers.
>
>  	folio_detach_private(folio);
This means for regular folio cases, we still remove the private flag of
such folios.
It may be fine for most cases as we will not touch folio->private
anyway, but this still looks like inconsistent behavior, especially
since the free_folio() callback has handling for both cases.
Thanks,
Qu
On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>
>
> On 2026/1/30 09:38, JP Kobryn wrote:
> [...]
> > The patch also might have the advantage of being easy to backport to the
> > LTS trees. On that note, it's worth mentioning that we encountered a kernel
> > panic as a result of this sequence on a 6.16-based arm64 host (configured
> > with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
> > window is shown below between points A and B:
> >
> > [mm] page cache reclaim path             [fs] relocation in subpage mode
> > shrink_folio_list()
> >   folio_trylock() /* lock acquired */
> >   filemap_release_folio()
> >     mapping->a_ops->release_folio()
> >       btrfs_release_folio()
> >         __btrfs_release_folio()
> >           clear_folio_extent_mapped()
> >             btrfs_detach_folio_state()
> >               bfs = folio_detach_private(folio)
> >               btrfs_free_folio_state(bfs)
> >                 kfree(bfs) /* point A */
> >
> >                                          prealloc_file_extent_cluster()
> >                                            filemap_lock_folio()
>
> Mind to explain which function is calling filemap_lock_folio()?
>
> I guess it's filemap_invalidate_inode() -> filemap_fdatawrite_range() ->
> filemap_writeback() -> btrfs_writepages() -> extent_write_cache_pages().
>
I think you may have missed it in the diagram, and some of the function
names may have shifted a bit between kernels, but it is relocation.
On current btrfs/for-next, I think it would be:
relocate_file_extent_cluster()
  relocate_one_folio()
    filemap_lock_folio()
> >                                              folio_try_get() /* inc refcount */
> >                                              folio_lock() /* wait for lock */
>
>
> Another question here is, since the folio is already released in the mm
> path, the folio should not have dirty flag set.
>
> That means inside extent_write_cache_pages(), the folio_test_dirty() should
> return false, and we should just unlock the folio without touching it
> anymore.
>
> Mind to explain why we still continue the writeback of a non-dirty folio?
>
I think this question is answered by the above as well: we aren't in
writeback, we are in relocation.
Thanks,
Boris
On 2026/1/30 17:04, Boris Burkov wrote:
> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>>
>>
>> On 2026/1/30 09:38, JP Kobryn wrote:
>> [...]
>>> The patch also might have the advantage of being easy to backport to the
>>> LTS trees. On that note, it's worth mentioning that we encountered a kernel
>>> panic as a result of this sequence on a 6.16-based arm64 host (configured
>>> with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
>>> window is shown below between points A and B:
>>>
>>> [mm] page cache reclaim path             [fs] relocation in subpage mode
>>> shrink_folio_list()
>>>   folio_trylock() /* lock acquired */
>>>   filemap_release_folio()
>>>     mapping->a_ops->release_folio()
>>>       btrfs_release_folio()
>>>         __btrfs_release_folio()
>>>           clear_folio_extent_mapped()
>>>             btrfs_detach_folio_state()
>>>               bfs = folio_detach_private(folio)
>>>               btrfs_free_folio_state(bfs)
>>>                 kfree(bfs) /* point A */
>>>
>>>                                          prealloc_file_extent_cluster()
>>>                                            filemap_lock_folio()
>>
>> Mind to explain which function is calling filemap_lock_folio()?
>>
>> I guess it's filemap_invalidate_inode() -> filemap_fdatawrite_range() ->
>> filemap_writeback() -> btrfs_writepages() -> extent_write_cache_pages().
>>
>
> I think you may have missed it in the diagram, and some of the function
> names may have shifted a bit between kernels, but it is relocation.
>
> On current btrfs/for-next, I think it would be:
>
> relocate_file_extent_cluster()
>   relocate_one_folio()
>     filemap_lock_folio()
Thanks, indeed the filemap_lock_folio() inside
prealloc_file_extent_cluster() only exists in the v6.16 code base, which
does partial folio invalidation manually.
That code is no longer there, and has been replaced with a much healthier
solution.
>
>>>                                              folio_try_get() /* inc refcount */
>>>                                              folio_lock() /* wait for lock */
>>
>>
>> Another question here is, since the folio is already released in the mm
>> path, the folio should not have dirty flag set.
>>
>> That means inside extent_write_cache_pages(), the folio_test_dirty() should
>> return false, and we should just unlock the folio without touching it
>> anymore.
>>
>> Mind to explain why we still continue the writeback of a non-dirty folio?
>>
>
> I think this question is answered by the above as well: we aren't in
> writeback, we are in relocation.
I see the problem now. And thankfully commit 4e346baee95f ("btrfs:
reloc: unconditionally invalidate the page cache for each cluster")
fixes the behavior.
And yes, the old code can indeed hit the problem.
But still, commit 4e346baee95f ("btrfs: reloc: unconditionally
invalidate the page cache for each cluster") itself shouldn't be that
hard to backport.
Thanks,
Qu
On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
> Another question is, why only two fses (nfs for dir inode, and orangefs) are
> utilizing the free_folio() callback.

Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
not a fan of this interface existing, and would prefer to not introduce
new users. Like launder_folio, which btrfs has also mistakenly used.
On 1/29/26 9:14 PM, Matthew Wilcox wrote:
> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>> Another question is, why only two fses (nfs for dir inode, and orangefs) are
>> utilizing the free_folio() callback.
>
> Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
> not a fan of this interface existing, and would prefer to not introduce
> new users. Like launder_folio, which btrfs has also mistakenly used.
>

The part that felt concerning is how the private state is lost. If
release_folio() frees this state but the folio persists in the cache,
users of the folio afterward have to recreate the state. Is that the
expectation on how filesystems should handle this situation?

In the case of the existing btrfs code, when the state is recreated (in
subpage mode), the bitmap data and lock states are all zeroed.
On 2026/1/31 03:40, JP Kobryn wrote:
> On 1/29/26 9:14 PM, Matthew Wilcox wrote:
>> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>>> Another question is, why only two fses (nfs for dir inode, and
>>> orangefs) are
>>> utilizing the free_folio() callback.
>>
>> Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
>> not a fan of this interface existing, and would prefer to not introduce
>> new users. Like launder_folio, which btrfs has also mistakenly used.
>>
>
> The part that felt concerning is how the private state is lost. If
> release_folio() frees this state but the folio persists in the cache,
> users of the folio afterward have to recreate the state. Is that the
> expectation on how filesystems should handle this situation?

I believe that's the case.

Just like what we did in btrfs_do_readpage() and prepare_one_folio().

There is no difference between getting a new page and a page that is
released but not removed from the filemap.

>
> In the case of the existing btrfs code, when the state is recreated (in
> subpage mode), the bitmap data and lock states are all zeroed.

That's expected.

Thanks,
Qu
On 1/30/26 12:36 PM, Qu Wenruo wrote:
>
>
> On 2026/1/31 03:40, JP Kobryn wrote:
>> On 1/29/26 9:14 PM, Matthew Wilcox wrote:
>>> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>>>> Another question is, why only two fses (nfs for dir inode, and
>>>> orangefs) are
>>>> utilizing the free_folio() callback.
>>>
>>> Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
>>> not a fan of this interface existing, and would prefer to not introduce
>>> new users. Like launder_folio, which btrfs has also mistakenly used.
>>>
>>
>> The part that felt concerning is how the private state is lost. If
>> release_folio() frees this state but the folio persists in the cache,
>> users of the folio afterward have to recreate the state. Is that the
>> expectation on how filesystems should handle this situation?
>
> I believe that's the case.
>
> Just like what we did in btrfs_do_readpage() and prepare_one_folio().
>
> There is no difference between getting a new page and a page that is
> released but not removed from the filemap.
>
>>
>> In the case of the existing btrfs code, when the state is recreated (in
>> subpage mode), the bitmap data and lock states are all zeroed.
>
> That's expected.
>

Thanks all for the feedback. I get it now that we should treat it like
a fresh folio where applicable. With that said, I may have found a path
where unguarded access to the private field is happening. I'll send a
patch shortly and you can let me know your thoughts.
On Fri, Jan 30, 2026 at 09:10:11AM -0800, JP Kobryn wrote:
> On 1/29/26 9:14 PM, Matthew Wilcox wrote:
> > On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
> > > Another question is, why only two fses (nfs for dir inode, and orangefs) are
> > > utilizing the free_folio() callback.
> >
> > Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
> > not a fan of this interface existing, and would prefer to not introduce
> > new users. Like launder_folio, which btrfs has also mistakenly used.
>
> The part that felt concerning is how the private state is lost. If
> release_folio() frees this state but the folio persists in the cache,
> users of the folio afterward have to recreate the state. Is that the
> expectation on how filesystems should handle this situation?

Yes; that's how iomap and buffer_head both handle it. If an operation
happens that needs sub-folio tracking, allocate the per-folio state
then continue as before.

> In the case of the existing btrfs code, when the state is recreated (in
> subpage mode), the bitmap data and lock states are all zeroed.

If the folio is uptodate, iomap will initialise the uptodate bitmap to
all-set rather than all-clear. The dirty bitmap will similarly reflect
whether the folio dirty bit is set or clear. Obviously we lose some
precision there (the folio may have been only partially dirty, or some
of the blocks in it may already have been uptodate), but that's not
likely to cause any performance problems.

When ->release_folio is called, we're expecting to evict the folio from
the page cache ... we just failed to do so, so it's reasonable to treat
it as a freshly allocated folio.
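For reference, a simplified sketch of the iomap behaviour described above,
modeled on ifs_alloc() in fs/iomap/buffered-io.c (the gfp handling is
condensed here, and details vary by kernel version):

	static struct iomap_folio_state *ifs_alloc(struct inode *inode,
			struct folio *folio, unsigned int flags)
	{
		struct iomap_folio_state *ifs = folio->private;
		unsigned int nr_blocks = i_blocks_per_folio(inode, folio);

		if (ifs || nr_blocks <= 1)
			return ifs;

		/* one bitmap: first half tracks uptodate, second half dirty */
		ifs = kzalloc(struct_size(ifs, state, BITS_TO_LONGS(2 * nr_blocks)),
			      GFP_NOFS | __GFP_NOFAIL);
		spin_lock_init(&ifs->state_lock);

		/* seed the sub-folio state from the folio-level flags */
		if (folio_test_uptodate(folio))
			bitmap_set(ifs->state, 0, nr_blocks);
		if (folio_test_dirty(folio))
			bitmap_set(ifs->state, nr_blocks, nr_blocks);

		folio_attach_private(folio, ifs);
		return ifs;
	}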
On Fri, Jan 30, 2026 at 05:14:26AM +0000, Matthew Wilcox wrote:
> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
> > Another question is, why only two fses (nfs for dir inode, and orangefs) are
> > utilizing the free_folio() callback.
>
> Alas, secretmem and guest_memfd are also using it. Nevertheless, I'm
> not a fan of this interface existing, and would prefer to not introduce

From the VFS documentation:

"free_folio is called once the folio is no longer visible in the page
cache in order to allow the cleanup of any private data."

That sounds like the perfect solution to exactly the issue JP observed.
So if it is not intended to be used in this way, I think it would be
beneficial to more accurately document the proper (never?) use of
free_folio() to avoid this pitfall.

I will also point out that the usage in orangefs closely matches this
one in spirit: kfree-ing some longer lived object tracking ranges in
the folio.

As for this patch, do you have any suggestion for how to handle the
problem of freeing allocated memory stored in folio->private? To me,
it feels quite wrong to do it in release_folio() if that can end with
a folio that can still be found in the mapping.

Should we have all users that find and lock the folio detect whether
or not private is set up appropriately and re-allocate it if not? Or
should we look up and use the private field only from VFS-triggered
paths (rather than btrfs's internal relocation stuff here)?

> new users. Like launder_folio, which btrfs has also mistakenly used.

For context:
https://lore.kernel.org/linux-btrfs/070b1d025ef6eb292638bb97683cd5c35ffe42eb.1721775142.git.boris@bur.io/

IIRC, the issue was that we call invalidate_inode_pages2 on outstanding
dirty folios while cleaning up a failed transaction, which calls
launder_folio() and release_folio(). At this point, btrfs does not have
private set on the folio, so release_folio() is a no-op. So I used
launder_folio() to clean up otherwise leaked state that was 1:1 with
dirty folios, so it also felt like a "natural" fit.

I apologize for this undesirable usage, but I was genuinely trying to
do the right thing and take advantage of an abstract interface that
very cleanly modeled my problem. At the time, I interpreted the
existence of an interface that neatly fit my problem to be a credit to
the design of VFS. In the future, I will make sure to cc fsdevel before
adding any new usage of the VFS.

At a high level, would it be reasonable to mark the page private when
it is dirtied and the qgroup reservation is made, so that we do go into
release_folio() via invalidate_inode_pages2? I can look into that.

Ultimately, cleaning up qgroup reservation accounting while aborting a
transaction is hardly the most important thing in the world. It's just
something that looks messy while cleaning up the filesystem. So if we
want to get rid of launder_folio() even if I can't figure out how to
make it use release_folio(), all we really need is for someone to ask
nicely and I can probably come up with a hack. :)

Thanks,
Boris
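A purely hypothetical sketch of the "mark the folio private when it is
dirtied" idea above (not part of the posted patch; the .dirty_folio wiring
and the reuse of set_folio_extent_mapped() are assumptions):

	/* hypothetical replacement for .dirty_folio = filemap_dirty_folio */
	static bool btrfs_dirty_folio(struct address_space *mapping,
				      struct folio *folio)
	{
		/*
		 * Re-attach folio->private if release_folio() has already
		 * dropped it; if the allocation fails, the folio is simply
		 * dirtied without private state, as before.
		 */
		set_folio_extent_mapped(folio);

		return filemap_dirty_folio(mapping, folio);
	}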