Hi,
Currently, guest_memfd doesn't update the inode's i_blocks or i_bytes at
all. Hence, st_blocks in the struct stat populated by a userspace fstat()
call on a guest_memfd will always be 0. This patch series makes
guest_memfd track the amount of memory allocated on an inode, which
allows fstat() to accurately report that on requests from userspace.
The inode's i_blocks and i_bytes fields are updated when the folio is
associated or disassociated from the guest_memfd inode, which are at
allocation and truncation times respectively.
To update inode fields at truncation time, this series implements a
custom truncation function for guest_memfd. An alternative would be to
update truncate_inode_pages_range() to return the number of bytes
truncated or add/use some hook.
Implementing a custom truncation function was chosen to provide
flexibility for handling truncations in future when guest_memfd
supports sources of pages other than the buddy allocator. This
approach of a custom truncation function also aligns with shmem, which
has a custom shmem_truncate_range().
To update inode fields at allocation time, kvm_gmem_get_folio() is
simply augmented such that when a folio is added to the filemap, the
size of the folio is updated into inode fields.
The second patch, to use filemap_alloc_folio() during allocation of
guest_memfd folios, was written as a debugging step to resolve a bug
found by syzbot [1], but turned out to not be the fix. I include it
here because it cleans up the allocation process and provides a nice
foundation for updating inode fields during allocations.
The first patch was separately submitted [2], and is provided here since
it is a prerequisite simplification before application of the second
patch.
[1] https://lore.kernel.org/all/29c347bde68ec027259654e8e85371307edf7058.1770148108.git.ackerleytng@google.com/
[2] https://lore.kernel.org/all/20260129172646.2361462-1-ackerleytng@google.com/
Ackerley Tng (10):
KVM: guest_memfd: Don't set FGP_ACCESSED when getting folios
KVM: guest_memfd: Directly allocate folios with filemap_alloc_folio()
mm: truncate: Expose preparation steps for
truncate_inode_pages_final()
KVM: guest_memfd: Implement evict_inode for guest_memfd
mm: Export unmap_mapping_folio() for KVM
mm: filemap: Export filemap_remove_folio()
KVM: guest_memfd: Implement custom truncation function
KVM: guest_memfd: Track amount of memory allocated on inode
KVM: selftests: Wrap fstat() to assert success
KVM: selftests: Test that st_blocks is updated on allocation
include/linux/mm.h | 3 +
mm/filemap.c | 2 +
mm/internal.h | 2 -
mm/memory.c | 2 +
mm/truncate.c | 21 +++-
.../testing/selftests/kvm/guest_memfd_test.c | 32 +++--
.../selftests/kvm/include/kvm_syscalls.h | 2 +
virt/kvm/guest_memfd.c | 116 +++++++++++++++---
8 files changed, 149 insertions(+), 31 deletions(-)
base-commit: b1195183ed42f1522fae3fe44ebee3af437aa000
--
2.53.0.345.g96ddfc5eaa-goog
On 2/23/26 08:04, Ackerley Tng wrote:
> [...snip...]
>
> Implementing a custom truncation function was chosen to provide
> flexibility for handling truncations in future when guest_memfd
> supports sources of pages other than the buddy allocator. This
> approach of a custom truncation function also aligns with shmem, which
> has a custom shmem_truncate_range().

Just wondered how shmem does it: it's through
dquot_alloc_block_nodirty() / dquot_free_block_nodirty().

It's a shame we can't just use folio_free(). Could we maybe have a
different callback (when the mapping is still guaranteed to be around)
from where we could update i_blocks on the freeing path?

--
Cheers,

David
"David Hildenbrand (Arm)" <david@kernel.org> writes:

> On 2/23/26 08:04, Ackerley Tng wrote:
>> [...snip...]
>>
>> Implementing a custom truncation function was chosen to provide
>> flexibility for handling truncations in future when guest_memfd
>> supports sources of pages other than the buddy allocator. This
>> approach of a custom truncation function also aligns with shmem, which
>> has a custom shmem_truncate_range().
>
> Just wondered how shmem does it: it's through
> dquot_alloc_block_nodirty() / dquot_free_block_nodirty().
>
> It's a shame we can't just use folio_free().

Yup, Hugh pointed out that struct address_space *mapping (and inode) may
already have been freed by the time .free_folio() is called [1].

[1] https://lore.kernel.org/all/7c2677e1-daf7-3b49-0a04-1efdf451379a@google.com/

> Could we maybe have a
> different callback (when the mapping is still guaranteed to be around)
> from where we could update i_blocks on the freeing path?

Do you mean that we should add a new callback to struct
address_space_operations?

.invalidate_folio semantically seems suitable. This is called from
truncate_cleanup_folio() and is conditioned on folio_needs_release().
guest_memfd could make itself need release, but IIUC that would cause a
NULL pointer dereference in filemap_release_folio() since
try_to_free_buffers() -> drop_buffers() will dereference folio->private.

From the name, .release_folio sounds eligible, but this is meant for
releasing data attached to a folio, not quite the same as updating inode
fields. This is also not called in the truncation path.

> --
> Cheers,
>
> David
On 2/24/26 00:42, Ackerley Tng wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
>
>> On 2/23/26 08:04, Ackerley Tng wrote:
>>> Hi,
>>>
>>> Currently, guest_memfd doesn't update inode's i_blocks or i_bytes at
>>> all. Hence, st_blocks in the struct populated by a userspace fstat()
>>> call on a guest_memfd will always be 0. This patch series makes
>>> guest_memfd track the amount of memory allocated on an inode, which
>>> allows fstat() to accurately report that on requests from userspace.
>>>
>>> The inode's i_blocks and i_bytes fields are updated when the folio is
>>> associated or disassociated from the guest_memfd inode, which are at
>>> allocation and truncation times respectively.
>>>
>>> To update inode fields at truncation time, this series implements a
>>> custom truncation function for guest_memfd. An alternative would be to
>>> update truncate_inode_pages_range() to return the number of bytes
>>> truncated or add/use some hook.
>>>
>>> Implementing a custom truncation function was chosen to provide
>>> flexibility for handling truncations in future when guest_memfd
>>> supports sources of pages other than the buddy allocator. This
>>> approach of a custom truncation function also aligns with shmem, which
>>> has a custom shmem_truncate_range().
>>
>> Just wondered how shmem does it: it's through
>> dquot_alloc_block_nodirty() / dquot_free_block_nodirty().
>>
>> It's a shame we can't just use folio_free().
>
> Yup, Hugh pointed out that struct address_space *mapping (and inode) may already
> have been freed by the time .free_folio() is called [1].
>
> [1] https://lore.kernel.org/all/7c2677e1-daf7-3b49-0a04-1efdf451379a@google.com/
>
>> Could we maybe have a
>> different callback (when the mapping is still guaranteed to be around)
>> from where we could update i_blocks on the freeing path?
>
> Do you mean that we should add a new callback to struct
> address_space_operations?
If that avoids having to implement truncation completely ourselves, that might be one
option we could discuss, yes.
Something like:
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7c753148af88..94f8bb81f017 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -764,6 +764,7 @@ cache in your filesystem. The following members are defined:
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio) (struct folio *, size_t start, size_t len);
bool (*release_folio)(struct folio *, gfp_t);
+ void (*remove_folio)(struct folio *folio);
void (*free_folio)(struct folio *);
ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
int (*migrate_folio)(struct mapping *, struct folio *dst,
@@ -922,6 +923,11 @@ cache in your filesystem. The following members are defined:
its release_folio will need to ensure this. Possibly it can
clear the uptodate flag if it cannot free private data yet.
+``remove_folio``
+ remove_folio is called just before the folio is removed from the
+ page cache in order to allow the cleanup of properties (e.g.,
+ accounting) that needs the address_space mapping.
+
``free_folio``
free_folio is called once the folio is no longer visible in the
page cache in order to allow the cleanup of any private data.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b3dd145b25e..f7f6930977a1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -422,6 +422,7 @@ struct address_space_operations {
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio) (struct folio *, size_t offset, size_t len);
bool (*release_folio)(struct folio *, gfp_t);
+ void (*remove_folio)(struct folio *folio);
void (*free_folio)(struct folio *folio);
ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 6cd7974d4ada..5a810eaacab2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -250,8 +250,14 @@ void filemap_free_folio(struct address_space *mapping, struct folio *folio)
void filemap_remove_folio(struct folio *folio)
{
struct address_space *mapping = folio->mapping;
+ void (*remove_folio)(struct folio *);
BUG_ON(!folio_test_locked(folio));
+
+ remove_folio = mapping->a_ops->remove_folio;
+ if (unlikely(remove_folio))
+ remove_folio(folio);
+
spin_lock(&mapping->host->i_lock);
xa_lock_irq(&mapping->i_pages);
__filemap_remove_folio(folio, NULL);
Ideally we'd perform it under the lock just after clearing folio->mapping, but I guess that
might be more controversial.
For the accounting you need, the above might be good enough, but I am
not sure how many other use cases there might be.
--
Cheers,
David
"David Hildenbrand (Arm)" <david@kernel.org> writes:
>
> [...snip...]
>
>>> Could we maybe have a
>>> different callback (when the mapping is still guaranteed to be around)
>>> from where we could update i_blocks on the freeing path?
>>
>> Do you mean that we should add a new callback to struct
>> address_space_operations?
>
> If that avoids having to implement truncation completely ourselves, that might be one
> option we could discuss, yes.
>
> Something like:
>
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 7c753148af88..94f8bb81f017 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -764,6 +764,7 @@ cache in your filesystem. The following members are defined:
> sector_t (*bmap)(struct address_space *, sector_t);
> void (*invalidate_folio) (struct folio *, size_t start, size_t len);
> bool (*release_folio)(struct folio *, gfp_t);
> + void (*remove_folio)(struct folio *folio);
> void (*free_folio)(struct folio *);
> ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
> int (*migrate_folio)(struct mapping *, struct folio *dst,
> @@ -922,6 +923,11 @@ cache in your filesystem. The following members are defined:
> its release_folio will need to ensure this. Possibly it can
> clear the uptodate flag if it cannot free private data yet.
>
> +``remove_folio``
> + remove_folio is called just before the folio is removed from the
> + page cache in order to allow the cleanup of properties (e.g.,
> + accounting) that needs the address_space mapping.
> +
> ``free_folio``
> free_folio is called once the folio is no longer visible in the
> page cache in order to allow the cleanup of any private data.
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8b3dd145b25e..f7f6930977a1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -422,6 +422,7 @@ struct address_space_operations {
> sector_t (*bmap)(struct address_space *, sector_t);
> void (*invalidate_folio) (struct folio *, size_t offset, size_t len);
> bool (*release_folio)(struct folio *, gfp_t);
> + void (*remove_folio)(struct folio *folio);
> void (*free_folio)(struct folio *folio);
> ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
> /*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6cd7974d4ada..5a810eaacab2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -250,8 +250,14 @@ void filemap_free_folio(struct address_space *mapping, struct folio *folio)
> void filemap_remove_folio(struct folio *folio)
> {
> struct address_space *mapping = folio->mapping;
> + void (*remove_folio)(struct folio *);
>
> BUG_ON(!folio_test_locked(folio));
> +
> + remove_folio = mapping->a_ops->remove_folio;
> + if (unlikely(remove_folio))
> + remove_folio(folio);
> +
> spin_lock(&mapping->host->i_lock);
> xa_lock_irq(&mapping->i_pages);
> __filemap_remove_folio(folio, NULL);
>
Thanks for this suggestion, I'll try this out and send another revision.
>
> Ideally we'd perform it under the lock just after clearing folio->mapping, but I guess that
> might be more controversial.
>
> For the accounting you need, the above might be good enough, but I am
> not sure how many other use cases there might be.
>
> --
> Cheers,
>
> David
Ackerley Tng <ackerleytng@google.com> writes:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
>
>>
>> [...snip...]
>>
>>>> Could we maybe have a
>>>> different callback (when the mapping is still guaranteed to be around)
>>>> from where we could update i_blocks on the freeing path?
>>>
>>> Do you mean that we should add a new callback to struct
>>> address_space_operations?
>>
>> If that avoids having to implement truncation completely ourselves, that might be one
>> option we could discuss, yes.
>>
>> Something like:
>>
>> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
>> index 7c753148af88..94f8bb81f017 100644
>> --- a/Documentation/filesystems/vfs.rst
>> +++ b/Documentation/filesystems/vfs.rst
>> @@ -764,6 +764,7 @@ cache in your filesystem. The following members are defined:
>> sector_t (*bmap)(struct address_space *, sector_t);
>> void (*invalidate_folio) (struct folio *, size_t start, size_t len);
>> bool (*release_folio)(struct folio *, gfp_t);
>> + void (*remove_folio)(struct folio *folio);
>> void (*free_folio)(struct folio *);
>> ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
>> int (*migrate_folio)(struct mapping *, struct folio *dst,
>> @@ -922,6 +923,11 @@ cache in your filesystem. The following members are defined:
>> its release_folio will need to ensure this. Possibly it can
>> clear the uptodate flag if it cannot free private data yet.
>>
>> +``remove_folio``
>> + remove_folio is called just before the folio is removed from the
>> + page cache in order to allow the cleanup of properties (e.g.,
>> + accounting) that needs the address_space mapping.
>> +
>> ``free_folio``
>> free_folio is called once the folio is no longer visible in the
>> page cache in order to allow the cleanup of any private data.
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 8b3dd145b25e..f7f6930977a1 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -422,6 +422,7 @@ struct address_space_operations {
>> sector_t (*bmap)(struct address_space *, sector_t);
>> void (*invalidate_folio) (struct folio *, size_t offset, size_t len);
>> bool (*release_folio)(struct folio *, gfp_t);
>> + void (*remove_folio)(struct folio *folio);
>> void (*free_folio)(struct folio *folio);
>> ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
>> /*
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 6cd7974d4ada..5a810eaacab2 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -250,8 +250,14 @@ void filemap_free_folio(struct address_space *mapping, struct folio *folio)
>> void filemap_remove_folio(struct folio *folio)
>> {
>> struct address_space *mapping = folio->mapping;
>> + void (*remove_folio)(struct folio *);
>>
>> BUG_ON(!folio_test_locked(folio));
>> +
>> + remove_folio = mapping->a_ops->remove_folio;
>> + if (unlikely(remove_folio))
>> + remove_folio(folio);
>> +
>> spin_lock(&mapping->host->i_lock);
>> xa_lock_irq(&mapping->i_pages);
>> __filemap_remove_folio(folio, NULL);
>>
>
> Thanks for this suggestion, I'll try this out and send another revision.
>
>>
>> Ideally we'd perform it under the lock just after clearing folio->mapping, but I guess that
>> might be more controversial.
>>
I'm not sure which lock you were referring to, I hope it's not the
inode's i_lock? Why is calling the callback under lock frowned upon?
I found .remove_folio also had to be called from
delete_from_page_cache_batch() for it to work. Then I saw that both of
those functions already use filemap_unaccount_folio(), and after all,
like you said, guest_memfd will be using this callback for accounting,
so in RFC v2 [1] I used .unaccount_folio instead, and it is called under
the inode's i_lock from filemap_unaccount_folio().
[1] https://lore.kernel.org/all/20260225-gmem-st-blocks-v2-0-87d7098119a9@google.com/T/
>> For the accounting you need, the above might be good enough, but I am
>> not sure how many other use cases there might be.
>>
>> --
>> Cheers,
>>
>> David
On 2/25/26 08:31, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:
>
>> "David Hildenbrand (Arm)" <david@kernel.org> writes:
>>
>>>
>>> [...snip...]
>>>
>>>
>>> If that avoids having to implement truncation completely ourselves, that might be one
>>> option we could discuss, yes.
>>>
>>> Something like:
>>>
>>> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
>>> index 7c753148af88..94f8bb81f017 100644
>>> --- a/Documentation/filesystems/vfs.rst
>>> +++ b/Documentation/filesystems/vfs.rst
>>> @@ -764,6 +764,7 @@ cache in your filesystem. The following members are defined:
>>> sector_t (*bmap)(struct address_space *, sector_t);
>>> void (*invalidate_folio) (struct folio *, size_t start, size_t len);
>>> bool (*release_folio)(struct folio *, gfp_t);
>>> + void (*remove_folio)(struct folio *folio);
>>> void (*free_folio)(struct folio *);
>>> ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
>>> int (*migrate_folio)(struct mapping *, struct folio *dst,
>>> @@ -922,6 +923,11 @@ cache in your filesystem. The following members are defined:
>>> its release_folio will need to ensure this. Possibly it can
>>> clear the uptodate flag if it cannot free private data yet.
>>>
>>> +``remove_folio``
>>> + remove_folio is called just before the folio is removed from the
>>> + page cache in order to allow the cleanup of properties (e.g.,
>>> + accounting) that needs the address_space mapping.
>>> +
>>> ``free_folio``
>>> free_folio is called once the folio is no longer visible in the
>>> page cache in order to allow the cleanup of any private data.
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 8b3dd145b25e..f7f6930977a1 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -422,6 +422,7 @@ struct address_space_operations {
>>> sector_t (*bmap)(struct address_space *, sector_t);
>>> void (*invalidate_folio) (struct folio *, size_t offset, size_t len);
>>> bool (*release_folio)(struct folio *, gfp_t);
>>> + void (*remove_folio)(struct folio *folio);
>>> void (*free_folio)(struct folio *folio);
>>> ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
>>> /*
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 6cd7974d4ada..5a810eaacab2 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -250,8 +250,14 @@ void filemap_free_folio(struct address_space *mapping, struct folio *folio)
>>> void filemap_remove_folio(struct folio *folio)
>>> {
>>> struct address_space *mapping = folio->mapping;
>>> + void (*remove_folio)(struct folio *);
>>>
>>> BUG_ON(!folio_test_locked(folio));
>>> +
>>> + remove_folio = mapping->a_ops->remove_folio;
>>> + if (unlikely(remove_folio))
>>> + remove_folio(folio);
>>> +
>>> spin_lock(&mapping->host->i_lock);
>>> xa_lock_irq(&mapping->i_pages);
>>> __filemap_remove_folio(folio, NULL);
>>>
>>
>> Thanks for this suggestion, I'll try this out and send another revision.
>>
>>>
>>> Ideally we'd perform it under the lock just after clearing folio->mapping, but I guess that
>>> might be more controversial.
>>>
>
> I'm not sure which lock you were referring to, I hope it's not the
> inode's i_lock? Why is calling the callback under lock frowned upon?
I meant the two locks: mapping->host->i_lock and mapping->i_pages.
I'd assume new callbacks that might result in holding these precious
locks longer might be a problem for some people. Well, maybe, maybe not.
I guess .free_folio() is called outside the lock because it's assumed to
possibly do more expensive operations.
--
Cheers,
David
"David Hildenbrand (Arm)" <david@kernel.org> writes:
> On 2/25/26 08:31, Ackerley Tng wrote:
>> Ackerley Tng <ackerleytng@google.com> writes:
>>
>>> "David Hildenbrand (Arm)" <david@kernel.org> writes:
>>>
>>>>
>>>> [...snip...]
>>>>
>>>>
>>>> If that avoids having to implement truncation completely ourselves, that might be one
>>>> option we could discuss, yes.
>>>>
>>>> Something like:
>>>>
>>>> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
>>>> index 7c753148af88..94f8bb81f017 100644
>>>> --- a/Documentation/filesystems/vfs.rst
>>>> +++ b/Documentation/filesystems/vfs.rst
>>>> @@ -764,6 +764,7 @@ cache in your filesystem. The following members are defined:
>>>> sector_t (*bmap)(struct address_space *, sector_t);
>>>> void (*invalidate_folio) (struct folio *, size_t start, size_t len);
>>>> bool (*release_folio)(struct folio *, gfp_t);
>>>> + void (*remove_folio)(struct folio *folio);
>>>> void (*free_folio)(struct folio *);
>>>> ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
>>>> int (*migrate_folio)(struct mapping *, struct folio *dst,
>>>> @@ -922,6 +923,11 @@ cache in your filesystem. The following members are defined:
>>>> its release_folio will need to ensure this. Possibly it can
>>>> clear the uptodate flag if it cannot free private data yet.
>>>>
>>>> +``remove_folio``
>>>> + remove_folio is called just before the folio is removed from the
>>>> + page cache in order to allow the cleanup of properties (e.g.,
>>>> + accounting) that needs the address_space mapping.
>>>> +
>>>> ``free_folio``
>>>> free_folio is called once the folio is no longer visible in the
>>>> page cache in order to allow the cleanup of any private data.
>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>> index 8b3dd145b25e..f7f6930977a1 100644
>>>> --- a/include/linux/fs.h
>>>> +++ b/include/linux/fs.h
>>>> @@ -422,6 +422,7 @@ struct address_space_operations {
>>>> sector_t (*bmap)(struct address_space *, sector_t);
>>>> void (*invalidate_folio) (struct folio *, size_t offset, size_t len);
>>>> bool (*release_folio)(struct folio *, gfp_t);
>>>> + void (*remove_folio)(struct folio *folio);
>>>> void (*free_folio)(struct folio *folio);
>>>> ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
>>>> /*
>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>> index 6cd7974d4ada..5a810eaacab2 100644
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -250,8 +250,14 @@ void filemap_free_folio(struct address_space *mapping, struct folio *folio)
>>>> void filemap_remove_folio(struct folio *folio)
>>>> {
>>>> struct address_space *mapping = folio->mapping;
>>>> + void (*remove_folio)(struct folio *);
>>>>
>>>> BUG_ON(!folio_test_locked(folio));
>>>> +
>>>> + remove_folio = mapping->a_ops->remove_folio;
>>>> + if (unlikely(remove_folio))
>>>> + remove_folio(folio);
>>>> +
>>>> spin_lock(&mapping->host->i_lock);
>>>> xa_lock_irq(&mapping->i_pages);
>>>> __filemap_remove_folio(folio, NULL);
>>>>
>>>
>>> Thanks for this suggestion, I'll try this out and send another revision.
>>>
>>>>
>>>> Ideally we'd perform it under the lock just after clearing folio->mapping, but I guess that
>>>> might be more controversial.
>>>>
>>
>> I'm not sure which lock you were referring to, I hope it's not the
>> inode's i_lock? Why is calling the callback under lock frowned upon?
>
> I meant the two locks: mapping->host->i_lock and mapping->i_pages.
>
> I'd assume new callbacks that might result in holding these precious
> locks longer might be a problem for some people. Well, maybe, maybe not.
>
The extra time (for guest_memfd; other filesystems would see almost no
extra time) is on the truncation path, and hopefully that isn't a hot
path!
> I guess .free_folio() is called outside the lock because it's assumed to
> possibly do more expensive operations.
>
I thought .free_folio() was called outside of the lock because after the
folio is removed from the filemap, there should be no more inode/filemap
related contention, so any cleanup can definitely be done outside the
inode/filemap locks.
> --
> Cheers,
>
> David