[RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 1 month, 2 weeks ago
Hi,

This RFC patch series attempts to support large folios for tmpfs. 

Considering that tmpfs already has the 'huge=' option to control the THP
allocation, it is necessary to maintain compatibility with the 'huge='
option, as well as considering the 'deny' and 'force' option controlled
by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.

Add a new huge option 'write_size' to support large folio allocation based
on the write size for the tmpfs write and fallocate paths. The huge page
allocation strategy for tmpfs is therefore: if the 'huge=' option
(huge=always/within_size/advise) is enabled, or the 'shmem_enabled' option
is 'force', only PMD-sized THP is allowed, to keep backward compatibility
for tmpfs. If the 'huge=' option is disabled (huge=never), or the
'shmem_enabled' option is 'deny', all large folio allocations remain
disabled. Only when the 'huge=' option is 'write_size' are large folios
allocated based on the write size.
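
For illustration, a rough sketch of this policy (the helper name is made
up and this is not the actual mm/shmem.c code; eligibility checks such
as i_size for within_size are elided):

/* Illustrative sketch only, not the actual mm/shmem.c implementation. */
static unsigned long tmpfs_allowable_orders(int huge_opt, size_t write_len)
{
	/* huge=never or shmem_enabled=deny: no large folios at all */
	if (huge_opt == SHMEM_HUGE_NEVER || huge_opt == SHMEM_HUGE_DENY)
		return 0;

	/* existing options keep the old PMD-only behaviour */
	if (huge_opt == SHMEM_HUGE_ALWAYS ||
	    huge_opt == SHMEM_HUGE_WITHIN_SIZE ||
	    huge_opt == SHMEM_HUGE_ADVISE ||
	    huge_opt == SHMEM_HUGE_FORCE)
		return BIT(PMD_ORDER);

	/* huge=write_size: an order large enough to cover the write */
	return BIT(get_order(write_len));
}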

I think the 'huge=write_size' option should become the default behavior
for tmpfs in the future.

Any comments and suggestions are appreciated. Thanks.

Changes from RFC v2:
 - Drop mTHP interfaces to control huge page allocation, per Matthew.
 - Add a new helper to calculate the order, suggested by Matthew.
 - Add a new huge=write_size option to allocate large folios based on
   the write size.
 - Add a new patch to update the documentation.

Changes from RFC v1:
 - Drop patch 1.
 - Use 'write_end' to calculate the length in shmem_allowable_huge_orders().
 - Update shmem_mapping_size_order() per Daniel.

Baolin Wang (4):
  mm: factor out the order calculation into a new helper
  mm: shmem: change shmem_huge_global_enabled() to return huge order
    bitmap
  mm: shmem: add large folio support to the write and fallocate paths
    for tmpfs
  docs: tmpfs: add documentation for 'write_size' huge option

 Documentation/filesystems/tmpfs.rst |   7 +-
 include/linux/pagemap.h             |  16 ++++-
 mm/shmem.c                          | 105 ++++++++++++++++++++--------
 3 files changed, 94 insertions(+), 34 deletions(-)

-- 
2.39.3
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Matthew Wilcox 1 month, 1 week ago
On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> Considering that tmpfs already has the 'huge=' option to control the THP
> allocation, it is necessary to maintain compatibility with the 'huge='
> option, as well as considering the 'deny' and 'force' option controlled
> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.

No, it's not.  No other filesystem honours these settings.  tmpfs would
not have had these settings if it were written today.  It should simply
ignore them, the way that NFS ignores the "intr" mount option now that
we have a better solution to the original problem.

To reiterate my position:

 - When using tmpfs as a filesystem, it should behave like other
   filesystems.
 - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
   behave like anonymous memory.

No more special mount options.
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 1 month, 1 week ago
+ Kirill

On 2024/10/16 22:06, Matthew Wilcox wrote:
> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>> Considering that tmpfs already has the 'huge=' option to control the THP
>> allocation, it is necessary to maintain compatibility with the 'huge='
>> option, as well as considering the 'deny' and 'force' option controlled
>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> 
> No, it's not.  No other filesystem honours these settings.  tmpfs would
> not have had these settings if it were written today.  It should simply
> ignore them, the way that NFS ignores the "intr" mount option now that
> we have a better solution to the original problem.
> 
> To reiterate my position:
> 
>   - When using tmpfs as a filesystem, it should behave like other
>     filesystems.
>   - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>     behave like anonymous memory.

I do agree with your point to some extent, but the ‘huge=’ option has 
existed for nearly 8 years, and the huge orders based on write size may 
not achieve the performance of PMD-sized THP in some scenarios, such as 
when the write length is consistently 4K. So, I am still concerned that 
ignoring the 'huge' option could lead to compatibility issues.

Another possible choice is to make the huge pages allocation based on 
write size as the *default* behavior for tmpfs, while marking the 
'huge=' option as deprecated and gradually removing it if there are no 
user complaints about performance issues.

Let's also see what Hugh and Kirill think.

Hugh, Kirill, do you have any inputs?
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Kirill A. Shutemov 1 month, 1 week ago
On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
> + Kirill
> 
> On 2024/10/16 22:06, Matthew Wilcox wrote:
> > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> > > Considering that tmpfs already has the 'huge=' option to control the THP
> > > allocation, it is necessary to maintain compatibility with the 'huge='
> > > option, as well as considering the 'deny' and 'force' option controlled
> > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> > 
> > No, it's not.  No other filesystem honours these settings.  tmpfs would
> > not have had these settings if it were written today.  It should simply
> > ignore them, the way that NFS ignores the "intr" mount option now that
> > we have a better solution to the original problem.
> > 
> > To reiterate my position:
> > 
> >   - When using tmpfs as a filesystem, it should behave like other
> >     filesystems.
> >   - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
> >     behave like anonymous memory.
> 
> I do agree with your point to some extent, but the ‘huge=’ option has
> existed for nearly 8 years, and the huge orders based on write size may not
> achieve the performance of PMD-sized THP in some scenarios, such as when the
> write length is consistently 4K. So, I am still concerned that ignoring the
> 'huge' option could lead to compatibility issues.

Yeah, I don't think we are there yet to ignore the mount option.

Maybe we need to get a new generic interface to request the semantics
tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
handles to make kernel allocate PMD-size folio on any allocation or on
allocations within i_size. I think this behaviour is useful beyond tmpfs.

Then huge= implementation for tmpfs can be re-defined to set these
per-inode FADV_ flags by default. This way we can keep tmpfs compatible
with current deployments and less special comparing to rest of
filesystems on kernel side.
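
For illustration, hypothetical userspace usage of such an interface
(FADV_HUGEPAGE does not exist in today's kernels; the flag name and
semantics are assumptions, mirroring madvise(MADV_HUGEPAGE) per inode):

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical sketch: FADV_HUGEPAGE is a proposed flag, not a real one. */
static int hint_file_huge(const char *path)
{
	int fd = open(path, O_RDWR);
	int ret;

	if (fd < 0)
		return -1;
	/* ask the kernel to back this file's cache with PMD-size folios */
	ret = posix_fadvise(fd, 0, 0, FADV_HUGEPAGE);
	close(fd);
	return ret;
}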

If huge= is not set, tmpfs would behave the same way as the rest of
filesystems.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 1 month ago

On 2024/10/17 19:26, Kirill A. Shutemov wrote:
> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>> + Kirill
>>
>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>
>>> No, it's not.  No other filesystem honours these settings.  tmpfs would
>>> not have had these settings if it were written today.  It should simply
>>> ignore them, the way that NFS ignores the "intr" mount option now that
>>> we have a better solution to the original problem.
>>>
>>> To reiterate my position:
>>>
>>>    - When using tmpfs as a filesystem, it should behave like other
>>>      filesystems.
>>>    - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>>>      behave like anonymous memory.
>>
>> I do agree with your point to some extent, but the ‘huge=’ option has
>> existed for nearly 8 years, and the huge orders based on write size may not
>> achieve the performance of PMD-sized THP in some scenarios, such as when the
>> write length is consistently 4K. So, I am still concerned that ignoring the
>> 'huge' option could lead to compatibility issues.
> 
> Yeah, I don't think we are there yet to ignore the mount option.

OK.

> Maybe we need to get a new generic interface to request the semantics
> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
> handles to make kernel allocate PMD-size folio on any allocation or on
> allocations within i_size. I think this behaviour is useful beyond tmpfs.
> 
> Then huge= implementation for tmpfs can be re-defined to set these
> per-inode FADV_ flags by default. This way we can keep tmpfs compatible
> with current deployments and less special comparing to rest of
> filesystems on kernel side.

I did a quick search, and I didn't find any other fs that require 
PMD-sized huge pages, so I am not sure if FADV_* is useful for 
filesystems other than tmpfs. Please correct me if I missed something.

> If huge= is not set, tmpfs would behave the same way as the rest of
> filesystems.

So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate 
large folios based on the write size? If yes, that means it will change 
the default huge behavior for tmpfs. Because previously having 'huge=' 
is not set means the huge option is 'SHMEM_HUGE_NEVER', which is similar 
to what I mentioned:
"Another possible choice is to make the huge pages allocation based on 
write size as the *default* behavior for tmpfs, ..."
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Kirill A. Shutemov 1 month ago
On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
> 
> 
> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
> > On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
> > > + Kirill
> > > 
> > > On 2024/10/16 22:06, Matthew Wilcox wrote:
> > > > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> > > > > Considering that tmpfs already has the 'huge=' option to control the THP
> > > > > allocation, it is necessary to maintain compatibility with the 'huge='
> > > > > option, as well as considering the 'deny' and 'force' option controlled
> > > > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> > > > 
> > > > No, it's not.  No other filesystem honours these settings.  tmpfs would
> > > > not have had these settings if it were written today.  It should simply
> > > > ignore them, the way that NFS ignores the "intr" mount option now that
> > > > we have a better solution to the original problem.
> > > > 
> > > > To reiterate my position:
> > > > 
> > > >    - When using tmpfs as a filesystem, it should behave like other
> > > >      filesystems.
> > > >    - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
> > > >      behave like anonymous memory.
> > > 
> > > I do agree with your point to some extent, but the ‘huge=’ option has
> > > existed for nearly 8 years, and the huge orders based on write size may not
> > > achieve the performance of PMD-sized THP in some scenarios, such as when the
> > > write length is consistently 4K. So, I am still concerned that ignoring the
> > > 'huge' option could lead to compatibility issues.
> > 
> > Yeah, I don't think we are there yet to ignore the mount option.
> 
> OK.
> 
> > Maybe we need to get a new generic interface to request the semantics
> > tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
> > handles to make kernel allocate PMD-size folio on any allocation or on
> > allocations within i_size. I think this behaviour is useful beyond tmpfs.
> > 
> > Then huge= implementation for tmpfs can be re-defined to set these
> > per-inode FADV_ flags by default. This way we can keep tmpfs compatible
> > with current deployments and less special comparing to rest of
> > filesystems on kernel side.
> 
> I did a quick search, and I didn't find any other fs that require PMD-sized
> huge pages, so I am not sure if FADV_* is useful for filesystems other than
> tmpfs. Please correct me if I missed something.

What do you mean by "require"? THPs are always opportunistic.

IIUC, we don't have a way to hint kernel to use huge pages for a file on
read from backing storage. Readahead is not always the right way.

> > If huge= is not set, tmpfs would behave the same way as the rest of
> > filesystems.
> 
> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
> folios based on the write size? If yes, that means it will change the
> default huge behavior for tmpfs. Because previously having 'huge=' is not
> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
> mentioned:
> "Another possible choice is to make the huge pages allocation based on write
> size as the *default* behavior for tmpfs, ..."

I am more worried about breaking existing users of huge pages. So changing
behaviour of users who don't specify huge is okay to me.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 1 month ago

On 2024/10/21 16:54, Kirill A. Shutemov wrote:
> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>
>>
>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>> + Kirill
>>>>
>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>
>>>>> No, it's not.  No other filesystem honours these settings.  tmpfs would
>>>>> not have had these settings if it were written today.  It should simply
>>>>> ignore them, the way that NFS ignores the "intr" mount option now that
>>>>> we have a better solution to the original problem.
>>>>>
>>>>> To reiterate my position:
>>>>>
>>>>>     - When using tmpfs as a filesystem, it should behave like other
>>>>>       filesystems.
>>>>>     - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>>>>>       behave like anonymous memory.
>>>>
>>>> I do agree with your point to some extent, but the ‘huge=’ option has
>>>> existed for nearly 8 years, and the huge orders based on write size may not
>>>> achieve the performance of PMD-sized THP in some scenarios, such as when the
>>>> write length is consistently 4K. So, I am still concerned that ignoring the
>>>> 'huge' option could lead to compatibility issues.
>>>
>>> Yeah, I don't think we are there yet to ignore the mount option.
>>
>> OK.
>>
>>> Maybe we need to get a new generic interface to request the semantics
>>> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>>> handles to make kernel allocate PMD-size folio on any allocation or on
>>> allocations within i_size. I think this behaviour is useful beyond tmpfs.
>>>
>>> Then huge= implementation for tmpfs can be re-defined to set these
>>> per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>>> with current deployments and less special comparing to rest of
>>> filesystems on kernel side.
>>
>> I did a quick search, and I didn't find any other fs that require PMD-sized
>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>> tmpfs. Please correct me if I missed something.
> 
> What do you mean by "require"? THPs are always opportunistic.
> 
> IIUC, we don't have a way to hint kernel to use huge pages for a file on
> read from backing storage. Readahead is not always the right way.

IIUC, most filesystems use a method similar to iomap buffered IO (see 
iomap_get_folio()) to allocate huge pages. What I mean is that it would 
be better to have a real use case for adding a hint for allocating THP 
(other than tmpfs).
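
For reference, the essence of that path, paraphrased and simplified (the
helper below is made up; fgf_set_order() and __filemap_get_folio() are
the real page-cache interfaces it builds on):

/* Simplified paraphrase of the iomap buffered-write folio lookup: the
 * requested folio order is derived from the write length, so a large
 * write naturally asks the page cache for a large folio. */
static struct folio *get_write_folio(struct address_space *mapping,
				     loff_t pos, size_t len)
{
	fgf_t fgp = FGP_WRITEBEGIN | fgf_set_order(len);

	return __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
				   mapping_gfp_mask(mapping));
}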

>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>> filesystems.
>>
>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>> folios based on the write size? If yes, that means it will change the
>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>> mentioned:
>> "Another possible choice is to make the huge pages allocation based on write
>> size as the *default* behavior for tmpfs, ..."
> 
> I am more worried about breaking existing users of huge pages. So changing
> behaviour of users who don't specify huge is okay to me.

OK. Good.
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Kirill A. Shutemov 1 month ago
On Tue, Oct 22, 2024 at 11:34:14AM +0800, Baolin Wang wrote:
> IIUC, most file systems use method similar to iomap buffered IO (see
> iomap_get_folio()) to allocate huge pages. What I mean is that, it would be
> better to have a real use case to add a hint for allocating THP (other than
> tmpfs).

It would be nice to hear from folks who work with production systems
what the actual needs are.

But I find asymmetry between MADV_ hints and FADV_ hints wrt huge pages
not justified. I think it would be easy to find use-cases for
FADV_HUGEPAGE/FADV_NOHUGEPAGE.

Furthermore, I think it would be useful to have some kind of mechanism to
make these hints persistent: any open of a file would have these hints set
by default based on inode metadata on backing storage. Although I am not
sure what the right way to achieve that is. xattrs?
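
For illustration, one hypothetical shape of the userspace side (the
xattr name and its interpretation are entirely made up; the idea is that
open() would read such metadata and apply the equivalent hint
automatically):

#include <sys/xattr.h>

/* Hypothetical sketch only: "user.mem.huge" is an invented xattr name. */
static int persist_huge_hint(const char *path)
{
	/* a later open() would pick this up and hint the inode accordingly */
	return setxattr(path, "user.mem.huge", "always", 6, 0);
}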

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 1 month ago

On 2024/10/22 18:06, Kirill A. Shutemov wrote:
> On Tue, Oct 22, 2024 at 11:34:14AM +0800, Baolin Wang wrote:
>> IIUC, most file systems use method similar to iomap buffered IO (see
>> iomap_get_folio()) to allocate huge pages. What I mean is that, it would be
>> better to have a real use case to add a hint for allocating THP (other than
>> tmpfs).
> 
> I would be nice to hear from folks who works with production what the
> actual needs are.
> 
> But I find asymmetry between MADV_ hints and FADV_ hints wrt huge pages
> not justified. I think it would be easy to find use-cases for
> FADV_HUGEPAGE/FADV_NOHUGEPAGE.
> 
> Furthermore I think it would be useful to have some kind of mechanism to
> make these hints persistent: any open of a file would have these hints set
> by default based on inode metadata on backing storage. Although, I am not
> sure what the right way to archive that. xattrs?

Maybe we can re-use mapping_set_folio_order_range()?
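
For context, that helper constrains which folio orders the page cache
may allocate for a mapping. Roughly, a persisted hint could be applied
at inode setup time like this (sketch; the caller and 'want_pmd' flag
are illustrative, assuming the hint was already read from storage):

/* Sketch: apply a persisted hint by bounding the mapping's folio orders. */
static void apply_huge_hint(struct inode *inode, bool want_pmd)
{
	unsigned int max_order = want_pmd ? PMD_ORDER : 0;

	mapping_set_folio_order_range(inode->i_mapping, 0, max_order);
}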
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Daniel Gomez 1 month ago
On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>> 
>> 
>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>> > On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>> > > + Kirill
>> > > 
>> > > On 2024/10/16 22:06, Matthew Wilcox wrote:
>> > > > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>> > > > > Considering that tmpfs already has the 'huge=' option to control the THP
>> > > > > allocation, it is necessary to maintain compatibility with the 'huge='
>> > > > > option, as well as considering the 'deny' and 'force' option controlled
>> > > > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>> > > > 
>> > > > No, it's not.  No other filesystem honours these settings.  tmpfs would
>> > > > not have had these settings if it were written today.  It should simply
>> > > > ignore them, the way that NFS ignores the "intr" mount option now that
>> > > > we have a better solution to the original problem.
>> > > > 
>> > > > To reiterate my position:
>> > > > 
>> > > >    - When using tmpfs as a filesystem, it should behave like other
>> > > >      filesystems.
>> > > >    - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>> > > >      behave like anonymous memory.
>> > > 
>> > > I do agree with your point to some extent, but the ‘huge=’ option has
>> > > existed for nearly 8 years, and the huge orders based on write size may not
>> > > achieve the performance of PMD-sized THP in some scenarios, such as when the
>> > > write length is consistently 4K. So, I am still concerned that ignoring the
>> > > 'huge' option could lead to compatibility issues.
>> > 
>> > Yeah, I don't think we are there yet to ignore the mount option.
>> 
>> OK.
>> 
>> > Maybe we need to get a new generic interface to request the semantics
>> > tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>> > handles to make kernel allocate PMD-size folio on any allocation or on
>> > allocations within i_size. I think this behaviour is useful beyond tmpfs.
>> > 
>> > Then huge= implementation for tmpfs can be re-defined to set these
>> > per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>> > with current deployments and less special comparing to rest of
>> > filesystems on kernel side.
>> 
>> I did a quick search, and I didn't find any other fs that require PMD-sized
>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>> tmpfs. Please correct me if I missed something.
>
> What do you mean by "require"? THPs are always opportunistic.
>
> IIUC, we don't have a way to hint kernel to use huge pages for a file on
> read from backing storage. Readahead is not always the right way.
>
>> > If huge= is not set, tmpfs would behave the same way as the rest of
>> > filesystems.
>> 
>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>> folios based on the write size? If yes, that means it will change the
>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>> mentioned:
>> "Another possible choice is to make the huge pages allocation based on write
>> size as the *default* behavior for tmpfs, ..."
>
> I am more worried about breaking existing users of huge pages. So changing
> behaviour of users who don't specify huge is okay to me.

I think moving tmpfs to allocate large folios opportunistically by
default (as it was proposed initially) doesn't necessarily conflict with
the default behaviour (huge=never). We just need to clarify that in
the documentation.

However, and IIRC, one of the requests from Hugh was to have a way to
disable large folios, which is something other filesystems do not offer
control over as of today. Ryan sent a proposal to actually control that
globally, but I think it didn't move forward. So, what are we missing to
go back to implementing large folios in tmpfs in the default case, like
any other fs leveraging large folios?
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 1 month ago

On 2024/10/21 21:34, Daniel Gomez wrote:
> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>> + Kirill
>>>>>
>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>>>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>
>>>>>> No, it's not.  No other filesystem honours these settings.  tmpfs would
>>>>>> not have had these settings if it were written today.  It should simply
>>>>>> ignore them, the way that NFS ignores the "intr" mount option now that
>>>>>> we have a better solution to the original problem.
>>>>>>
>>>>>> To reiterate my position:
>>>>>>
>>>>>>     - When using tmpfs as a filesystem, it should behave like other
>>>>>>       filesystems.
>>>>>>     - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>>>>>>       behave like anonymous memory.
>>>>>
>>>>> I do agree with your point to some extent, but the ‘huge=’ option has
>>>>> existed for nearly 8 years, and the huge orders based on write size may not
>>>>> achieve the performance of PMD-sized THP in some scenarios, such as when the
>>>>> write length is consistently 4K. So, I am still concerned that ignoring the
>>>>> 'huge' option could lead to compatibility issues.
>>>>
>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>
>>> OK.
>>>
>>>> Maybe we need to get a new generic interface to request the semantics
>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>>>> handles to make kernel allocate PMD-size folio on any allocation or on
>>>> allocations within i_size. I think this behaviour is useful beyond tmpfs.
>>>>
>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>> per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>>>> with current deployments and less special comparing to rest of
>>>> filesystems on kernel side.
>>>
>>> I did a quick search, and I didn't find any other fs that require PMD-sized
>>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>>> tmpfs. Please correct me if I missed something.
>>
>> What do you mean by "require"? THPs are always opportunistic.
>>
>> IIUC, we don't have a way to hint kernel to use huge pages for a file on
>> read from backing storage. Readahead is not always the right way.
>>
>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>> filesystems.
>>>
>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>>> folios based on the write size? If yes, that means it will change the
>>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>>> mentioned:
>>> "Another possible choice is to make the huge pages allocation based on write
>>> size as the *default* behavior for tmpfs, ..."
>>
>> I am more worried about breaking existing users of huge pages. So changing
>> behaviour of users who don't specify huge is okay to me.
> 
> I think moving tmpfs to allocate large folios opportunistically by
> default (as it was proposed initially) doesn't necessary conflict with
> the default behaviour (huge=never). We just need to clarify that in
> the documentation.
> 
> However, and IIRC, one of the requests from Hugh was to have a way to
> disable large folios which is something other FS do not have control
> of as of today. Ryan sent a proposal to actually control that globally
> but I think it didn't move forward. So, what are we missing to go back
> to implement large folios in tmpfs in the default case, as any other fs
> leveraging large folios?

IMHO, as I discussed with Kirill, we still need to maintain compatibility 
with the 'huge=' mount option. This means that if 'huge=never' is set 
for tmpfs, huge page allocation will still be prohibited (which can 
address Hugh's request?). However, if 'huge=' is not set, we can 
allocate large folios based on the write size.
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by David Hildenbrand 1 month ago
On 22.10.24 05:41, Baolin Wang wrote:
> 
> 
> On 2024/10/21 21:34, Daniel Gomez wrote:
>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>>> + Kirill
>>>>>>
>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>>>>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>>>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>
>>>>>>> No, it's not.  No other filesystem honours these settings.  tmpfs would
>>>>>>> not have had these settings if it were written today.  It should simply
>>>>>>> ignore them, the way that NFS ignores the "intr" mount option now that
>>>>>>> we have a better solution to the original problem.
>>>>>>>
>>>>>>> To reiterate my position:
>>>>>>>
>>>>>>>      - When using tmpfs as a filesystem, it should behave like other
>>>>>>>        filesystems.
>>>>>>>      - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>>>>>>>        behave like anonymous memory.
>>>>>>
>>>>>> I do agree with your point to some extent, but the ‘huge=’ option has
>>>>>> existed for nearly 8 years, and the huge orders based on write size may not
>>>>>> achieve the performance of PMD-sized THP in some scenarios, such as when the
>>>>>> write length is consistently 4K. So, I am still concerned that ignoring the
>>>>>> 'huge' option could lead to compatibility issues.
>>>>>
>>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>>
>>>> OK.
>>>>
>>>>> Maybe we need to get a new generic interface to request the semantics
>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>>>>> handles to make kernel allocate PMD-size folio on any allocation or on
>>>>> allocations within i_size. I think this behaviour is useful beyond tmpfs.
>>>>>
>>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>>>>> with current deployments and less special comparing to rest of
>>>>> filesystems on kernel side.
>>>>
>>>> I did a quick search, and I didn't find any other fs that require PMD-sized
>>>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>>>> tmpfs. Please correct me if I missed something.
>>>
>>> What do you mean by "require"? THPs are always opportunistic.
>>>
>>> IIUC, we don't have a way to hint kernel to use huge pages for a file on
>>> read from backing storage. Readahead is not always the right way.
>>>
>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>>> filesystems.
>>>>
>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>>>> folios based on the write size? If yes, that means it will change the
>>>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>>>> mentioned:
>>>> "Another possible choice is to make the huge pages allocation based on write
>>>> size as the *default* behavior for tmpfs, ..."
>>>
>>> I am more worried about breaking existing users of huge pages. So changing
>>> behaviour of users who don't specify huge is okay to me.
>>
>> I think moving tmpfs to allocate large folios opportunistically by
>> default (as it was proposed initially) doesn't necessary conflict with
>> the default behaviour (huge=never). We just need to clarify that in
>> the documentation.
>>
>> However, and IIRC, one of the requests from Hugh was to have a way to
>> disable large folios which is something other FS do not have control
>> of as of today. Ryan sent a proposal to actually control that globally
>> but I think it didn't move forward. So, what are we missing to go back
>> to implement large folios in tmpfs in the default case, as any other fs
>> leveraging large folios?
> 
> IMHO, as I discussed with Kirill, we still need maintain compatibility
> with the 'huge=' mount option. This means that if 'huge=never' is set
> for tmpfs, huge page allocation will still be prohibited (which can
> address Hugh's request?). However, if 'huge=' is not set, we can
> allocate large folios based on the write size.

I consider allocating large folios in shmem/tmpfs on the write path less 
controversial than allocating them on the page fault path -- especially 
as long as we stay within the size to-be-written.

I think THP on shmem/tmpfs is disabled by default in RHEL (e.g., 
shmem_enabled=never), maybe because of some rather undesired 
side-effects (maybe some are historical?): I recall issues with VMs with 
THP + memory ballooning, as we cannot reclaim pages of folios if 
splitting fails. I assume most of these problematic use cases don't use 
tmpfs as an ordinary file system (write()/read()), but mmap() the whole 
thing.

Sadly, I can't find any information about shmem/tmpfs + THP in the RHEL 
documentation; most documentation is only concerned with anon THP, which 
makes me conclude that it is not recommended as of now.

I see more issues with allocating them on the page fault path and not 
having a way to disable it -- compared to allocating them on the write() 
path.

Getting Hugh's opinion in this would be very valuable.

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 1 month ago

On 2024/10/22 23:31, David Hildenbrand wrote:
> On 22.10.24 05:41, Baolin Wang wrote:
>>
>>
>> On 2024/10/21 21:34, Daniel Gomez wrote:
>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>>>> + Kirill
>>>>>>>
>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>>>> Considering that tmpfs already has the 'huge=' option to 
>>>>>>>>> control the THP
>>>>>>>>> allocation, it is necessary to maintain compatibility with the 
>>>>>>>>> 'huge='
>>>>>>>>> option, as well as considering the 'deny' and 'force' option 
>>>>>>>>> controlled
>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>
>>>>>>>> No, it's not.  No other filesystem honours these settings.  
>>>>>>>> tmpfs would
>>>>>>>> not have had these settings if it were written today.  It should 
>>>>>>>> simply
>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option 
>>>>>>>> now that
>>>>>>>> we have a better solution to the original problem.
>>>>>>>>
>>>>>>>> To reiterate my position:
>>>>>>>>
>>>>>>>>      - When using tmpfs as a filesystem, it should behave like 
>>>>>>>> other
>>>>>>>>        filesystems.
>>>>>>>>      - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, 
>>>>>>>> it should
>>>>>>>>        behave like anonymous memory.
>>>>>>>
>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option 
>>>>>>> has
>>>>>>> existed for nearly 8 years, and the huge orders based on write 
>>>>>>> size may not
>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such 
>>>>>>> as when the
>>>>>>> write length is consistently 4K. So, I am still concerned that 
>>>>>>> ignoring the
>>>>>>> 'huge' option could lead to compatibility issues.
>>>>>>
>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>>>
>>>>> OK.
>>>>>
>>>>>> Maybe we need to get a new generic interface to request the semantics
>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of 
>>>>>> FADV_*
>>>>>> handles to make kernel allocate PMD-size folio on any allocation 
>>>>>> or on
>>>>>> allocations within i_size. I think this behaviour is useful beyond 
>>>>>> tmpfs.
>>>>>>
>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs 
>>>>>> compatible
>>>>>> with current deployments and less special comparing to rest of
>>>>>> filesystems on kernel side.
>>>>>
>>>>> I did a quick search, and I didn't find any other fs that require 
>>>>> PMD-sized
>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems 
>>>>> other than
>>>>> tmpfs. Please correct me if I missed something.
>>>>
>>>> What do you mean by "require"? THPs are always opportunistic.
>>>>
>>>> IIUC, we don't have a way to hint kernel to use huge pages for a 
>>>> file on
>>>> read from backing storage. Readahead is not always the right way.
>>>>
>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>>>> filesystems.
>>>>>
>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still 
>>>>> allocate large
>>>>> folios based on the write size? If yes, that means it will change the
>>>>> default huge behavior for tmpfs. Because previously having 'huge=' 
>>>>> is not
>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar 
>>>>> to what I
>>>>> mentioned:
>>>>> "Another possible choice is to make the huge pages allocation based 
>>>>> on write
>>>>> size as the *default* behavior for tmpfs, ..."
>>>>
>>>> I am more worried about breaking existing users of huge pages. So 
>>>> changing
>>>> behaviour of users who don't specify huge is okay to me.
>>>
>>> I think moving tmpfs to allocate large folios opportunistically by
>>> default (as it was proposed initially) doesn't necessary conflict with
>>> the default behaviour (huge=never). We just need to clarify that in
>>> the documentation.
>>>
>>> However, and IIRC, one of the requests from Hugh was to have a way to
>>> disable large folios which is something other FS do not have control
>>> of as of today. Ryan sent a proposal to actually control that globally
>>> but I think it didn't move forward. So, what are we missing to go back
>>> to implement large folios in tmpfs in the default case, as any other fs
>>> leveraging large folios?
>>
>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>> with the 'huge=' mount option. This means that if 'huge=never' is set
>> for tmpfs, huge page allocation will still be prohibited (which can
>> address Hugh's request?). However, if 'huge=' is not set, we can
>> allocate large folios based on the write size.
> 
> I consider allocating large folios in shmem/tmpfs on the write path less 
> controversial than allocating them on the page fault path -- especially 
> as long as we stay within the size to-be-written.
> 
> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g., 
> shmem_enabled=never). Maybe because of some rather undesired 
> side-effects (maybe some are historical?): I recall issues with VMs with 
> THP+ memory ballooning, as we cannot reclaim pages of folios if 
> splitting fails). I assume most of these problematic use cases don't use 
> tmpfs as an ordinary file system (write()/read()), but mmap() the whole 
> thing.
> 
> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL 
> documentation; most documentation is only concerned about anon THP. 
> Which makes me conclude that they are not suggested as of now.
> 
> I see more issues with allocating them on the page fault path and not 
> having a way to disable it -- compared to allocating them on the write() 
> path.

I may not understand your issues. IIUC, you can disable allocating huge 
pages on the page fault path by using the 'huge=never' mount option or 
setting shmem_enabled=deny. No?
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by David Hildenbrand 1 month ago
On 23.10.24 10:04, Baolin Wang wrote:
> 
> 
> On 2024/10/22 23:31, David Hildenbrand wrote:
>> On 22.10.24 05:41, Baolin Wang wrote:
>>>
>>>
>>> On 2024/10/21 21:34, Daniel Gomez wrote:
>>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>>>>> + Kirill
>>>>>>>>
>>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
>>>>>>>>>> control the THP
>>>>>>>>>> allocation, it is necessary to maintain compatibility with the
>>>>>>>>>> 'huge='
>>>>>>>>>> option, as well as considering the 'deny' and 'force' option
>>>>>>>>>> controlled
>>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>
>>>>>>>>> No, it's not.  No other filesystem honours these settings.
>>>>>>>>> tmpfs would
>>>>>>>>> not have had these settings if it were written today.  It should
>>>>>>>>> simply
>>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option
>>>>>>>>> now that
>>>>>>>>> we have a better solution to the original problem.
>>>>>>>>>
>>>>>>>>> To reiterate my position:
>>>>>>>>>
>>>>>>>>>       - When using tmpfs as a filesystem, it should behave like
>>>>>>>>> other
>>>>>>>>>         filesystems.
>>>>>>>>>       - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
>>>>>>>>> it should
>>>>>>>>>         behave like anonymous memory.
>>>>>>>>
>>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option
>>>>>>>> has
>>>>>>>> existed for nearly 8 years, and the huge orders based on write
>>>>>>>> size may not
>>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such
>>>>>>>> as when the
>>>>>>>> write length is consistently 4K. So, I am still concerned that
>>>>>>>> ignoring the
>>>>>>>> 'huge' option could lead to compatibility issues.
>>>>>>>
>>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>>>>
>>>>>> OK.
>>>>>>
>>>>>>> Maybe we need to get a new generic interface to request the semantics
>>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of
>>>>>>> FADV_*
>>>>>>> handles to make kernel allocate PMD-size folio on any allocation
>>>>>>> or on
>>>>>>> allocations within i_size. I think this behaviour is useful beyond
>>>>>>> tmpfs.
>>>>>>>
>>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs
>>>>>>> compatible
>>>>>>> with current deployments and less special comparing to rest of
>>>>>>> filesystems on kernel side.
>>>>>>
>>>>>> I did a quick search, and I didn't find any other fs that require
>>>>>> PMD-sized
>>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems
>>>>>> other than
>>>>>> tmpfs. Please correct me if I missed something.
>>>>>
>>>>> What do you mean by "require"? THPs are always opportunistic.
>>>>>
>>>>> IIUC, we don't have a way to hint kernel to use huge pages for a
>>>>> file on
>>>>> read from backing storage. Readahead is not always the right way.
>>>>>
>>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>>>>> filesystems.
>>>>>>
>>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
>>>>>> allocate large
>>>>>> folios based on the write size? If yes, that means it will change the
>>>>>> default huge behavior for tmpfs. Because previously having 'huge='
>>>>>> is not
>>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar
>>>>>> to what I
>>>>>> mentioned:
>>>>>> "Another possible choice is to make the huge pages allocation based
>>>>>> on write
>>>>>> size as the *default* behavior for tmpfs, ..."
>>>>>
>>>>> I am more worried about breaking existing users of huge pages. So
>>>>> changing
>>>>> behaviour of users who don't specify huge is okay to me.
>>>>
>>>> I think moving tmpfs to allocate large folios opportunistically by
>>>> default (as it was proposed initially) doesn't necessary conflict with
>>>> the default behaviour (huge=never). We just need to clarify that in
>>>> the documentation.
>>>>
>>>> However, and IIRC, one of the requests from Hugh was to have a way to
>>>> disable large folios which is something other FS do not have control
>>>> of as of today. Ryan sent a proposal to actually control that globally
>>>> but I think it didn't move forward. So, what are we missing to go back
>>>> to implement large folios in tmpfs in the default case, as any other fs
>>>> leveraging large folios?
>>>
>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>> for tmpfs, huge page allocation will still be prohibited (which can
>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>> allocate large folios based on the write size.
>>
>> I consider allocating large folios in shmem/tmpfs on the write path less
>> controversial than allocating them on the page fault path -- especially
>> as long as we stay within the size to-be-written.
>>
>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
>> shmem_enabled=never). Maybe because of some rather undesired
>> side-effects (maybe some are historical?): I recall issues with VMs with
>> THP+ memory ballooning, as we cannot reclaim pages of folios if
>> splitting fails). I assume most of these problematic use cases don't use
>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
>> thing.
>>
>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
>> documentation; most documentation is only concerned about anon THP.
>> Which makes me conclude that they are not suggested as of now.
>>
>> I see more issues with allocating them on the page fault path and not
>> having a way to disable it -- compared to allocating them on the write()
>> path.
> 
> I may not understand your issues. IIUC, you can disable allocating huge
> pages on the page fault path by using the 'huge=never' mount option or
> setting shmem_enabled=deny. No?

That's what I am saying: if there is some way to disable it that will 
keep working, great.

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Daniel Gomez 1 month ago
On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote:
> On 23.10.24 10:04, Baolin Wang wrote:
> > 
> > 
> > On 2024/10/22 23:31, David Hildenbrand wrote:
> >> On 22.10.24 05:41, Baolin Wang wrote:
> >>>
> >>>
> >>> On 2024/10/21 21:34, Daniel Gomez wrote:
> >>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
> >>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
> >>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
> >>>>>>>> + Kirill
> >>>>>>>>
> >>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
> >>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> >>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
> >>>>>>>>>> control the THP
> >>>>>>>>>> allocation, it is necessary to maintain compatibility with the
> >>>>>>>>>> 'huge='
> >>>>>>>>>> option, as well as considering the 'deny' and 'force' option
> >>>>>>>>>> controlled
> >>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> >>>>>>>>>
> >>>>>>>>> No, it's not.  No other filesystem honours these settings.
> >>>>>>>>> tmpfs would
> >>>>>>>>> not have had these settings if it were written today.  It should
> >>>>>>>>> simply
> >>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option
> >>>>>>>>> now that
> >>>>>>>>> we have a better solution to the original problem.
> >>>>>>>>>
> >>>>>>>>> To reiterate my position:
> >>>>>>>>>
> >>>>>>>>>       - When using tmpfs as a filesystem, it should behave like
> >>>>>>>>> other
> >>>>>>>>>         filesystems.
> >>>>>>>>>       - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
> >>>>>>>>> it should
> >>>>>>>>>         behave like anonymous memory.
> >>>>>>>>
> >>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option
> >>>>>>>> has
> >>>>>>>> existed for nearly 8 years, and the huge orders based on write
> >>>>>>>> size may not
> >>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such
> >>>>>>>> as when the
> >>>>>>>> write length is consistently 4K. So, I am still concerned that
> >>>>>>>> ignoring the
> >>>>>>>> 'huge' option could lead to compatibility issues.
> >>>>>>>
> >>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
> >>>>>>
> >>>>>> OK.
> >>>>>>
> >>>>>>> Maybe we need to get a new generic interface to request the semantics
> >>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of
> >>>>>>> FADV_*
> >>>>>>> handles to make kernel allocate PMD-size folio on any allocation
> >>>>>>> or on
> >>>>>>> allocations within i_size. I think this behaviour is useful beyond
> >>>>>>> tmpfs.
> >>>>>>>
> >>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
> >>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs
> >>>>>>> compatible
> >>>>>>> with current deployments and less special comparing to rest of
> >>>>>>> filesystems on kernel side.
> >>>>>>
> >>>>>> I did a quick search, and I didn't find any other fs that require
> >>>>>> PMD-sized
> >>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems
> >>>>>> other than
> >>>>>> tmpfs. Please correct me if I missed something.
> >>>>>
> >>>>> What do you mean by "require"? THPs are always opportunistic.
> >>>>>
> >>>>> IIUC, we don't have a way to hint kernel to use huge pages for a
> >>>>> file on
> >>>>> read from backing storage. Readahead is not always the right way.
> >>>>>
> >>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
> >>>>>>> filesystems.
> >>>>>>
> >>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
> >>>>>> allocate large
> >>>>>> folios based on the write size? If yes, that means it will change the
> >>>>>> default huge behavior for tmpfs. Because previously having 'huge='
> >>>>>> is not
> >>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar
> >>>>>> to what I
> >>>>>> mentioned:
> >>>>>> "Another possible choice is to make the huge pages allocation based
> >>>>>> on write
> >>>>>> size as the *default* behavior for tmpfs, ..."
> >>>>>
> >>>>> I am more worried about breaking existing users of huge pages. So
> >>>>> changing
> >>>>> behaviour of users who don't specify huge is okay to me.
> >>>>
> >>>> I think moving tmpfs to allocate large folios opportunistically by
> >>>> default (as it was proposed initially) doesn't necessary conflict with
> >>>> the default behaviour (huge=never). We just need to clarify that in
> >>>> the documentation.
> >>>>
> >>>> However, and IIRC, one of the requests from Hugh was to have a way to
> >>>> disable large folios which is something other FS do not have control
> >>>> of as of today. Ryan sent a proposal to actually control that globally
> >>>> but I think it didn't move forward. So, what are we missing to go back
> >>>> to implement large folios in tmpfs in the default case, as any other fs
> >>>> leveraging large folios?
> >>>
> >>> IMHO, as I discussed with Kirill, we still need maintain compatibility
> >>> with the 'huge=' mount option. This means that if 'huge=never' is set
> >>> for tmpfs, huge page allocation will still be prohibited (which can
> >>> address Hugh's request?). However, if 'huge=' is not set, we can
> >>> allocate large folios based on the write size.

So, in order to make tmpfs behave like other filesystems, we need to
allocate large folios by default. Not setting 'huge=' is the same as
setting it to 'huge=never' as per documentation. But 'huge=' is meant to
control THP, not large folios, so it should not have a conflict here, or
else, what case are you thinking of? Perhaps when the large folio order
is the same as PMD size?

> >>
> >> I consider allocating large folios in shmem/tmpfs on the write path less
> >> controversial than allocating them on the page fault path -- especially
> >> as long as we stay within the size to-be-written.
> >>
> >> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
> >> shmem_enabled=never). Maybe because of some rather undesired
> >> side-effects (maybe some are historical?): I recall issues with VMs with
> >> THP+ memory ballooning, as we cannot reclaim pages of folios if
> >> splitting fails). I assume most of these problematic use cases don't use
> >> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
> >> thing.
> >>
> >> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
> >> documentation; most documentation is only concerned about anon THP.
> >> Which makes me conclude that they are not suggested as of now.
> >>
> >> I see more issues with allocating them on the page fault path and not
> >> having a way to disable it -- compared to allocating them on the write()
> >> path.
> > 
> > I may not understand your issues. IIUC, you can disable allocating huge
> > pages on the page fault path by using the 'huge=never' mount option or
> > setting shmem_enabled=deny. No?
>
> That's what I am saying: if there is some way to disable it that will 
> keep working, great.

I agree. That aligns with what I recall Hugh requested. However, I
believe if that is the way to go, we shouldn't limit it to tmpfs.
Otherwise, why should tmpfs be prevented from allocating large folios if
other filesystems in the system are allowed to allocate them? I think if
we want to disable large folios, we should make it more generic,
something similar to Ryan's proposal [1] for controlling folio sizes.

[1] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/

That said, there has already been disagreement on this point here [2].

[2] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by David Hildenbrand 4 weeks, 1 day ago
Sorry for the late reply!

>>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>> allocate large folios based on the write size.
> 
> So, in order to make tmpfs behave like other filesystems, we need to
> allocate large folios by default. Not setting 'huge=' is the same as
> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> control THP, not large folios, so it should not have a conflict here, or
> else, what case are you thinking?

I think we really have to move away from "huge/thp == PMD", that's a 
historical artifact. Everything else will simply be inconsistent and 
confusing in the future -- and I don't see any real need for that. For 
anonymous memory and anon shmem we managed the transition. (there is a 
longer writeup from me about this topic, so I won't go into detail).


I think I raised this in the past, but tmpfs/shmem is just like any 
other file system ... except it sometimes really isn't, and behaves much 
more like (swappable) anonymous memory (or mlocked files).

There are many systems out there that run without swap enabled, or with 
extremely minimal swap (IIRC until recently kubernetes was completely 
incompatible with swapping). Swap can even be disabled today for shmem 
using a mount option.

That's a big difference to all other file systems where you are 
guaranteed to have backend storage where you can simply evict under 
memory pressure (might temporarily fail, of course).

I *think* that's the reason why we have the "huge=" parameter that also 
controls the THP allocations during page faults (IOW possible memory 
over-allocation). Maybe also because it was a new feature, and we only 
had a single THP size.

There is, of course, also the "fallocate() might not free up memory if 
there is an unexpected reference on the page because splitting it will 
fail" problem, which exists even when not over-allocating memory in the 
first place ...


So ... I don't think tmpfs behaves like other file systems in some cases. 
And I don't think ignoring these points is a good idea.

Fortunately I don't maintain that code :)


If we don't want to go with the shmem_enabled toggles, we should 
probably still extend the documentation to cover "all THP sizes", like 
we did elsewhere.

huge=never: no THPs of any size
huge=always: THPs of any size (fault/write/etc)
huge=fadvise: like "always" but only with fadvise/madvise
huge=within_size: like "fadvise" but respect i_size

We could think about adding a "nowaste" extension and try make it the 
default.

For example

"huge=always:nowaste: THPs of any size as long as we don't over-allocate 
memory (write)"

The sysfs toggles have their beauty as well and could be useful (I'm 
pretty sure they will be useful :) ):

"huge=always;sysfs": THPs of any size (fault/write/etc) as configured in 
sysfs.

Too many options here to explore, too little time I have to spend on 
this. Just to throw out some ideas.

What I can really suggest is not making this one of the remaining 
interfaces where "huge" means "PMD-sized" once other sizes exist.

> 
>>>>
>>>> I consider allocating large folios in shmem/tmpfs on the write path less
>>>> controversial than allocating them on the page fault path -- especially
>>>> as long as we stay within the size to-be-written.
>>>>
>>>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
>>>> shmem_enabled=never). Maybe because of some rather undesired
>>>> side-effects (maybe some are historical?): I recall issues with VMs with
>>>> THP+ memory ballooning, as we cannot reclaim pages of folios if
>>>> splitting fails). I assume most of these problematic use cases don't use
>>>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
>>>> thing.
>>>>
>>>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
>>>> documentation; most documentation is only concerned about anon THP.
>>>> Which makes me conclude that they are not suggested as of now.
>>>>
>>>> I see more issues with allocating them on the page fault path and not
>>>> having a way to disable it -- compared to allocating them on the write()
>>>> path.
>>>
>>> I may not understand your issues. IIUC, you can disable allocating huge
>>> pages on the page fault path by using the 'huge=never' mount option or
>>> setting shmem_enabled=deny. No?
>>
>> That's what I am saying: if there is some way to disable it that will
>> keep working, great.
> 
> I agree. That aligns with what I recall Hugh requested. However, I
> believe if that is the way to go, we shouldn't limit it to tmpfs.
> Otherwise, why should tmpfs be prevented from allocating large folios if
> other filesystems in the system are allowed to allocate them?

See above. On systems with little or no swap you might not want them for 
shmem/tmpfs, but would happily use them elsewhere.

The "write() won't waste memory" case is really interesting, the 
"fallocate cannot free the memory" still exists. A shrinker might help.

-- 
Cheers,

David / dhildenb
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Daniel Gomez 3 weeks, 5 days ago
On Fri Oct 25, 2024 at 10:21 PM CEST, David Hildenbrand wrote:
> Sorry for the late reply!
>
> >>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
> >>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
> >>>>> for tmpfs, huge page allocation will still be prohibited (which can
> >>>>> address Hugh's request?). However, if 'huge=' is not set, we can
> >>>>> allocate large folios based on the write size.
> > 
> > So, in order to make tmpfs behave like other filesystems, we need to
> > allocate large folios by default. Not setting 'huge=' is the same as
> > setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> > control THP, not large folios, so it should not have a conflict here, or
> > else, what case are you thinking?
>
> I think we really have to move away from "huge/thp == PMD", that's a 
> historical artifact. Everything else will simply be inconsistent and 
> confusing in the future -- and I don't see any real need for that. For 
> anonymous memory and anon shmem we managed the transition. (there is a 
> longer writeup from me about this topic, so I won't go into detail).
>
>
> I think I raised this in the past, but tmpfs/shmem is just like any 
> other file system .. except it sometimes really isn't and behaves much 
> more like (swappable) anonymous memory. (or mlocked files)
>
> There are many systems out there that run without swap enabled, or with 
> extremely minimal swap (IIRC until recently kubernetes was completely 
> incompatible with swapping). Swap can even be disabled today for shmem 
> using a mount option.
>
> That's a big difference to all other file systems where you are 
> guaranteed to have backend storage where you can simply evict under 
> memory pressure (might temporarily fail, of course).
>
> I *think* that's the reason why we have the "huge=" parameter that also 
> controls the THP allocations during page faults (IOW possible memory 
> over-allocation). Maybe also because it was a new feature, and we only 
> had a single THP size.
>
> There is, of course also the "fallocate() might not free up memory if 
> there is an unexpected reference on the page because splitting it will 
> fail" problem, that even exists when not over-allocating memory in the 
> first place ...
>
>
> So ...I don't think tmpfs behaves like other file system in some cases. 
> And I don't think ignoring these points is a good idea.

Assuming a system without swap, what's the difference you are concerned
about between using the current tmpfs allocation method and the large
folios implementation?

>
> Fortunately I don't maintain that code :)
>
>
> If we don't want to go with the shmem_enabled toggles, we should 
> probably still extend the documentation to cover "all THP sizes", like 
> we did elsewhere.
>
> huge=never: no THPs of any size
> huge=always: THPs of any size (fault/write/etc)
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size
>
> We could think about adding a "nowaste" extension and try make it the 
> default.
>
> For example
>
> "huge=always:nowaste: THPs of any size as long as we don't over-allocate 
> memory (write)"

This is the default behaviour in other fs too. I don't think it is
necessary to make it explicit.

>
> The sysfs toggles have their beauty as well and could be useful (I'm 
> pretty sure they will be useful :) ):
>
> "huge=always;sysfs": THPs of any size (fault/write/etc) as configured in 
> sysfs.
>
> Too many options here to explore, too little time I have to spend on 
> this. Just to throw out some ideas.
>
> What I can really suggest is not making this one of the remaining 
> interfaces where "huge" means "PMD-sized" once other sizes exist.
>
> > 
> >>>>
> >>>> I consider allocating large folios in shmem/tmpfs on the write path less
> >>>> controversial than allocating them on the page fault path -- especially
> >>>> as long as we stay within the size to-be-written.
> >>>>
> >>>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
> >>>> shmem_enabled=never). Maybe because of some rather undesired
> >>>> side-effects (maybe some are historical?): I recall issues with VMs with
> >>>> THP+ memory ballooning, as we cannot reclaim pages of folios if
> >>>> splitting fails). I assume most of these problematic use cases don't use
> >>>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
> >>>> thing.
> >>>>
> >>>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
> >>>> documentation; most documentation is only concerned about anon THP.
> >>>> Which makes me conclude that they are not suggested as of now.
> >>>>
> >>>> I see more issues with allocating them on the page fault path and not
> >>>> having a way to disable it -- compared to allocating them on the write()
> >>>> path.
> >>>
> >>> I may not understand your issues. IIUC, you can disable allocating huge
> >>> pages on the page fault path by using the 'huge=never' mount option or
> >>> setting shmem_enabled=deny. No?
> >>
> >> That's what I am saying: if there is some way to disable it that will
> >> keep working, great.
> > 
> > I agree. That aligns with what I recall Hugh requested. However, I
> > believe if that is the way to go, we shouldn't limit it to tmpfs.
> > Otherwise, why should tmpfs be prevented from allocating large folios if
> > other filesystems in the system are allowed to allocate them?
>
> See above. On systems without/little swap you might not want them for 
> shmem/tmpfs, but would happily use them elsewhere.
>
> The "write() won't waste memory" case is really interesting, the 
> "fallocate cannot free the memory" still exists. A shrinker might help.

The previous large folio allocation implementation was wrong: it was
actually wasting memory by rounding up while trying to find the order.
Matthew already pointed it out [1]. So, with that fixed, we should not
end up wasting memory.

[1] https://lore.kernel.org/all/ZvVQoY8Tn_BNc79T@casper.infradead.org/
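
To illustrate the point (a sketch only, not the actual helper from the
series): the order has to be derived by rounding the write size down, so
the allocation never goes past what is written.

static unsigned int order_for_write(size_t size)
{
	unsigned int order;

	/* Sketch only: below two pages there is nothing to gain */
	if (size < 2 * PAGE_SIZE)
		return 0;

	/*
	 * ilog2() rounds down: a 5-page (20K) write maps to order 2
	 * (4 pages), whereas rounding up via get_order() would pick
	 * order 3 (8 pages) and waste 3 pages.
	 */
	order = ilog2(size >> PAGE_SHIFT);
	return min_t(unsigned int, order, MAX_PAGECACHE_ORDER);
}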
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by David Hildenbrand 3 weeks, 4 days ago
On 28.10.24 22:56, Daniel Gomez wrote:
> On Fri Oct 25, 2024 at 10:21 PM CEST, David Hildenbrand wrote:
>> Sorry for the late reply!
>>
>>>>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>>>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>>>> allocate large folios based on the write size.
>>>
>>> So, in order to make tmpfs behave like other filesystems, we need to
>>> allocate large folios by default. Not setting 'huge=' is the same as
>>> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
>>> control THP, not large folios, so it should not have a conflict here, or
>>> else, what case are you thinking?
>>
>> I think we really have to move away from "huge/thp == PMD", that's a
>> historical artifact. Everything else will simply be inconsistent and
>> confusing in the future -- and I don't see any real need for that. For
>> anonymous memory and anon shmem we managed the transition. (there is a
>> longer writeup from me about this topic, so I won't go into detail).
>>
>>
>> I think I raised this in the past, but tmpfs/shmem is just like any
>> other file system .. except it sometimes really isn't and behaves much
>> more like (swappable) anonymous memory. (or mlocked files)
>>
>> There are many systems out there that run without swap enabled, or with
>> extremely minimal swap (IIRC until recently kubernetes was completely
>> incompatible with swapping). Swap can even be disabled today for shmem
>> using a mount option.
>>
>> That's a big difference to all other file systems where you are
>> guaranteed to have backend storage where you can simply evict under
>> memory pressure (might temporarily fail, of course).
>>
>> I *think* that's the reason why we have the "huge=" parameter that also
>> controls the THP allocations during page faults (IOW possible memory
>> over-allocation). Maybe also because it was a new feature, and we only
>> had a single THP size.
>>
>> There is, of course also the "fallocate() might not free up memory if
>> there is an unexpected reference on the page because splitting it will
>> fail" problem, that even exists when not over-allocating memory in the
>> first place ...
>>
>>
>> So ...I don't think tmpfs behaves like other file system in some cases.
>> And I don't think ignoring these points is a good idea.
> 
> Assuming a system without swap, what's the difference you are concern
> about between using the current tmpfs allocation method vs large folios
> implementation?

As raised above, there is the interesting interaction between 
fallocate(FALLOC_FL_PUNCH_HOLE) and raised refcounts, where we can fail 
to reclaim memory.

shmem_fallocate()->shmem_truncate_range()->truncate_inode_pages_range()->truncate_inode_partial_folio().

It's better than it was in the past -- back then we didn't even try 
splitting, but today splitting can still fail and we'll never try 
reclaiming that memory again later. This is very different from anonymous 
memory, where we have the deferred split queue and remember which pages 
were zapped implicitly via the page tables (instead of zeroing them out 
and never freeing the memory).
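
Roughly (a simplified sketch, not the actual
truncate_inode_partial_folio()):

/*
 * Simplified sketch of punching a hole that covers only part of a
 * large folio; not the actual kernel code.
 */
static bool punch_partial_folio(struct folio *folio, loff_t start, loff_t end)
{
	size_t offset = offset_in_folio(folio, start);
	size_t length = min_t(loff_t, end,
			      folio_pos(folio) + folio_size(folio)) - start;

	/* The hole must read back as zeroes ... */
	folio_zero_range(folio, offset, length);

	/*
	 * ... but the pages are only freed if the split succeeds. With an
	 * unexpected reference (GUP pin, balloon, ...) split_folio() fails,
	 * there is no deferred-split retry as for anon memory, and the
	 * zeroed pages stay allocated.
	 */
	if (split_folio(folio))
		return false;

	return true;
}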

It's one of the issues people ran into when using THP+shmem for backing 
guest VMs along with memory ballooning. For that reason, the 
recommendation still is to disable THP when using shmem for backing 
guest VMs and relying on memory overcommit optimizations such as memory 
balloon inflation.

> 
>>
>> Fortunately I don't maintain that code :)
>>
>>
>> If we don't want to go with the shmem_enabled toggles, we should
>> probably still extend the documentation to cover "all THP sizes", like
>> we did elsewhere.
>>
>> huge=never: no THPs of any size
>> huge=always: THPs of any size (fault/write/etc)
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
>>
>> We could think about adding a "nowaste" extension and try make it the
>> default.
>>
>> For example
>>
>> "huge=always:nowaste: THPs of any size as long as we don't over-allocate
>> memory (write)"
> 
> This is the default behaviour in other fs too. I don't think is
> necessary to make it explicit.

Please keep in mind that allocating THPs of different sizes during *page 
faults* will have to fit into the whole picture we are creating here.

That's also what "huge=always" controls for shmem today IIRC.

>>>
>>>>>>
>>>>>> I consider allocating large folios in shmem/tmpfs on the write path less
>>>>>> controversial than allocating them on the page fault path -- especially
>>>>>> as long as we stay within the size to-be-written.
>>>>>>
>>>>>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
>>>>>> shmem_enabled=never). Maybe because of some rather undesired
>>>>>> side-effects (maybe some are historical?): I recall issues with VMs with
>>>>>> THP+ memory ballooning, as we cannot reclaim pages of folios if
>>>>>> splitting fails). I assume most of these problematic use cases don't use
>>>>>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
>>>>>> thing.
>>>>>>
>>>>>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
>>>>>> documentation; most documentation is only concerned about anon THP.
>>>>>> Which makes me conclude that they are not suggested as of now.
>>>>>>
>>>>>> I see more issues with allocating them on the page fault path and not
>>>>>> having a way to disable it -- compared to allocating them on the write()
>>>>>> path.
>>>>>
>>>>> I may not understand your issues. IIUC, you can disable allocating huge
>>>>> pages on the page fault path by using the 'huge=never' mount option or
>>>>> setting shmem_enabled=deny. No?
>>>>
>>>> That's what I am saying: if there is some way to disable it that will
>>>> keep working, great.
>>>
>>> I agree. That aligns with what I recall Hugh requested. However, I
>>> believe if that is the way to go, we shouldn't limit it to tmpfs.
>>> Otherwise, why should tmpfs be prevented from allocating large folios if
>>> other filesystems in the system are allowed to allocate them?
>>
>> See above. On systems without/little swap you might not want them for
>> shmem/tmpfs, but would happily use them elsewhere.
>>
>> The "write() won't waste memory" case is really interesting, the
>> "fallocate cannot free the memory" still exists. A shrinker might help.
> 
> The previous implementation with large folios allocation was wrong
> and was actually wasting memory by rounding up while trying to find
> the order. Matthew already pointed it out [1]. So, with that fixed, we
> should not end up wasting memory.

Again, we should have a clear path forward for how we deal with page 
faults and how this fits into the picture.

-- 
Cheers,

David / dhildenb
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by David Hildenbrand 3 weeks, 5 days ago
On 25.10.24 22:21, David Hildenbrand wrote:
> Sorry for the late reply!
> 
>>>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>>> allocate large folios based on the write size.
>>
>> So, in order to make tmpfs behave like other filesystems, we need to
>> allocate large folios by default. Not setting 'huge=' is the same as
>> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
>> control THP, not large folios, so it should not have a conflict here, or
>> else, what case are you thinking?
> 
> I think we really have to move away from "huge/thp == PMD", that's a
> historical artifact. Everything else will simply be inconsistent and
> confusing in the future -- and I don't see any real need for that. For
> anonymous memory and anon shmem we managed the transition. (there is a
> longer writeup from me about this topic, so I won't go into detail).
> 
> 
> I think I raised this in the past, but tmpfs/shmem is just like any
> other file system .. except it sometimes really isn't and behaves much
> more like (swappable) anonymous memory. (or mlocked files)
> 
> There are many systems out there that run without swap enabled, or with
> extremely minimal swap (IIRC until recently kubernetes was completely
> incompatible with swapping). Swap can even be disabled today for shmem
> using a mount option.
> 
> That's a big difference to all other file systems where you are
> guaranteed to have backend storage where you can simply evict under
> memory pressure (might temporarily fail, of course).
> 
> I *think* that's the reason why we have the "huge=" parameter that also
> controls the THP allocations during page faults (IOW possible memory
> over-allocation). Maybe also because it was a new feature, and we only
> had a single THP size.
> 
> There is, of course also the "fallocate() might not free up memory if
> there is an unexpected reference on the page because splitting it will
> fail" problem, that even exists when not over-allocating memory in the
> first place ...
> 
> 
> So ...I don't think tmpfs behaves like other file system in some cases.
> And I don't think ignoring these points is a good idea.
> 
> Fortunately I don't maintain that code :)
> 
> 
> If we don't want to go with the shmem_enabled toggles, we should
> probably still extend the documentation to cover "all THP sizes", like
> we did elsewhere.
> 
> huge=never: no THPs of any size
> huge=always: THPs of any size (fault/write/etc)
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size

Thinking some more about that over the weekend, this is likely the way 
to go, paired with conditionally changing the default to 
always/within_size. I suggest a kconfig option for that.

That should probably do as a first shot; I assume people will want more 
control over which size to use, especially during page faults, but that 
can likely be added later.

-- 
Cheers,

David / dhildenb
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 3 weeks, 2 days ago
Sorry for the late reply.

On 2024/10/28 17:48, David Hildenbrand wrote:
> On 25.10.24 22:21, David Hildenbrand wrote:
>> Sorry for the late reply!
>>
>>>>>>> IMHO, as I discussed with Kirill, we still need maintain 
>>>>>>> compatibility
>>>>>>> with the 'huge=' mount option. This means that if 'huge=never' is 
>>>>>>> set
>>>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>>>> allocate large folios based on the write size.
>>>
>>> So, in order to make tmpfs behave like other filesystems, we need to
>>> allocate large folios by default. Not setting 'huge=' is the same as
>>> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
>>> control THP, not large folios, so it should not have a conflict here, or
>>> else, what case are you thinking?
>>
>> I think we really have to move away from "huge/thp == PMD", that's a
>> historical artifact. Everything else will simply be inconsistent and
>> confusing in the future -- and I don't see any real need for that. For
>> anonymous memory and anon shmem we managed the transition. (there is a
>> longer writeup from me about this topic, so I won't go into detail).
>>
>>
>> I think I raised this in the past, but tmpfs/shmem is just like any
>> other file system .. except it sometimes really isn't and behaves much
>> more like (swappable) anonymous memory. (or mlocked files)
>>
>> There are many systems out there that run without swap enabled, or with
>> extremely minimal swap (IIRC until recently kubernetes was completely
>> incompatible with swapping). Swap can even be disabled today for shmem
>> using a mount option.
>>
>> That's a big difference to all other file systems where you are
>> guaranteed to have backend storage where you can simply evict under
>> memory pressure (might temporarily fail, of course).
>>
>> I *think* that's the reason why we have the "huge=" parameter that also
>> controls the THP allocations during page faults (IOW possible memory
>> over-allocation). Maybe also because it was a new feature, and we only
>> had a single THP size.
>>
>> There is, of course also the "fallocate() might not free up memory if
>> there is an unexpected reference on the page because splitting it will
>> fail" problem, that even exists when not over-allocating memory in the
>> first place ...
>>
>>
>> So ...I don't think tmpfs behaves like other file system in some cases.
>> And I don't think ignoring these points is a good idea.
>>
>> Fortunately I don't maintain that code :)
>>
>>
>> If we don't want to go with the shmem_enabled toggles, we should
>> probably still extend the documentation to cover "all THP sizes", like
>> we did elsewhere.
>>
>> huge=never: no THPs of any size
>> huge=always: THPs of any size (fault/write/etc)
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
> 
> Thinking some more about that over the weekend, this is likely the way 
> to go, paired with conditionally changing the default to 
> always/within_size. I suggest a kconfig option for that.

I am still worried about adding a new kconfig option, which might 
complicate the tmpfs controls further.

> That should probably do as a first shot; I assume people will want more 
> control over which size to use, especially during page faults, but that 
> can likely be added later.

After some discussions, I think the first step is to achieve two goals: 
1) Try to make tmpfs use large folios like other file systems, which 
means we should avoid adding more complex control options (per Matthew).
2) Still maintain compatibility with the 'huge=' mount option (per 
Kirill), as I also remember we have customers who use 
'huge=within_size' to allocate THPs for better performance.

Based on these considerations, my first step is to neither add a new 
'huge=' option parameter nor introduce the mTHP interface controls for 
tmpfs, but rather to change the default huge allocation behavior for 
tmpfs. That is to say, when the 'huge=' option is not configured, we will 
allow huge folio allocation based on the write size. As a result, the 
behavior of huge pages for tmpfs will change as follows (a rough sketch 
in code follows the list):

no 'huge=' set: can allocate huge folios of any size based on write size
huge=never: no huge folios of any size
huge=always: only PMD-sized THP allocation as before
huge=fadvise: like "always" but only with fadvise/madvise
huge=within_size: like "fadvise" but respect i_size

The next step is to continue discussing whether to add a new Kconfig 
option or FADV_* in the future.

So what do you think?
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by David Hildenbrand 3 weeks, 2 days ago
>>>
>>> If we don't want to go with the shmem_enabled toggles, we should
>>> probably still extend the documentation to cover "all THP sizes", like
>>> we did elsewhere.
>>>
>>> huge=never: no THPs of any size
>>> huge=always: THPs of any size (fault/write/etc)
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>
>> Thinking some more about that over the weekend, this is likely the way
>> to go, paired with conditionally changing the default to
>> always/within_size. I suggest a kconfig option for that.
> 
> I am still worried about adding a new kconfig option, which might
> complicate the tmpfs controls further.

Why exactly?

If we are changing a default similar to 
CONFIG_TRANSPARENT_HUGEPAGE_NEVER -> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS, 
it would make perfect sense to give people building a kernel control 
over that.

If we want to support this feature in a distro kernel like RHEL we'll 
have to leave the default unmodified. Otherwise I see no way (excluding 
downstream-only hacks) to backport this into distro kernels.

> 
>> That should probably do as a first shot; I assume people will want more
>> control over which size to use, especially during page faults, but that
>> can likely be added later.

I know, it puts you in a bad position because there are different 
opinions floating around. But let's try to find something that is 
reasonable and still acceptable. And let's hope that Hugh will voice an 
opinion :D

> 
> After some discussions, I think the first step is to achieve two goals:
> 1) Try to make tmpfs use large folios like other file systems, that
> means we should avoid adding more complex control options (per Matthew).
> 2) Still need maintain compatibility with the 'huge=' mount option (per
> Kirill), as I also remembered we have customers who use
> 'huge=within_size' to allocate THPs for better performance.

> 
> Based on these considerations, my first step is to neither add a new
> 'huge=' option parameter nor introduce the mTHP interfaces control for
> tmpfs, but rather to change the default huge allocation behavior for
> tmpfs. That is to say, when 'huge=' option is not configured, we will
> allow the huge folios allocation based on the write size. As a result,
> the behavior of huge pages for tmpfs will change as follows:
> no 'huge=' set: can allocate any size huge folios based on write size
> huge=never: no any size huge folios
> huge=always: only PMD sized THP allocation as before
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size

I don't like that:

(a) there is no way to explicitly enable/name that new behavior.
(b) "always" etc. are only concerned about PMDs.


So again, I suggest:

huge=never: No THPs of any size
huge=always: THPs of any size
huge=fadvise: like "always" but only with fadvise/madvise 
huge=within_size: like "fadvise" but respect i_size

"huge=" default depends on a Kconfig option.

With that we:

(1) Maximize the cases where we will use large folios of any sizes
     (which Willy cares about).
(2) Have a way to disable them completely (which I care about).
(3) Allow distros to keep the default unchanged.

Likely, for now we will only try allocating PMD-sized THPs during page 
faults, and allocate different sizes only during write(). So the effect 
for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be 
completely unchanged even with "huge=always".

It will get more tricky once we change that behavior as well, but that's 
something to likely figure out if it is a real problem at a different 
day :)


I really preferred using the sysfs toggles (as discussed with Hugh in 
the meeting back then), but I can also understand why we at least want 
to try making tmpfs behave more like other file systems. But I'm a bit 
more careful to not ignore the cases where it really isn't like any 
other file system.

If we start making PMD-sized THPs special in any non-configurable way, 
then we are effectively *worse* off than allowing to configure them 
properly. So if someone voices "but we want only PMD-sized" ones, the 
next one will say "but we only want cont-pte sized-ones" and then we 
should provide an option to control the actual sizes to use differently, 
in some way. But let's see if that is even required.

-- 
Cheers,

David / dhildenb
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 3 weeks, 2 days ago

On 2024/10/31 16:53, David Hildenbrand wrote:
>>>>
>>>> If we don't want to go with the shmem_enabled toggles, we should
>>>> probably still extend the documentation to cover "all THP sizes", like
>>>> we did elsewhere.
>>>>
>>>> huge=never: no THPs of any size
>>>> huge=always: THPs of any size (fault/write/etc)
>>>> huge=fadvise: like "always" but only with fadvise/madvise
>>>> huge=within_size: like "fadvise" but respect i_size
>>>
>>> Thinking some more about that over the weekend, this is likely the way
>>> to go, paired with conditionally changing the default to
>>> always/within_size. I suggest a kconfig option for that.
>>
>> I am still worried about adding a new kconfig option, which might
>> complicate the tmpfs controls further.
> 
> Why exactly?

There will be more options to control huge page allocation for tmpfs, 
which may confuse users and make life harder. Yes, we can add some 
documentation, but I'm still a bit cautious about this.

> If we are changing a default similar to 
> CONFIG_TRANSPARENT_HUGEPAGE_NEVER -> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS, 
> it would make perfectly sense to give people building a kernel control 
> over that.
> 
> If we want to support this feature in a distro kernel like RHEL we'll 
> have to leave the default unmodified. Otherwise I see no way (excluding 
> downstream-only hacks) to backport this into distro kernels.
> 
>>
>>> That should probably do as a first shot; I assume people will want more
>>> control over which size to use, especially during page faults, but that
>>> can likely be added later.
> 
> I know, it puts you in a bad position because there are different 
> opinions floating around. But let's try to find something that is 
> reasonable and still acceptable. And let's hope that Hugh will voice an 
> opinion :D

Yes, I am also waiting to see if Hugh has any inputs :)

>> After some discussions, I think the first step is to achieve two goals:
>> 1) Try to make tmpfs use large folios like other file systems, that
>> means we should avoid adding more complex control options (per Matthew).
>> 2) Still need maintain compatibility with the 'huge=' mount option (per
>> Kirill), as I also remembered we have customers who use
>> 'huge=within_size' to allocate THPs for better performance.
> 
>>
>> Based on these considerations, my first step is to neither add a new
>> 'huge=' option parameter nor introduce the mTHP interfaces control for
>> tmpfs, but rather to change the default huge allocation behavior for
>> tmpfs. That is to say, when 'huge=' option is not configured, we will
>> allow the huge folios allocation based on the write size. As a result,
>> the behavior of huge pages for tmpfs will change as follows:
>> no 'huge=' set: can allocate any size huge folios based on write size
>> huge=never: no any size huge folios
>> huge=always: only PMD sized THP allocation as before
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
> 
> I don't like that:
> 
> (a) there is no way to explicitly enable/name that new behavior.

But this is similar to other file systems that enable large folios 
(setting mapping_set_large_folios()), and I haven't seen any other file 
system that supports large folios require a new Kconfig. Maybe tmpfs is 
a bit special?

If we all agree that tmpfs is a bit special when using huge pages, then 
fine, a Kconfig option might be needed.

> (b) "always" etc. are only concerned about PMDs.

Yes, we currently maintain the same semantics as before, in case users 
still expect THPs.

> So again, I suggest:
> 
> huge=never: No THPs of any size
> huge=always: THPs of any size
> huge=fadvise: like "always" but only with fadvise/madvise 
> huge=within_size: like "fadvise" but respect i_size
> 
> "huge=" default depends on a Kconfig option.
> 
> With that we:
> 
> (1) Maximize the cases where we will use large folios of any sizes
>      (which Willy cares about).
> (2) Have a way to disable them completely (which I care about).
> (3) Allow distros to keep the default unchanged.
> 
> Likely, for now we will only try allocating PMD-sized THPs during page 
> faults, and allocate different sizes only during write(). So the effect 
> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be 
> completely unchanged even with "huge=always".
> 
> It will get more tricky once we change that behavior as well, but that's 
> something to likely figure out if it is a real problem at at different 
> day :)
> 
> 
> I really preferred using the sysfs toggles (as discussed with Hugh in 
> the meeting back then), but I can also understand why we at least want 
> to try making tmpfs behave more like other file systems. But I'm a bit 
> more careful to not ignore the cases where it really isn't like any 
> other file system.

That's also my previous thought, but Matthew is strongly against that. 
Let's go step by step.

> If we start making PMD-sized THPs special in any non-configurable way, 
> then we are effectively off *worse* than allowing to configure them 
> properly. So if someone voices "but we want only PMD-sized" ones, the 
> next one will say "but we only want cont-pte sized-ones" and then we 
> should provide an option to control the actual sizes to use differently, 
> in some way. But let's see if that is even required.

Yes, I agree. So what I am thinking is, the 'huge=' option should be 
gradually deprecated in the future, and eventually tmpfs can allocate 
large folios of any size by default.
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by David Hildenbrand 3 weeks, 2 days ago
>>> I am still worried about adding a new kconfig option, which might
>>> complicate the tmpfs controls further.
>>
>> Why exactly?
> 
> There will be more options to control huge pages allocation for tmpfs,
> which may confuse users and make life harder? Yes, we can add some
> documentation, but I'm still a bit cautious about this.

If it's just "changing the default from "huge=never" to "huge=X" I don't 
see a big problem here. Again, we already do that for anon THPs.

If we make more behavior depend on that (which I don't think we should 
be doing), I agree that it would be more controversial.

[..]

>>>
>>>> That should probably do as a first shot; I assume people will want more
>>>> control over which size to use, especially during page faults, but that
>>>> can likely be added later.
>>
>> I know, it puts you in a bad position because there are different
>> opinions floating around. But let's try to find something that is
>> reasonable and still acceptable. And let's hope that Hugh will voice an
>> opinion :D
> 
> Yes, I am also waiting to see if Hugh has any inputs :)

We keep saying that ... I have to find a way to summon him :)

> 
>>> After some discussions, I think the first step is to achieve two goals:
>>> 1) Try to make tmpfs use large folios like other file systems, that
>>> means we should avoid adding more complex control options (per Matthew).
>>> 2) Still need maintain compatibility with the 'huge=' mount option (per
>>> Kirill), as I also remembered we have customers who use
>>> 'huge=within_size' to allocate THPs for better performance.
>>
>>>
>>> Based on these considerations, my first step is to neither add a new
>>> 'huge=' option parameter nor introduce the mTHP interfaces control for
>>> tmpfs, but rather to change the default huge allocation behavior for
>>> tmpfs. That is to say, when 'huge=' option is not configured, we will
>>> allow the huge folios allocation based on the write size. As a result,
>>> the behavior of huge pages for tmpfs will change as follows:
>>> no 'huge=' set: can allocate any size huge folios based on write size
>>> huge=never: no any size huge folios
>>> huge=always: only PMD sized THP allocation as before
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>
>> I don't like that:
>>
>> (a) there is no way to explicitly enable/name that new behavior.
> 
> But this is similar to other file systems that enable large folios
> (setting mapping_set_large_folios()), and I haven't seen any other file
> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
> a bit special?

I'm afraid I don't have the energy to explain once more why I think 
tmpfs is not just like any other file system in some cases.

And distributions are rather careful when it comes to something like 
this ...

> 
> If we all agree that tmpfs is a bit special when using huge pages, then
> fine, a Kconfig option might be needed.
> 
>> (b) "always" etc. are only concerned about PMDs.
> 
> Yes, currently maintain the same semantics as before, in case users
> still expect THPs.

Again, I don't think making PMD-sized ones special here is a reasonable 
approach. It will all get seriously confusing and inconsistent.

THPs are opportunistic after all, and page fault behavior will remain 
unchanged (PMD-sized) for now. And even if we support other sizes during 
page faults, we'd likely start with the largest size (PMD-size) first, 
and it might just all work better than before.

Happy to learn where this really makes a difference.

Of course, if you change the default behavior (which you are planning), 
it's ... a changed default.

If there are reasons to have more tunables regarding the sizes to use, 
then it should not be limited to PMD-size.

>> So again, I suggest:
>>
>> huge=never: No THPs of any size
>> huge=always: THPs of any size
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
>>
>> "huge=" default depends on a Kconfig option.
>>
>> With that we:
>>
>> (1) Maximize the cases where we will use large folios of any sizes
>>       (which Willy cares about).
>> (2) Have a way to disable them completely (which I care about).
>> (3) Allow distros to keep the default unchanged.
>>
>> Likely, for now we will only try allocating PMD-sized THPs during page
>> faults, and allocate different sizes only during write(). So the effect
>> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
>> completely unchanged even with "huge=always".
>>
>> It will get more tricky once we change that behavior as well, but that's
>> something to likely figure out if it is a real problem at at different
>> day :)
>>
>>
>> I really preferred using the sysfs toggles (as discussed with Hugh in
>> the meeting back then), but I can also understand why we at least want
>> to try making tmpfs behave more like other file systems. But I'm a bit
>> more careful to not ignore the cases where it really isn't like any
>> other file system.
> 
> That's also my previous thought, but Matthew is strongly against that.
> Let's step by step.

Yes, I understand his view as well.

But I won't blindly agree to the "tmpfs is just like any other file 
system" opinion :)

>> If we start making PMD-sized THPs special in any non-configurable way,
>> then we are effectively off *worse* than allowing to configure them
>> properly. So if someone voices "but we want only PMD-sized" ones, the
>> next one will say "but we only want cont-pte sized-ones" and then we
>> should provide an option to control the actual sizes to use differently,
>> in some way. But let's see if that is even required.
> 
> Yes, I agree. So what I am thinking is, the 'huge=' option should be
> gradually deprecated in the future and eventually tmpfs can allocate any
> size large folios as default.

Let's be realistic, it won't get removed any time soon. ;)

So changing "huge=always" etc. semantics to reflect our new size 
options, and then trying to change the default (with the option for 
people/distros to have the old default) is a reasonable approach, at 
least to me.

I'm trying to stay open-minded here, but the proposal I heard so far is 
not particularly appealing.

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 2 weeks, 4 days ago

On 2024/10/31 18:46, David Hildenbrand wrote:
[snip]

>>> I don't like that:
>>>
>>> (a) there is no way to explicitly enable/name that new behavior.
>>
>> But this is similar to other file systems that enable large folios
>> (setting mapping_set_large_folios()), and I haven't seen any other file
>> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
>> a bit special?
> 
> I'm afraid I don't have the energy to explain once more why I think 
> tmpfs is not just like any other file system in some cases.
> 
> And distributions are rather careful when it comes to something like 
> this ...
> 
>>
>> If we all agree that tmpfs is a bit special when using huge pages, then
>> fine, a Kconfig option might be needed.
>>
>>> (b) "always" etc. are only concerned about PMDs.
>>
>> Yes, currently maintain the same semantics as before, in case users
>> still expect THPs.
> 
> Again, I don't think that is a reasonable approach to make PMD-sized 
> ones special here. It will all get seriously confusing and inconsistent.

I agree PMD-sized should not be special. This is all for backward 
compatibility with the ‘huge=’ mount option, and adding a new kconfig is 
also for this purpose.

> THPs are opportunistic after all, and page fault behavior will remain 
> unchanged (PMD-sized) for now. And even if we support other sizes during 
> page faults, we'd like start with the largest size (PMD-size) first, and 
> it likely might just all work better than before.
> 
> Happy to learn where this really makes a difference.
> 
> Of course, if you change the default behavior (which you are planning), 
> it's ... a changed default.
> 
> If there are reasons to have more tunables regarding the sizes to use, 
> then it should not be limited to PMD-size.

I have tried to modify the code according to your suggestion (not tested 
yet). Is this what you had in mind?

static inline unsigned int
shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
			 loff_t write_end)
{
	unsigned int order;
	size_t size;

	if (!mapping_large_folio_support(mapping) || !write_end)
		return 0;

	/* Calculate the write size based on the write_end */
	size = write_end - (index << PAGE_SHIFT);
	order = filemap_get_order(size);
	if (!order)
		return 0;

	/* If we're not aligned, allocate a smaller folio */
	if (index & ((1UL << order) - 1))
		order = __ffs(index);

	order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
	return order > 0 ? BIT(order + 1) - 1 : 0;
}

static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
					      loff_t write_end, bool shmem_huge_force,
					      unsigned long vm_flags)
{
	bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
	unsigned long within_size_orders;
	unsigned int order;
	loff_t i_size;

	if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
		return 0;
	if (!S_ISREG(inode->i_mode))
		return 0;
	if (shmem_huge == SHMEM_HUGE_DENY)
		return 0;
	if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
		return BIT(HPAGE_PMD_ORDER);

	switch (SHMEM_SB(inode->i_sb)->huge) {
	case SHMEM_HUGE_NEVER:
		return 0;
	case SHMEM_HUGE_ALWAYS:
		if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
			return BIT(HPAGE_PMD_ORDER);

		return shmem_mapping_size_order(inode->i_mapping, index, write_end);
	case SHMEM_HUGE_WITHIN_SIZE:
		if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
			within_size_orders = BIT(HPAGE_PMD_ORDER);
		else
			within_size_orders = shmem_mapping_size_order(inode->i_mapping,
								      index, write_end);

		order = highest_order(within_size_orders);
		while (within_size_orders) {
			index = round_up(index + 1, 1 << order);
			i_size = max(write_end, i_size_read(inode));
			i_size = round_up(i_size, PAGE_SIZE);
			if (i_size >> PAGE_SHIFT >= index)
				return within_size_orders;

			order = next_order(&within_size_orders, order);
		}
		fallthrough;
	case SHMEM_HUGE_ADVISE:
		if (vm_flags & VM_HUGEPAGE) {
			if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
				return BIT(HPAGE_PMD_ORDER);

			return shmem_mapping_size_order(inode->i_mapping,
							index, write_end);
		}
		fallthrough;
	default:
		return 0;
	}
}

1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig option to keep 
'huge=' mount option compatibility.
2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled, 
we will get the possible huge orders based on the write size.
3) For tmpfs mmap() faults, always use the PMD-sized huge order.
4) For shmem, ignore the write size logic and always use PMD-sized THP 
to check if the global huge is enabled.
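
For example, assuming filemap_get_order() rounds the size down: with 
'huge=always' set on tmpfs and CONFIG_USE_ONLY_THP_FOR_TMPFS disabled, 
an aligned 32K write gets order 3 from shmem_mapping_size_order(), i.e. 
a bitmap allowing orders 0-3 (up to 32K folios).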

However, in case 2), if 'huge=always' is set and the write size is less 
than 4K (say, a 1-byte write to a new file), we will allocate small 
pages, which breaks the 'huge' semantics. Maybe it's not something to 
worry too much about.

>>> huge=never: No THPs of any size
>>> huge=always: THPs of any size
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>>
>>> "huge=" default depends on a Kconfig option.
>>>
>>> With that we:
>>>
>>> (1) Maximize the cases where we will use large folios of any sizes
>>>       (which Willy cares about).
>>> (2) Have a way to disable them completely (which I care about).
>>> (3) Allow distros to keep the default unchanged.
>>>
>>> Likely, for now we will only try allocating PMD-sized THPs during page
>>> faults, and allocate different sizes only during write(). So the effect
>>> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
>>> completely unchanged even with "huge=always".
>>>
>>> It will get more tricky once we change that behavior as well, but that's
>>> something to likely figure out if it is a real problem at at different
>>> day :)
>>>
>>>
>>> I really preferred using the sysfs toggles (as discussed with Hugh in
>>> the meeting back then), but I can also understand why we at least want
>>> to try making tmpfs behave more like other file systems. But I'm a bit
>>> more careful to not ignore the cases where it really isn't like any
>>> other file system.
>>
>> That's also my previous thought, but Matthew is strongly against that.
>> Let's step by step.
> 
> Yes, I understand his view as well.
> 
> But I won't blindly agree to the "tmpfs is just like any other file 
> system" opinion :)
> 
>>> If we start making PMD-sized THPs special in any non-configurable way,
>>> then we are effectively off *worse* than allowing to configure them
>>> properly. So if someone voices "but we want only PMD-sized" ones, the
>>> next one will say "but we only want cont-pte sized-ones" and then we
>>> should provide an option to control the actual sizes to use differently,
>>> in some way. But let's see if that is even required.
>>
>> Yes, I agree. So what I am thinking is, the 'huge=' option should be
>> gradually deprecated in the future and eventually tmpfs can allocate any
>> size large folios as default.
> 
> Let's be realistic, it won't get removed any time soon. ;)
> 
> So changing "huge=always" etc. semantics to reflect our new size 
> options, and then try changing the default (with the option for 
> people/distros to have the old default) is a reasonable approach, at 
> least to me.
> 
> I'm trying to stay open-minded here, but the proposal I heard so far is 
> not particularly appealing.
> 
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by David Hildenbrand 2 weeks, 4 days ago
On 05.11.24 13:45, Baolin Wang wrote:
> 
> 
> On 2024/10/31 18:46, David Hildenbrand wrote:
> [snip]
> 
>>>> I don't like that:
>>>>
>>>> (a) there is no way to explicitly enable/name that new behavior.
>>>
>>> But this is similar to other file systems that enable large folios
>>> (setting mapping_set_large_folios()), and I haven't seen any other file
>>> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
>>> a bit special?
>>
>> I'm afraid I don't have the energy to explain once more why I think
>> tmpfs is not just like any other file system in some cases.
>>
>> And distributions are rather careful when it comes to something like
>> this ...
>>
>>>
>>> If we all agree that tmpfs is a bit special when using huge pages, then
>>> fine, a Kconfig option might be needed.
>>>
>>>> (b) "always" etc. are only concerned about PMDs.
>>>
>>> Yes, currently maintain the same semantics as before, in case users
>>> still expect THPs.
>>
>> Again, I don't think that is a reasonable approach to make PMD-sized
>> ones special here. It will all get seriously confusing and inconsistent.
> 
> I agree PMD-sized should not be special. This is all for backward
> compatibility with the ‘huge=’ mount option, and adding a new kconfig is
> also for this purpose.
> 
>> THPs are opportunistic after all, and page fault behavior will remain
>> unchanged (PMD-sized) for now. And even if we support other sizes during
>> page faults, we'd likely start with the largest size (PMD-size) first, and
>> it might just all work better than before.
>>
>> Happy to learn where this really makes a difference.
>>
>> Of course, if you change the default behavior (which you are planning),
>> it's ... a changed default.
>>
>> If there are reasons to have more tunables regarding the sizes to use,
>> then it should not be limited to PMD-size.
> 
> I have tried to modify the code according to your suggestion (not tested
> yet). Is this what you had in mind?
> 
> static inline unsigned int
> shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
>                          loff_t write_end)
> {
>           unsigned int order;
>           size_t size;
> 
>           if (!mapping_large_folio_support(mapping) || !write_end)
>                   return 0;
> 
>           /* Calculate the write size based on the write_end */
>           size = write_end - (index << PAGE_SHIFT);
>           order = filemap_get_order(size);
>           if (!order)
>                   return 0;
> 
>           /* If we're not aligned, allocate a smaller folio */
>           if (index & ((1UL << order) - 1))
>                   order = __ffs(index);
> 
>           order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
>           return order > 0 ? BIT(order + 1) - 1 : 0;
> }
> 
> static unsigned int shmem_huge_global_enabled(struct inode *inode,
>                                               pgoff_t index, loff_t write_end,
>                                               bool shmem_huge_force,
>                                               unsigned long vm_flags)
> {
>           bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
>           unsigned long within_size_orders;
>           unsigned int order;
>           loff_t i_size;
> 
>           if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
>                   return 0;
>           if (!S_ISREG(inode->i_mode))
>                   return 0;
>           if (shmem_huge == SHMEM_HUGE_DENY)
>                   return 0;
>           if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>                   return BIT(HPAGE_PMD_ORDER);
> 
>           switch (SHMEM_SB(inode->i_sb)->huge) {
>           case SHMEM_HUGE_NEVER:
>                   return 0;
>           case SHMEM_HUGE_ALWAYS:
>                   if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>                           return BIT(HPAGE_PMD_ORDER);
> 
>                   return shmem_mapping_size_order(inode->i_mapping,
>                                                   index, write_end);
>           case SHMEM_HUGE_WITHIN_SIZE:
>                   if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>                           within_size_orders = BIT(HPAGE_PMD_ORDER);
>                   else
>                           within_size_orders =
>                                   shmem_mapping_size_order(inode->i_mapping,
>                                                            index, write_end);
> 
>                   order = highest_order(within_size_orders);
>                   while (within_size_orders) {
>                           index = round_up(index + 1, 1 << order);
>                           i_size = max(write_end, i_size_read(inode));
>                           i_size = round_up(i_size, PAGE_SIZE);
>                           if (i_size >> PAGE_SHIFT >= index)
>                                   return within_size_orders;
> 
>                           order = next_order(&within_size_orders, order);
>                   }
>                   fallthrough;
>           case SHMEM_HUGE_ADVISE:
>                   if (vm_flags & VM_HUGEPAGE) {
>                           if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>                                   return BIT(HPAGE_PMD_ORDER);
> 
>                           return shmem_mapping_size_order(inode->i_mapping,
>                                                           index, write_end);
>                   }
>                   fallthrough;
>           default:
>                   return 0;
>           }
> }
> 
> 1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep ‘huge=’
> mount option compatibility.
> 2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
> then the possible huge orders will be derived from the write size.
> 3) For tmpfs mmap() fault, always use PMD-sized huge order.
> 4) For shmem, ignore the write size logic and always use PMD-sized THP
> to check if the global huge is enabled.
> 
> However, in case 2), if 'huge=always' is set and the write size is less
> than 4K, we will allocate small pages; won't that break the 'huge'
> semantics? Maybe it's not something to worry too much about.
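To make the helper quoted above concrete, here is a minimal userspace sketch of the same order calculation. The PAGE_SHIFT/MAX_PAGECACHE_ORDER values are assumptions (4K base pages, x86-64 PMD order), and filemap_get_order()/__ffs() are re-implemented as plain loops rather than the kernel versions, so treat this as an illustration only:

#include <stdio.h>
#include <stddef.h>

#define PAGE_SHIFT		12	/* assumption: 4K base pages */
#define MAX_PAGECACHE_ORDER	9	/* assumption: PMD order on x86-64 */

/* stand-in for filemap_get_order(): ilog2 of the page count, 0 if < 2 pages */
static unsigned int size_to_order(size_t size)
{
	size_t pages = size >> PAGE_SHIFT;
	unsigned int order = 0;

	while (pages > 1) {
		pages >>= 1;
		order++;
	}
	return order;
}

/* stand-in for __ffs(): index of the lowest set bit (x must be non-zero) */
static unsigned int lowest_bit(unsigned long x)
{
	unsigned int i = 0;

	while (!(x & 1)) {
		x >>= 1;
		i++;
	}
	return i;
}

/* same shape as shmem_mapping_size_order() above, minus the mapping checks;
 * assumes write_end > index << PAGE_SHIFT */
static unsigned long orders_for_write(unsigned long index, long long write_end)
{
	size_t size = write_end - (index << PAGE_SHIFT);
	unsigned int order = size_to_order(size);

	if (!order)
		return 0;
	if (index & ((1UL << order) - 1))	/* unaligned index: shrink */
		order = lowest_bit(index);
	if (order > MAX_PAGECACHE_ORDER)
		order = MAX_PAGECACHE_ORDER;
	return order > 0 ? (1UL << (order + 1)) - 1 : 0;
}

int main(void)
{
	/* 2MB write at an aligned index: orders 0..9 allowed -> 0x3ff */
	printf("%#lx\n", orders_for_write(0, 512L << PAGE_SHIFT));
	/* 64KB write at index 4: index alignment caps us at order 2 -> 0x7 */
	printf("%#lx\n", orders_for_write(4, 20L << PAGE_SHIFT));
	/* sub-4K write: no large orders -> 0, i.e. a small page */
	printf("%#lx\n", orders_for_write(0, 2048));
	return 0;
}

The last case is exactly the sub-4K write raised in the question above: the returned bitmap allows order 0 only, so a small page is allocated.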

Probably I didn't express clearly what I think we should do, because this is
not quite what I had in mind.

I would use the CONFIG_USE_ONLY_THP_FOR_TMPFS way of doing it only if
really required. As raised, if someone needs finer control, providing that
only for a single size is rather limiting.



This is what I hope we can do (doc update to show what I mean):

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5034915f4e8e8..d7d1a9acdbfc5 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -349,11 +349,24 @@ user, the PMD_ORDER hugepage policy will be overridden. If the policy for
  PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
  default to ``never``.
  
-Hugepages in tmpfs/shmem
-========================
+tmpfs/shmem
+===========
  
-You can control hugepage allocation policy in tmpfs with mount option
-``huge=``. It can have following values:
+Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
+it also supports smaller sizes just like anonymous memory, often referred
+to as "multi-size THP" (mTHP). Huge pages of any size are commonly
+represented in the kernel as "large folios".
+
+While there is fine control over the huge page sizes to use for the internal
+shmem mount (see below), ordinary tmpfs mounts will make use of all
+available huge page sizes without any control over the exact sizes,
+behaving more like other file systems.
+
+tmpfs mounts
+------------
+
+The THP allocation policy for tmpfs mounts can be adjusted using the mount
+option: ``huge=``. It can have following values:
  
  always
      Attempt to allocate huge pages every time we need a new page;
@@ -368,19 +381,20 @@ within_size
  advise
      Only allocate huge pages if requested with fadvise()/madvise();
  
-The default policy is ``never``.
+Remember that the kernel may use huge pages of all available sizes, and
+that, unlike for the internal shmem mount, no fine-grained control is available.
+
+The default policy in the past was ``never``, but it can now be adjusted
+using the CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_ALWAYS,
+CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_NEVER etc.
  
  ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
  ``huge=never`` will not attempt to break up huge pages at all, just stop more
  from being allocated.
  
-There's also sysfs knob to control hugepage allocation policy for internal
-shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
-is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
-MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
-
-In addition to policies listed above, shmem_enabled allows two further
-values:
+In addition to policies listed above, the sysfs knob
+/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
+allocation policy of tmpfs mounts, when set to the following values:
  
  deny
      For use in emergencies, to force the huge option off from
@@ -388,13 +402,26 @@ deny
  force
      Force the huge option on for all - very useful for testing;
  
-Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
-control mTHP allocation:
-'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
-and its value for each mTHP is essentially consistent with the global
-setting.  An 'inherit' option is added to ensure compatibility with these
-global settings.  Conversely, the options 'force' and 'deny' are dropped,
-which are rather testing artifacts from the old ages.
+
+shmem / internal tmpfs
+----------------------
+
+The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
+mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+To control the THP allocation policy for this internal tmpfs mount, the
+sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
+per THP size in
+'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
+can be used.
+
+The global knob has the same semantics as the ``huge=`` mount options
+for tmpfs mounts, except that the different huge page sizes can be controlled
+individually, and will only use the setting of the global knob when the
+per-size knob is set to 'inherit'.
+
+The options 'force' and 'deny' are dropped for the individual sizes, which
+are rather testing artifacts from the old ages.
  
  always
      Attempt to allocate <size> huge pages every time we need a new page;
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 56a26c843dbe9..10de8f706d07b 100644



There is this question of "do we need the old way of doing it and only
allocate PMDs". For that, likely a config similar to the one you propose might
make sense, but I would want to see if there is real demand for that. In particular:
for whom the smaller sizes are a problem when bigger (PMD) sizes were
enabled in the past.
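As an aside, the per-size shmem knobs referenced in the diff above can be probed from userspace; a short sketch (the probed sizes are assumptions for a 4K-page x86-64 kernel, and knobs that do not exist on the running kernel are simply skipped):

#include <stdio.h>

/* Print the shmem policy of each mTHP size the kernel exposes. */
int main(void)
{
	static const int sizes_kb[] = { 16, 32, 64, 128, 256, 512, 1024, 2048 };
	char path[128], buf[128];

	for (int i = 0; i < (int)(sizeof(sizes_kb) / sizeof(sizes_kb[0])); i++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/kernel/mm/transparent_hugepage/hugepages-%dkB/shmem_enabled",
			 sizes_kb[i]);
		f = fopen(path, "r");
		if (!f)
			continue;	/* size not supported on this kernel */
		if (fgets(buf, sizeof(buf), f))
			printf("%6dkB: %s", sizes_kb[i], buf);
		fclose(f);
	}
	return 0;
}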

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 2 weeks, 3 days ago

On 2024/11/5 22:56, David Hildenbrand wrote:
> On 05.11.24 13:45, Baolin Wang wrote:
>>
>>
>> On 2024/10/31 18:46, David Hildenbrand wrote:
>> [snip]
>>
>>>>> I don't like that:
>>>>>
>>>>> (a) there is no way to explicitly enable/name that new behavior.
>>>>
>>>> But this is similar to other file systems that enable large folios
>>>> (setting mapping_set_large_folios()), and I haven't seen any other file
>>>> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
>>>> a bit special?
>>>
>>> I'm afraid I don't have the energy to explain once more why I think
>>> tmpfs is not just like any other file system in some cases.
>>>
>>> And distributions are rather careful when it comes to something like
>>> this ...
>>>
>>>>
>>>> If we all agree that tmpfs is a bit special when using huge pages, then
>>>> fine, a Kconfig option might be needed.
>>>>
>>>>> (b) "always" etc. are only concerned about PMDs.
>>>>
>>>> Yes, currently maintain the same semantics as before, in case users
>>>> still expect THPs.
>>>
>>> Again, I don't think that is a reasonable approach to make PMD-sized
>>> ones special here. It will all get seriously confusing and inconsistent.
>>
>> I agree PMD-sized should not be special. This is all for backward
>> compatibility with the ‘huge=’ mount option, and adding a new kconfig is
>> also for this purpose.
>>
>>> THPs are opportunistic after all, and page fault behavior will remain
>>> unchanged (PMD-sized) for now. And even if we support other sizes during
>>> page faults, we'd likely start with the largest size (PMD-size) first, and
>>> it might just all work better than before.
>>>
>>> Happy to learn where this really makes a difference.
>>>
>>> Of course, if you change the default behavior (which you are planning),
>>> it's ... a changed default.
>>>
>>> If there are reasons to have more tunables regarding the sizes to use,
>>> then it should not be limited to PMD-size.
>>
>> I have tried to modify the code according to your suggestion (not tested
>> yet). Is this what you had in mind?
>>
>> static inline unsigned int
>> shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
>>                          loff_t write_end)
>> {
>>           unsigned int order;
>>           size_t size;
>>
>>           if (!mapping_large_folio_support(mapping) || !write_end)
>>                   return 0;
>>
>>           /* Calculate the write size based on the write_end */
>>           size = write_end - (index << PAGE_SHIFT);
>>           order = filemap_get_order(size);
>>           if (!order)
>>                   return 0;
>>
>>           /* If we're not aligned, allocate a smaller folio */
>>           if (index & ((1UL << order) - 1))
>>                   order = __ffs(index);
>>
>>           order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
>>           return order > 0 ? BIT(order + 1) - 1 : 0;
>> }
>>
>> static unsigned int shmem_huge_global_enabled(struct inode *inode,
>>                                               pgoff_t index, loff_t write_end,
>>                                               bool shmem_huge_force,
>>                                               unsigned long vm_flags)
>> {
>>           bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
>>           unsigned long within_size_orders;
>>           unsigned int order;
>>           loff_t i_size;
>>
>>           if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
>>                   return 0;
>>           if (!S_ISREG(inode->i_mode))
>>                   return 0;
>>           if (shmem_huge == SHMEM_HUGE_DENY)
>>                   return 0;
>>           if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>>                   return BIT(HPAGE_PMD_ORDER);
>>
>>           switch (SHMEM_SB(inode->i_sb)->huge) {
>>           case SHMEM_HUGE_NEVER:
>>                   return 0;
>>           case SHMEM_HUGE_ALWAYS:
>>                   if (is_shmem || 
>> IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>>                           return BIT(HPAGE_PMD_ORDER);
>>
>>                   return shmem_mapping_size_order(inode->i_mapping,
>>                                                   index, write_end);
>>           case SHMEM_HUGE_WITHIN_SIZE:
>>                   if (is_shmem || 
>> IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>>                           within_size_orders = BIT(HPAGE_PMD_ORDER);
>>                   else
>>                           within_size_orders =
>>                                   shmem_mapping_size_order(inode->i_mapping,
>>                                                            index, write_end);
>>
>>                   order = highest_order(within_size_orders);
>>                   while (within_size_orders) {
>>                           index = round_up(index + 1, 1 << order);
>>                           i_size = max(write_end, i_size_read(inode));
>>                           i_size = round_up(i_size, PAGE_SIZE);
>>                           if (i_size >> PAGE_SHIFT >= index)
>>                                   return within_size_orders;
>>
>>                           order = next_order(&within_size_orders, order);
>>                   }
>>                   fallthrough;
>>           case SHMEM_HUGE_ADVISE:
>>                   if (vm_flags & VM_HUGEPAGE) {
>>                           if (is_shmem ||
>>                                   IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>>                                   return BIT(HPAGE_PMD_ORDER);
>>
>>                           return shmem_mapping_size_order(inode->i_mapping,
>>                                                           index, write_end);
>>                   }
>>                   fallthrough;
>>           default:
>>                   return 0;
>>           }
>> }
>>
>> 1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep ‘huge=’
>> mount option compatibility.
>> 2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
>> then the possible huge orders will be derived from the write size.
>> 3) For tmpfs mmap() fault, always use PMD-sized huge order.
>> 4) For shmem, ignore the write size logic and always use PMD-sized THP
>> to check if the global huge is enabled.
>>
>> However, in case 2), if 'huge=always' is set and the write size is less
>> than 4K, we will allocate small pages; won't that break the 'huge'
>> semantics? Maybe it's not something to worry too much about.
> 
> Probably I didn't express clearly what I think we should do, because this is
> not quite what I had in mind.
> 
> I would use the CONFIG_USE_ONLY_THP_FOR_TMPFS way of doing it only if
> really required. As raised, if someone needs finer control, providing that
> only for a single size is rather limiting.

OK. I misunderstood your points.

> This is what I hope we can do (doc update to show what I mean):

Thanks for updating the doc. I'd like to include it in the next version.

> diff --git a/Documentation/admin-guide/mm/transhuge.rst 
> b/Documentation/admin-guide/mm/transhuge.rst
> index 5034915f4e8e8..d7d1a9acdbfc5 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -349,11 +349,24 @@ user, the PMD_ORDER hugepage policy will be 
> overridden. If the policy for
>   PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
>   default to ``never``.
> 
> -Hugepages in tmpfs/shmem
> -========================
> +tmpfs/shmem
> +===========
> 
> -You can control hugepage allocation policy in tmpfs with mount option
> -``huge=``. It can have following values:
> +Traditionally, tmpfs only supported a single huge page size ("PMD"). 
> Today,
> +it also supports smaller sizes just like anonymous memory, often referred
> +to as "multi-size THP" (mTHP). Huge pages of any size are commonly
> +represented in the kernel as "large folios".
> +
> +While there is fine control over the huge page sizes to use for the 
> internal
> +shmem mount (see below), ordinary tmpfs mounts will make use of all
> +available huge page sizes without any control over the exact sizes,
> +behaving more like other file systems.
> +
> +tmpfs mounts
> +------------
> +
> +The THP allocation policy for tmpfs mounts can be adjusted using the mount
> +option: ``huge=``. It can have following values:
> 
>   always
>       Attempt to allocate huge pages every time we need a new page;
> @@ -368,19 +381,20 @@ within_size
>   advise
>       Only allocate huge pages if requested with fadvise()/madvise();
> 
> -The default policy is ``never``.
> +Remember that the kernel may use huge pages of all available sizes, and
> +that, unlike for the internal shmem mount, no fine-grained control is available.
> +
> +The default policy in the past was ``never``, but it can now be adjusted
> +using the CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_ALWAYS,
> +CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_NEVER etc.
> 
>   ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
>   ``huge=never`` will not attempt to break up huge pages at all, just 
> stop more
>   from being allocated.
> 
> -There's also sysfs knob to control hugepage allocation policy for internal
> -shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
> -is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
> -MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
> -
> -In addition to policies listed above, shmem_enabled allows two further
> -values:
> +In addition to policies listed above, the sysfs knob
> +/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
> +allocation policy of tmpfs mounts, when set to the following values:
> 
>   deny
>       For use in emergencies, to force the huge option off from
> @@ -388,13 +402,26 @@ deny
>   force
>       Force the huge option on for all - very useful for testing;
> 
> -Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
> -control mTHP allocation:
> -'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
> -and its value for each mTHP is essentially consistent with the global
> -setting.  An 'inherit' option is added to ensure compatibility with these
> -global settings.  Conversely, the options 'force' and 'deny' are dropped,
> -which are rather testing artifacts from the old ages.
> +
> +shmem / internal tmpfs
> +----------------------
> +
> +The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
> +mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
> +
> +To control the THP allocation policy for this internal tmpfs mount, the
> +sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
> +per THP size in
> +'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
> +can be used.
> +
> +The global knob has the same semantics as the ``huge=`` mount options
> +for tmpfs mounts, except that the different huge page sizes can be 
> controlled
> +individually, and will only use the setting of the global knob when the
> +per-size knob is set to 'inherit'.
> +
> +The options 'force' and 'deny' are dropped for the individual sizes, which
> +are rather testing artifacts from the old ages.
> 
>   always
>       Attempt to allocate <size> huge pages every time we need a new page;
> diff --git a/Documentation/filesystems/tmpfs.rst 
> b/Documentation/filesystems/tmpfs.rst
> index 56a26c843dbe9..10de8f706d07b 100644
> 
> 
> 
> There is this question of "do we need the old way of doing it and only
> allocate PMDs". For that, likely a config similar to the one you propose
> might make sense, but I would want to see if there is real demand for that.
> In particular: for whom are the smaller sizes a problem when bigger (PMD)
> sizes were enabled in the past?

I am also not sure if such a case exists. I can remove this kconfig for
now, and we can consider it again if someone really complains about this
in the future.
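For reference, the SHMEM_HUGE_WITHIN_SIZE walk in the code quoted above can be modeled in userspace roughly as follows. highest_order()/next_order() are re-implemented with plain bit operations, i_size is taken as a precomputed parameter (folding the max(write_end, i_size_read()) step), and the constants are assumptions, so this is a sketch of the loop's shape, not the kernel code:

#include <stdio.h>

#define PAGE_SHIFT 12	/* assumption: 4K base pages */

static unsigned int highest_order(unsigned long orders)
{
	unsigned int order = 0;

	while (orders >> (order + 1))
		order++;
	return order;
}

static unsigned int next_order(unsigned long *orders, unsigned int order)
{
	*orders &= ~(1UL << order);
	return *orders ? highest_order(*orders) : 0;
}

static unsigned long round_up_pages(unsigned long x, unsigned long step)
{
	return (x + step - 1) / step * step;
}

/*
 * Drop candidate orders, largest first, until one fits entirely below
 * i_size; as in the kernel loop, everything left in the bitmap is allowed.
 */
static unsigned long within_size_filter(unsigned long orders,
					unsigned long index,
					long long i_size)
{
	unsigned int order = highest_order(orders);

	while (orders) {
		unsigned long end = round_up_pages(index + 1, 1UL << order);

		if ((i_size >> PAGE_SHIFT) >= end)
			return orders;
		order = next_order(&orders, order);
	}
	return 0;
}

int main(void)
{
	/* 1MB file, write at index 0, candidates 0..9 (0x3ff):
	 * order 9 (2MB) crosses i_size, order 8 (1MB) fits -> 0x1ff */
	printf("%#lx\n", within_size_filter(0x3ff, 0, 1L << 20));
	return 0;
}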
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 4 weeks, 1 day ago

On 2024/10/24 18:49, Daniel Gomez wrote:
> On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote:
>> On 23.10.24 10:04, Baolin Wang wrote:
>>>
>>>
>>> On 2024/10/22 23:31, David Hildenbrand wrote:
>>>> On 22.10.24 05:41, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/10/21 21:34, Daniel Gomez wrote:
>>>>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>>>>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>>>>>>> + Kirill
>>>>>>>>>>
>>>>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
>>>>>>>>>>>> control the THP
>>>>>>>>>>>> allocation, it is necessary to maintain compatibility with the
>>>>>>>>>>>> 'huge='
>>>>>>>>>>>> option, as well as considering the 'deny' and 'force' option
>>>>>>>>>>>> controlled
>>>>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>>>
>>>>>>>>>>> No, it's not.  No other filesystem honours these settings.
>>>>>>>>>>> tmpfs would
>>>>>>>>>>> not have had these settings if it were written today.  It should
>>>>>>>>>>> simply
>>>>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option
>>>>>>>>>>> now that
>>>>>>>>>>> we have a better solution to the original problem.
>>>>>>>>>>>
>>>>>>>>>>> To reiterate my position:
>>>>>>>>>>>
>>>>>>>>>>>        - When using tmpfs as a filesystem, it should behave like
>>>>>>>>>>> other
>>>>>>>>>>>          filesystems.
>>>>>>>>>>>        - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
>>>>>>>>>>> it should
>>>>>>>>>>>          behave like anonymous memory.
>>>>>>>>>>
>>>>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option
>>>>>>>>>> has
>>>>>>>>>> existed for nearly 8 years, and the huge orders based on write
>>>>>>>>>> size may not
>>>>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such
>>>>>>>>>> as when the
>>>>>>>>>> write length is consistently 4K. So, I am still concerned that
>>>>>>>>>> ignoring the
>>>>>>>>>> 'huge' option could lead to compatibility issues.
>>>>>>>>>
>>>>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>>>>>>
>>>>>>>> OK.
>>>>>>>>
>>>>>>>>> Maybe we need to get a new generic interface to request the semantics
>>>>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of
>>>>>>>>> FADV_*
>>>>>>>>> handles to make kernel allocate PMD-size folio on any allocation
>>>>>>>>> or on
>>>>>>>>> allocations within i_size. I think this behaviour is useful beyond
>>>>>>>>> tmpfs.
>>>>>>>>>
>>>>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs
>>>>>>>>> compatible
>>>>>>>>> with current deployments and less special comparing to rest of
>>>>>>>>> filesystems on kernel side.
>>>>>>>>
>>>>>>>> I did a quick search, and I didn't find any other fs that require
>>>>>>>> PMD-sized
>>>>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems
>>>>>>>> other than
>>>>>>>> tmpfs. Please correct me if I missed something.
>>>>>>>
>>>>>>> What do you mean by "require"? THPs are always opportunistic.
>>>>>>>
>>>>>>> IIUC, we don't have a way to hint kernel to use huge pages for a
>>>>>>> file on
>>>>>>> read from backing storage. Readahead is not always the right way.
>>>>>>>
>>>>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>>>>>>> filesystems.
>>>>>>>>
>>>>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
>>>>>>>> allocate large
>>>>>>>> folios based on the write size? If yes, that means it will change the
>>>>>>>> default huge behavior for tmpfs. Because previously having 'huge='
>>>>>>>> is not
>>>>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar
>>>>>>>> to what I
>>>>>>>> mentioned:
>>>>>>>> "Another possible choice is to make the huge pages allocation based
>>>>>>>> on write
>>>>>>>> size as the *default* behavior for tmpfs, ..."
>>>>>>>
>>>>>>> I am more worried about breaking existing users of huge pages. So
>>>>>>> changing
>>>>>>> behaviour of users who don't specify huge is okay to me.
>>>>>>
>>>>>> I think moving tmpfs to allocate large folios opportunistically by
>>>>>> default (as it was proposed initially) doesn't necessarily conflict with
>>>>>> the default behaviour (huge=never). We just need to clarify that in
>>>>>> the documentation.
>>>>>>
>>>>>> However, and IIRC, one of the requests from Hugh was to have a way to
>>>>>> disable large folios which is something other FS do not have control
>>>>>> of as of today. Ryan sent a proposal to actually control that globally
>>>>>> but I think it didn't move forward. So, what are we missing to go back
>>>>>> to implement large folios in tmpfs in the default case, as any other fs
>>>>>> leveraging large folios?
>>>>>
>>>>> IMHO, as I discussed with Kirill, we still need to maintain compatibility
>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>> allocate large folios based on the write size.
> 
> So, in order to make tmpfs behave like other filesystems, we need to
> allocate large folios by default. Not setting 'huge=' is the same as
> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> control THP, not large folios, so it should not have a conflict here, or
> else, what case are you thinking?
> 
> So, to make tmpfs behave like other filesystems, we need to allocate
> large folios by default. According to the documentation, not setting

Right.

> 'huge=' is the same as setting 'huge=never.' However, 'huge=' is

I will update the documentation in the next version. That means if the
'huge=' option is not set, we can still allocate large folios based on the
write size (which will no longer be the same as setting 'huge=never').

> intended to control THP, not large folios, so there shouldn't be
> a conflict in this case. Can you clarify what specific scenario or

Yes, we should still keep the same semantics for the
'huge=always/within_size/advise' settings, which only control THP
allocations.

> conflict you're considering here? Perhaps when large folios order is the
> same as PMD-size?
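Putting the two sides of this exchange together, one hedged reading of the converging policy in code form (the enum, names, and the PMD order value are illustrative assumptions for the sketch, not the actual patch):

#include <stdio.h>
#include <stddef.h>

#define PAGE_SHIFT	12	/* assumption: 4K base pages */
#define PMD_ORDER	9	/* assumption: x86-64 */

/* within_size/advise omitted for brevity */
enum huge_opt { HUGE_UNSET, HUGE_NEVER, HUGE_ALWAYS };

static unsigned int ilog2_pages(size_t size)
{
	size_t pages = size >> PAGE_SHIFT;
	unsigned int order = 0;

	while (pages > 1) {
		pages >>= 1;
		order++;
	}
	return order;
}

/* Bitmap of folio orders a tmpfs write of 'size' bytes may use. */
static unsigned long tmpfs_write_orders(enum huge_opt opt, size_t size)
{
	switch (opt) {
	case HUGE_NEVER:	/* explicit opt-out keeps working: base pages only */
		return 0;
	case HUGE_ALWAYS:	/* historic semantics: PMD-sized THP */
		return 1UL << PMD_ORDER;
	case HUGE_UNSET:	/* behave like other large-folio filesystems */
	default:
		return (1UL << (ilog2_pages(size) + 1)) - 1;
	}
}

int main(void)
{
	printf("never:  %#lx\n", tmpfs_write_orders(HUGE_NEVER, 64 << 10));
	printf("always: %#lx\n", tmpfs_write_orders(HUGE_ALWAYS, 64 << 10));
	printf("unset:  %#lx\n", tmpfs_write_orders(HUGE_UNSET, 64 << 10));
	return 0;
}

Under this reading, 'huge=never' remains the kill switch, 'huge=always' keeps its PMD meaning, and an unset 'huge=' picks orders from the write size.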
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Daniel Gomez 1 month ago
On Thu Oct 24, 2024 at 12:49 PM CEST, Daniel Gomez wrote:
> On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote:
> > On 23.10.24 10:04, Baolin Wang wrote:
> > > 
> > > 
> > > On 2024/10/22 23:31, David Hildenbrand wrote:
> > >> On 22.10.24 05:41, Baolin Wang wrote:
> > >>>
> > >>>
> > >>> On 2024/10/21 21:34, Daniel Gomez wrote:
> > >>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
> > >>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
> > >>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
> > >>>>>>>> + Kirill
> > >>>>>>>>
> > >>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
> > >>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> > >>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
> > >>>>>>>>>> control the THP
> > >>>>>>>>>> allocation, it is necessary to maintain compatibility with the
> > >>>>>>>>>> 'huge='
> > >>>>>>>>>> option, as well as considering the 'deny' and 'force' option
> > >>>>>>>>>> controlled
> > >>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> > >>>>>>>>>
> > >>>>>>>>> No, it's not.  No other filesystem honours these settings.
> > >>>>>>>>> tmpfs would
> > >>>>>>>>> not have had these settings if it were written today.  It should
> > >>>>>>>>> simply
> > >>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option
> > >>>>>>>>> now that
> > >>>>>>>>> we have a better solution to the original problem.
> > >>>>>>>>>
> > >>>>>>>>> To reiterate my position:
> > >>>>>>>>>
> > >>>>>>>>>       - When using tmpfs as a filesystem, it should behave like
> > >>>>>>>>> other
> > >>>>>>>>>         filesystems.
> > >>>>>>>>>       - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
> > >>>>>>>>> it should
> > >>>>>>>>>         behave like anonymous memory.
> > >>>>>>>>
> > >>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option
> > >>>>>>>> has
> > >>>>>>>> existed for nearly 8 years, and the huge orders based on write
> > >>>>>>>> size may not
> > >>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such
> > >>>>>>>> as when the
> > >>>>>>>> write length is consistently 4K. So, I am still concerned that
> > >>>>>>>> ignoring the
> > >>>>>>>> 'huge' option could lead to compatibility issues.
> > >>>>>>>
> > >>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
> > >>>>>>
> > >>>>>> OK.
> > >>>>>>
> > >>>>>>> Maybe we need to get a new generic interface to request the semantics
> > >>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of
> > >>>>>>> FADV_*
> > >>>>>>> handles to make kernel allocate PMD-size folio on any allocation
> > >>>>>>> or on
> > >>>>>>> allocations within i_size. I think this behaviour is useful beyond
> > >>>>>>> tmpfs.
> > >>>>>>>
> > >>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
> > >>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs
> > >>>>>>> compatible
> > >>>>>>> with current deployments and less special comparing to rest of
> > >>>>>>> filesystems on kernel side.
> > >>>>>>
> > >>>>>> I did a quick search, and I didn't find any other fs that require
> > >>>>>> PMD-sized
> > >>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems
> > >>>>>> other than
> > >>>>>> tmpfs. Please correct me if I missed something.
> > >>>>>
> > >>>>> What do you mean by "require"? THPs are always opportunistic.
> > >>>>>
> > >>>>> IIUC, we don't have a way to hint kernel to use huge pages for a
> > >>>>> file on
> > >>>>> read from backing storage. Readahead is not always the right way.
> > >>>>>
> > >>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
> > >>>>>>> filesystems.
> > >>>>>>
> > >>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
> > >>>>>> allocate large
> > >>>>>> folios based on the write size? If yes, that means it will change the
> > >>>>>> default huge behavior for tmpfs. Because previously having 'huge='
> > >>>>>> is not
> > >>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar
> > >>>>>> to what I
> > >>>>>> mentioned:
> > >>>>>> "Another possible choice is to make the huge pages allocation based
> > >>>>>> on write
> > >>>>>> size as the *default* behavior for tmpfs, ..."
> > >>>>>
> > >>>>> I am more worried about breaking existing users of huge pages. So
> > >>>>> changing
> > >>>>> behaviour of users who don't specify huge is okay to me.
> > >>>>
> > >>>> I think moving tmpfs to allocate large folios opportunistically by
> > >>>> default (as it was proposed initially) doesn't necessarily conflict with
> > >>>> the default behaviour (huge=never). We just need to clarify that in
> > >>>> the documentation.
> > >>>>
> > >>>> However, and IIRC, one of the requests from Hugh was to have a way to
> > >>>> disable large folios which is something other FS do not have control
> > >>>> of as of today. Ryan sent a proposal to actually control that globally
> > >>>> but I think it didn't move forward. So, what are we missing to go back
> > >>>> to implement large folios in tmpfs in the default case, as any other fs
> > >>>> leveraging large folios?
> > >>>
> > >>> IMHO, as I discussed with Kirill, we still need to maintain compatibility
> > >>> with the 'huge=' mount option. This means that if 'huge=never' is set
> > >>> for tmpfs, huge page allocation will still be prohibited (which can
> > >>> address Hugh's request?). However, if 'huge=' is not set, we can
> > >>> allocate large folios based on the write size.
>
> So, in order to make tmpfs behave like other filesystems, we need to
> allocate large folios by default. Not setting 'huge=' is the same as
> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> control THP, not large folios, so it should not have a conflict here, or
> else, what case are you thinking?
>
> So, to make tmpfs behave like other filesystems, we need to allocate
> large folios by default. According to the documentation, not setting
> 'huge=' is the same as setting 'huge=never.' However, 'huge=' is
> intended to control THP, not large folios, so there shouldn't be
> a conflict in this case. Can you clarify what specific scenario or
> conflict you're considering here? Perhaps when large folios order is the
> same as PMD-size?

Sorry for the duplicate paragraph.

>
> > >>
> > >> I consider allocating large folios in shmem/tmpfs on the write path less
> > >> controversial than allocating them on the page fault path -- especially
> > >> as long as we stay within the size to-be-written.
> > >>
> > >> I think in RHEL, THP on shmem/tmpfs is disabled by default (e.g.,
> > >> shmem_enabled=never). Maybe because of some rather undesired
> > >> side-effects (maybe some are historical?): I recall issues with VMs with
> > >> THP+ memory ballooning, as we cannot reclaim pages of folios if
> > >> splitting fails. I assume most of these problematic use cases don't use
> > >> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
> > >> thing.
> > >>
> > >> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
> > >> documentation; most documentation is only concerned about anon THP.
> > >> Which makes me conclude that they are not suggested as of now.
> > >>
> > >> I see more issues with allocating them on the page fault path and not
> > >> having a way to disable it -- compared to allocating them on the write()
> > >> path.
> > > 
> > > I may not understand your issues. IIUC, you can disable allocating huge
> > > pages on the page fault path by using the 'huge=never' mount option or
> > > setting shmem_enabled=deny. No?
> >
> > That's what I am saying: if there is some way to disable it that will 
> > keep working, great.
>
> I agree. That aligns with what I recall Hugh requested. However, I
> believe if that is the way to go, we shouldn't limit it to tmpfs.
> Otherwise, why should tmpfs be prevented from allocating large folios if
> other filesystems in the system are allowed to allocate them? I think,
> if we want to disable large folios we should make it more generic,
> something similar to Ryan's proposal [1] for controlling folio sizes.
>
> [1] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
>
> That said, there has already been disagreement on this point here [2].
>
> [2] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Kefeng Wang 1 month, 1 week ago

On 2024/10/10 17:58, Baolin Wang wrote:
> Hi,
> 
> This RFC patch series attempts to support large folios for tmpfs.
> 
> Considering that tmpfs already has the 'huge=' option to control the THP
> allocation, it is necessary to maintain compatibility with the 'huge='
> option, as well as considering the 'deny' and 'force' option controlled
> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> 
> Add a new huge option 'write_size' to support large folio allocation based
> on the write size for tmpfs write and fallocate paths. So the huge pages
> allocation strategy for tmpfs is that, if the 'huge=' option
> (huge=always/within_size/advise) is enabled or the 'shmem_enabled' option
> is 'force', it need just allow PMD sized THP to keep backward compatibility
> for tmpfs. While 'huge=' option is disabled (huge=never) or the 'shmem_enabled'
> option is 'deny', it will still disable any large folio allocations. Only
> when the 'huge=' option is 'write_size', it will allow allocating large
> folios based on the write size.
> 
> And I think the 'huge=write_size' option should be the default behavior
> for tmpfs in future.

Could we avoid a new huge= option for tmpfs, and maybe support other orders
for read/write/fallocate when mounting with huge?

> 
> Any comments and suggestions are appreciated. Thanks.
> 
> Changes from RFC v2:
>   - Drop mTHP interfaces to control huge page allocation, per Matthew.
>   - Add a new helper to calculate the order, suggested by Matthew.
>   - Add a new huge=write_size option to allocate large folios based on
>     the write size.
>   - Add a new patch to update the documentation.
> 
> Changes from RFC v1:
>   - Drop patch 1.
>   - Use 'write_end' to calculate the length in shmem_allowable_huge_orders().
>   - Update shmem_mapping_size_order() per Daniel.
> 
> Baolin Wang (4):
>    mm: factor out the order calculation into a new helper
>    mm: shmem: change shmem_huge_global_enabled() to return huge order
>      bitmap
>    mm: shmem: add large folio support to the write and fallocate paths
>      for tmpfs
>    docs: tmpfs: add documention for 'write_size' huge option
> 
>   Documentation/filesystems/tmpfs.rst |   7 +-
>   include/linux/pagemap.h             |  16 ++++-
>   mm/shmem.c                          | 105 ++++++++++++++++++++--------
>   3 files changed, 94 insertions(+), 34 deletions(-)
>
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 1 month, 1 week ago

On 2024/10/16 15:49, Kefeng Wang wrote:
> 
> 
> On 2024/10/10 17:58, Baolin Wang wrote:
>> Hi,
>>
>> This RFC patch series attempts to support large folios for tmpfs.
>>
>> Considering that tmpfs already has the 'huge=' option to control the THP
>> allocation, it is necessary to maintain compatibility with the 'huge='
>> option, as well as considering the 'deny' and 'force' option controlled
>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>
>> Add a new huge option 'write_size' to support large folio allocation 
>> based
>> on the write size for tmpfs write and fallocate paths. So the huge pages
>> allocation strategy for tmpfs is that, if the 'huge=' option
>> (huge=always/within_size/advise) is enabled or the 'shmem_enabled' option
>> is 'force', it need just allow PMD sized THP to keep backward 
>> compatibility
>> for tmpfs. While 'huge=' option is disabled (huge=never) or the 
>> 'shmem_enabled'
>> option is 'deny', it will still disable any large folio allocations. Only
>> when the 'huge=' option is 'write_size', it will allow allocating large
>> folios based on the write size.
>>
>> And I think the 'huge=write_size' option should be the default behavior
>> for tmpfs in future.
> 
> Could we avoid a new huge= option for tmpfs, and maybe support other orders
> for read/write/fallocate when mounting with huge?

Um, I am afraid not, as that would break the 'huge=' compatibility. That 
is to say, users still want PMD-sized huge pages if 'huge=always'.
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Kefeng Wang 1 month, 1 week ago

On 2024/10/16 17:29, Baolin Wang wrote:
> 
> 
> On 2024/10/16 15:49, Kefeng Wang wrote:
>>
>>
>> On 2024/10/10 17:58, Baolin Wang wrote:
>>> Hi,
>>>
>>> This RFC patch series attempts to support large folios for tmpfs.
>>>
>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>> option, as well as considering the 'deny' and 'force' option controlled
>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>
>>> Add a new huge option 'write_size' to support large folio allocation 
>>> based
>>> on the write size for tmpfs write and fallocate paths. So the huge pages
>>> allocation strategy for tmpfs is that, if the 'huge=' option
>>> (huge=always/within_size/advise) is enabled or the 'shmem_enabled' 
>>> option
>>> is 'force', it need just allow PMD sized THP to keep backward 
>>> compatibility
>>> for tmpfs. While 'huge=' option is disabled (huge=never) or the 
>>> 'shmem_enabled'
>>> option is 'deny', it will still disable any large folio allocations. 
>>> Only
>>> when the 'huge=' option is 'write_size', it will allow allocating large
>>> folios based on the write size.
>>>
>>> And I think the 'huge=write_size' option should be the default behavior
>>> for tmpfs in future.
>>
>> Could we avoid a new huge= option for tmpfs, and maybe support other orders
>> for read/write/fallocate when mounting with huge?
> 
> Um, I am afraid not, as that would break the 'huge=' compatibility. That 
> is to say, users still want PMD-sized huge pages if 'huge=always'.

Yes, compatibility may be an issue, but supporting large folios only on the
write/fallocate side is a little strange; maybe add a new mode to support
read/write/fallocate together?
Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Posted by Baolin Wang 1 month, 1 week ago

On 2024/10/16 21:45, Kefeng Wang wrote:
> 
> 
> On 2024/10/16 17:29, Baolin Wang wrote:
>>
>>
>> On 2024/10/16 15:49, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/10/10 17:58, Baolin Wang wrote:
>>>> Hi,
>>>>
>>>> This RFC patch series attempts to support large folios for tmpfs.
>>>>
>>>> Considering that tmpfs already has the 'huge=' option to control the 
>>>> THP
>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>
>>>> Add a new huge option 'write_size' to support large folio allocation 
>>>> based
>>>> on the write size for tmpfs write and fallocate paths. So the huge 
>>>> pages
>>>> allocation strategy for tmpfs is that, if the 'huge=' option
>>>> (huge=always/within_size/advise) is enabled or the 'shmem_enabled' 
>>>> option
>>>> is 'force', it need just allow PMD sized THP to keep backward 
>>>> compatibility
>>>> for tmpfs. While 'huge=' option is disabled (huge=never) or the 
>>>> 'shmem_enabled'
>>>> option is 'deny', it will still disable any large folio allocations. 
>>>> Only
>>>> when the 'huge=' option is 'write_size', it will allow allocating large
>>>> folios based on the write size.
>>>>
>>>> And I think the 'huge=write_size' option should be the default behavior
>>>> for tmpfs in future.
>>>
>>> Could we avoid a new huge= option for tmpfs, and maybe support other orders
>>> for read/write/fallocate when mounting with huge?
>>
>> Um, I am afraid not, as that would break the 'huge=' compatibility. 
>> That is to say, users still want PMD-sized huge pages if 'huge=always'.
> 
> Yes, compatibility may be an issue, but supporting large folios only on the
> write/fallocate side is a little strange; maybe add a new mode to support
> read/write/fallocate together?

Because tmpfs read() will not allocate folios for tmpfs holes, and will
use the ZERO_PAGE instead. And if the shmem folios are swapped out, we
currently always swap in base pages, which is another story...
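A quick way to see the hole behaviour from userspace, assuming /dev/shm is a tmpfs mount (the file name is made up for the demo):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/dev/shm/hole-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return 1;
	if (ftruncate(fd, 2L << 20) < 0)	/* 2MB file that is all hole */
		return 1;
	/* a read over a hole is served with zeroes (backed by the ZERO_PAGE),
	 * so no folio needs to be allocated for the hole itself */
	if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		return 1;
	printf("first byte of the hole: %d\n", buf[0]);
	unlink("/dev/shm/hole-demo");
	close(fd);
	return 0;
}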

For tmpfs mmap() read, we do not have a length to indicate how large a
folio should be allocated. Moreover, we have decided against adding any
mTHP interfaces for tmpfs in the previous discussion[1].

[1] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/