Hi,

This RFC patch series attempts to support large folios for tmpfs.

Considering that tmpfs already has the 'huge=' option to control THP allocation, it is necessary to maintain compatibility with the 'huge=' option, as well as to consider the 'deny' and 'force' options controlled by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. This series adds a new huge option, 'write_size', to support large folio allocation based on the write size for the tmpfs write and fallocate paths.

The huge page allocation strategy for tmpfs is therefore: if a 'huge=' option is enabled (huge=always/within_size/advise) or the 'shmem_enabled' option is 'force', only PMD-sized THP is allowed, to keep backward compatibility for tmpfs. If the 'huge=' option is disabled (huge=never) or the 'shmem_enabled' option is 'deny', any large folio allocation remains disabled. Only when the 'huge=' option is 'write_size' will large folios be allocated based on the write size. I think the 'huge=write_size' option should become the default behavior for tmpfs in the future.

Any comments and suggestions are appreciated. Thanks.

Changes from RFC v2:
- Drop mTHP interfaces to control huge page allocation, per Matthew.
- Add a new helper to calculate the order, suggested by Matthew.
- Add a new huge=write_size option to allocate large folios based on the write size.
- Add a new patch to update the documentation.

Changes from RFC v1:
- Drop patch 1.
- Use 'write_end' to calculate the length in shmem_allowable_huge_orders().
- Update shmem_mapping_size_order() per Daniel.

Baolin Wang (4):
  mm: factor out the order calculation into a new helper
  mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap
  mm: shmem: add large folio support to the write and fallocate paths for tmpfs
  docs: tmpfs: add documentation for 'write_size' huge option

 Documentation/filesystems/tmpfs.rst |   7 +-
 include/linux/pagemap.h             |  16 ++++-
 mm/shmem.c                          | 105 ++++++++++++++++++++--------
 3 files changed, 94 insertions(+), 34 deletions(-)

--
2.39.3
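For illustration, here is a minimal userspace sketch of the order calculation that 'huge=write_size' implies: pick the largest folio order that still fits within the write, capped at PMD order. The function and constants here are hypothetical; the series' actual helper (shmem_mapping_size_order()) may differ in detail.

#include <stdio.h>

#define PAGE_SHIFT 12
#define PMD_ORDER  9	/* 2 MiB PMD with 4 KiB base pages (x86-64) */

/* Hypothetical: largest order whose folio still fits in [pos, write_end). */
static unsigned int size_to_order(unsigned long write_end, unsigned long pos)
{
	unsigned long len = write_end - pos;
	unsigned int order = 0;

	while (order < PMD_ORDER && (1UL << (order + 1 + PAGE_SHIFT)) <= len)
		order++;
	return order;
}

int main(void)
{
	printf("4K write  -> order %u\n", size_to_order(4096, 0));
	printf("64K write -> order %u\n", size_to_order(65536, 0));
	printf("2M write  -> order %u\n", size_to_order(2UL << 20, 0));
	return 0;
}

Under this rule a consistent 4K writer keeps getting order-0 pages, which is exactly the scenario raised in the discussion below where 'write_size' cannot match PMD-sized THP performance.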
On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: > Considering that tmpfs already has the 'huge=' option to control the THP > allocation, it is necessary to maintain compatibility with the 'huge=' > option, as well as considering the 'deny' and 'force' option controlled > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. No, it's not. No other filesystem honours these settings. tmpfs would not have had these settings if it were written today. It should simply ignore them, the way that NFS ignores the "intr" mount option now that we have a better solution to the original problem. To reiterate my position: - When using tmpfs as a filesystem, it should behave like other filesystems. - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should behave like anonymous memory. No more special mount options.
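For readers less familiar with the second case Matthew lists: a MAP_ANONYMOUS | MAP_SHARED mapping is backed by the kernel's internal shmem instance, which is why tmpfs policy decisions leak into "anonymous" shared memory. A small self-contained sketch:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	/* Backed by the kernel-internal shmem/tmpfs mount, not by a file. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	strcpy(p, "shared anonymous memory");
	printf("%s\n", p);
	munmap(p, 4096);
	return 0;
}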
+ Kirill On 2024/10/16 22:06, Matthew Wilcox wrote: > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: >> Considering that tmpfs already has the 'huge=' option to control the THP >> allocation, it is necessary to maintain compatibility with the 'huge=' >> option, as well as considering the 'deny' and 'force' option controlled >> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. > > No, it's not. No other filesystem honours these settings. tmpfs would > not have had these settings if it were written today. It should simply > ignore them, the way that NFS ignores the "intr" mount option now that > we have a better solution to the original problem. > > To reiterate my position: > > - When using tmpfs as a filesystem, it should behave like other > filesystems. > - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should > behave like anonymous memory. I do agree with your point to some extent, but the ‘huge=’ option has existed for nearly 8 years, and the huge orders based on write size may not achieve the performance of PMD-sized THP in some scenarios, such as when the write length is consistently 4K. So, I am still concerned that ignoring the 'huge' option could lead to compatibility issues. Another possible choice is to make the huge pages allocation based on write size as the *default* behavior for tmpfs, while marking the 'huge=' option as deprecated and gradually removing it if there are no user complaints about performance issues. Let's also see what Hugh and Kirill think. Hugh, Kirill, do you have any inputs?
On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: > + Kirill > > On 2024/10/16 22:06, Matthew Wilcox wrote: > > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: > > > Considering that tmpfs already has the 'huge=' option to control the THP > > > allocation, it is necessary to maintain compatibility with the 'huge=' > > > option, as well as considering the 'deny' and 'force' option controlled > > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. > > > > No, it's not. No other filesystem honours these settings. tmpfs would > > not have had these settings if it were written today. It should simply > > ignore them, the way that NFS ignores the "intr" mount option now that > > we have a better solution to the original problem. > > > > To reiterate my position: > > > > - When using tmpfs as a filesystem, it should behave like other > > filesystems. > > - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should > > behave like anonymous memory. > > I do agree with your point to some extent, but the ‘huge=’ option has > existed for nearly 8 years, and the huge orders based on write size may not > achieve the performance of PMD-sized THP in some scenarios, such as when the > write length is consistently 4K. So, I am still concerned that ignoring the > 'huge' option could lead to compatibility issues. Yeah, I don't think we are there yet to ignore the mount option. Maybe we need to get a new generic interface to request the semantics tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_* handles to make kernel allocate PMD-size folio on any allocation or on allocations within i_size. I think this behaviour is useful beyond tmpfs. Then huge= implementation for tmpfs can be re-defined to set these per-inode FADV_ flags by default. This way we can keep tmpfs compatible with current deployments and less special comparing to rest of filesystems on kernel side. If huge= is not set, tmpfs would behave the same way as the rest of filesystems. -- Kiryl Shutsemau / Kirill A. Shutemov
On 2024/10/17 19:26, Kirill A. Shutemov wrote: > On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: >> + Kirill >> >> On 2024/10/16 22:06, Matthew Wilcox wrote: >>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: >>>> Considering that tmpfs already has the 'huge=' option to control the THP >>>> allocation, it is necessary to maintain compatibility with the 'huge=' >>>> option, as well as considering the 'deny' and 'force' option controlled >>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >>> >>> No, it's not.  No other filesystem honours these settings.  tmpfs would >>> not have had these settings if it were written today.  It should simply >>> ignore them, the way that NFS ignores the "intr" mount option now that >>> we have a better solution to the original problem. >>> >>> To reiterate my position: >>> >>>   - When using tmpfs as a filesystem, it should behave like other >>>     filesystems. >>>   - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should >>>     behave like anonymous memory. >> >> I do agree with your point to some extent, but the ‘huge=’ option has >> existed for nearly 8 years, and the huge orders based on write size may not >> achieve the performance of PMD-sized THP in some scenarios, such as when the >> write length is consistently 4K. So, I am still concerned that ignoring the >> 'huge' option could lead to compatibility issues. > > Yeah, I don't think we are there yet to ignore the mount option. OK. > Maybe we need to get a new generic interface to request the semantics > tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_* > handles to make kernel allocate PMD-size folio on any allocation or on > allocations within i_size. I think this behaviour is useful beyond tmpfs. > > Then huge= implementation for tmpfs can be re-defined to set these > per-inode FADV_ flags by default. This way we can keep tmpfs compatible > with current deployments and less special comparing to rest of > filesystems on kernel side. I did a quick search, and I didn't find any other fs that requires PMD-sized huge pages, so I am not sure if FADV_* is useful for filesystems other than tmpfs. Please correct me if I missed something. > If huge= is not set, tmpfs would behave the same way as the rest of > filesystems. So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large folios based on the write size? If yes, that means it will change the default huge behavior for tmpfs, because previously leaving 'huge=' unset meant the huge option was 'SHMEM_HUGE_NEVER', which is similar to what I mentioned: "Another possible choice is to make the huge pages allocation based on write size as the *default* behavior for tmpfs, ..."
On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote: > > > On 2024/10/17 19:26, Kirill A. Shutemov wrote: > > On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: > > > + Kirill > > > > > > On 2024/10/16 22:06, Matthew Wilcox wrote: > > > > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: > > > > > Considering that tmpfs already has the 'huge=' option to control the THP > > > > > allocation, it is necessary to maintain compatibility with the 'huge=' > > > > > option, as well as considering the 'deny' and 'force' option controlled > > > > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. > > > > > > > > No, it's not. No other filesystem honours these settings. tmpfs would > > > > not have had these settings if it were written today. It should simply > > > > ignore them, the way that NFS ignores the "intr" mount option now that > > > > we have a better solution to the original problem. > > > > > > > > To reiterate my position: > > > > > > > > - When using tmpfs as a filesystem, it should behave like other > > > > filesystems. > > > > - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should > > > > behave like anonymous memory. > > > > > > I do agree with your point to some extent, but the ‘huge=’ option has > > > existed for nearly 8 years, and the huge orders based on write size may not > > > achieve the performance of PMD-sized THP in some scenarios, such as when the > > > write length is consistently 4K. So, I am still concerned that ignoring the > > > 'huge' option could lead to compatibility issues. > > > > Yeah, I don't think we are there yet to ignore the mount option. > > OK. > > > Maybe we need to get a new generic interface to request the semantics > > tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_* > > handles to make kernel allocate PMD-size folio on any allocation or on > > allocations within i_size. I think this behaviour is useful beyond tmpfs. > > > > Then huge= implementation for tmpfs can be re-defined to set these > > per-inode FADV_ flags by default. This way we can keep tmpfs compatible > > with current deployments and less special comparing to rest of > > filesystems on kernel side. > > I did a quick search, and I didn't find any other fs that require PMD-sized > huge pages, so I am not sure if FADV_* is useful for filesystems other than > tmpfs. Please correct me if I missed something. What do you mean by "require"? THPs are always opportunistic. IIUC, we don't have a way to hint kernel to use huge pages for a file on read from backing storage. Readahead is not always the right way. > > If huge= is not set, tmpfs would behave the same way as the rest of > > filesystems. > > So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large > folios based on the write size? If yes, that means it will change the > default huge behavior for tmpfs. Because previously having 'huge=' is not > set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I > mentioned: > "Another possible choice is to make the huge pages allocation based on write > size as the *default* behavior for tmpfs, ..." I am more worried about breaking existing users of huge pages. So changing behaviour of users who don't specify huge is okay to me. -- Kiryl Shutsemau / Kirill A. Shutemov
On 2024/10/21 16:54, Kirill A. Shutemov wrote: > On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote: >> >> >> On 2024/10/17 19:26, Kirill A. Shutemov wrote: >>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: >>>> + Kirill >>>> >>>> On 2024/10/16 22:06, Matthew Wilcox wrote: >>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: >>>>>> Considering that tmpfs already has the 'huge=' option to control the THP >>>>>> allocation, it is necessary to maintain compatibility with the 'huge=' >>>>>> option, as well as considering the 'deny' and 'force' option controlled >>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >>>>> >>>>> No, it's not.  No other filesystem honours these settings.  tmpfs would >>>>> not have had these settings if it were written today.  It should simply >>>>> ignore them, the way that NFS ignores the "intr" mount option now that >>>>> we have a better solution to the original problem. >>>>> >>>>> To reiterate my position: >>>>> >>>>>    - When using tmpfs as a filesystem, it should behave like other >>>>>      filesystems. >>>>>    - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should >>>>>      behave like anonymous memory. >>>> >>>> I do agree with your point to some extent, but the ‘huge=’ option has >>>> existed for nearly 8 years, and the huge orders based on write size may not >>>> achieve the performance of PMD-sized THP in some scenarios, such as when the >>>> write length is consistently 4K. So, I am still concerned that ignoring the >>>> 'huge' option could lead to compatibility issues. >>> >>> Yeah, I don't think we are there yet to ignore the mount option. >> >> OK. >> >>> Maybe we need to get a new generic interface to request the semantics >>> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_* >>> handles to make kernel allocate PMD-size folio on any allocation or on >>> allocations within i_size. I think this behaviour is useful beyond tmpfs. >>> >>> Then huge= implementation for tmpfs can be re-defined to set these >>> per-inode FADV_ flags by default. This way we can keep tmpfs compatible >>> with current deployments and less special comparing to rest of >>> filesystems on kernel side. >> >> I did a quick search, and I didn't find any other fs that require PMD-sized >> huge pages, so I am not sure if FADV_* is useful for filesystems other than >> tmpfs. Please correct me if I missed something. > > What do you mean by "require"? THPs are always opportunistic. > > IIUC, we don't have a way to hint kernel to use huge pages for a file on > read from backing storage. Readahead is not always the right way. IIUC, most file systems use a method similar to iomap buffered IO (see iomap_get_folio()) to allocate huge pages. What I mean is that it would be better to have a real use case to add a hint for allocating THP (other than tmpfs). >>> If huge= is not set, tmpfs would behave the same way as the rest of >>> filesystems. >> >> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large >> folios based on the write size? If yes, that means it will change the >> default huge behavior for tmpfs. Because previously having 'huge=' is not >> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I >> mentioned: >> "Another possible choice is to make the huge pages allocation based on write >> size as the *default* behavior for tmpfs, ..." > > I am more worried about breaking existing users of huge pages.
> So changing behaviour of users who don't specify huge is okay to me. OK. Good.
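As context for the iomap reference above: iomap's buffered-write path asks the page cache for a folio sized to the write via fgf_set_order(). A kernel-context sketch of that pattern follows (not standalone-buildable; the function name is illustrative, the calls mirror what iomap_get_folio() does):

#include <linux/pagemap.h>

/*
 * Kernel-context sketch of the iomap pattern: encode a folio-order hint
 * derived from the write length into the FGP flags. The page cache may
 * still fall back to a smaller folio if a large allocation fails.
 */
static struct folio *example_get_folio(struct address_space *mapping,
				       loff_t pos, size_t len)
{
	fgf_t fgp = FGP_WRITEBEGIN | fgf_set_order(len);

	return __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
				   mapping_gfp_mask(mapping));
}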
On Tue, Oct 22, 2024 at 11:34:14AM +0800, Baolin Wang wrote: > IIUC, most file systems use method similar to iomap buffered IO (see > iomap_get_folio()) to allocate huge pages. What I mean is that, it would be > better to have a real use case to add a hint for allocating THP (other than > tmpfs). It would be nice to hear from folks who work with production systems what the actual needs are. But I find the asymmetry between MADV_ hints and FADV_ hints wrt huge pages unjustified. I think it would be easy to find use-cases for FADV_HUGEPAGE/FADV_NOHUGEPAGE. Furthermore, I think it would be useful to have some kind of mechanism to make these hints persistent: any open of a file would have these hints set by default based on inode metadata on backing storage. Although I am not sure what the right way to achieve that is. xattrs? -- Kiryl Shutsemau / Kirill A. Shutemov
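The madvise() half of the asymmetry Kirill mentions exists today, but only per-VMA; a small sketch of the status quo (the FADV_HUGEPAGE counterpart discussed here is hypothetical and does not exist in current kernels, and the file path is an arbitrary example):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;
	int fd = open("/dev/shm/example", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, len)) {
		perror("setup");
		return 1;
	}

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Per-VMA hint: exists today. A hypothetical FADV_HUGEPAGE would
	 * instead attach to the inode, so it could survive across mmap()s
	 * and also steer read()/write() allocations.
	 */
	if (madvise(p, len, MADV_HUGEPAGE))
		perror("madvise");

	munmap(p, len);
	close(fd);
	return 0;
}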
On 2024/10/22 18:06, Kirill A. Shutemov wrote: > On Tue, Oct 22, 2024 at 11:34:14AM +0800, Baolin Wang wrote: >> IIUC, most file systems use method similar to iomap buffered IO (see >> iomap_get_folio()) to allocate huge pages. What I mean is that, it would be >> better to have a real use case to add a hint for allocating THP (other than >> tmpfs). > > I would be nice to hear from folks who works with production what the > actual needs are. > > But I find asymmetry between MADV_ hints and FADV_ hints wrt huge pages > not justified. I think it would be easy to find use-cases for > FADV_HUGEPAGE/FADV_NOHUGEPAGE. > > Furthermore I think it would be useful to have some kind of mechanism to > make these hints persistent: any open of a file would have these hints set > by default based on inode metadata on backing storage. Although, I am not > sure what the right way to archive that. xattrs? Maybe we can re-use mapping_set_folio_order_range()?
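mapping_set_folio_order_range() exists in recent kernels to constrain which folio orders the page cache may allocate for a mapping. A kernel-context sketch of the kind of reuse being suggested (the call site is hypothetical):

#include <linux/huge_mm.h>
#include <linux/pagemap.h>

/*
 * Kernel-context sketch: allow the page cache to allocate folios of
 * order 0 up to PMD order for this inode's mapping. Today the helper
 * is used e.g. for block-size > page-size support; the idea above is
 * to reuse the same mechanism to make a per-inode huge-page hint
 * persistent on the mapping.
 */
static void example_allow_large_folios(struct address_space *mapping)
{
	mapping_set_folio_order_range(mapping, 0, HPAGE_PMD_ORDER);
}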
On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote: > On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote: >> >> >> On 2024/10/17 19:26, Kirill A. Shutemov wrote: >> > On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: >> > > + Kirill >> > > >> > > On 2024/10/16 22:06, Matthew Wilcox wrote: >> > > > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: >> > > > > Considering that tmpfs already has the 'huge=' option to control the THP >> > > > > allocation, it is necessary to maintain compatibility with the 'huge=' >> > > > > option, as well as considering the 'deny' and 'force' option controlled >> > > > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >> > > > >> > > > No, it's not. No other filesystem honours these settings. tmpfs would >> > > > not have had these settings if it were written today. It should simply >> > > > ignore them, the way that NFS ignores the "intr" mount option now that >> > > > we have a better solution to the original problem. >> > > > >> > > > To reiterate my position: >> > > > >> > > > - When using tmpfs as a filesystem, it should behave like other >> > > > filesystems. >> > > > - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should >> > > > behave like anonymous memory. >> > > >> > > I do agree with your point to some extent, but the ‘huge=’ option has >> > > existed for nearly 8 years, and the huge orders based on write size may not >> > > achieve the performance of PMD-sized THP in some scenarios, such as when the >> > > write length is consistently 4K. So, I am still concerned that ignoring the >> > > 'huge' option could lead to compatibility issues. >> > >> > Yeah, I don't think we are there yet to ignore the mount option. >> >> OK. >> >> > Maybe we need to get a new generic interface to request the semantics >> > tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_* >> > handles to make kernel allocate PMD-size folio on any allocation or on >> > allocations within i_size. I think this behaviour is useful beyond tmpfs. >> > >> > Then huge= implementation for tmpfs can be re-defined to set these >> > per-inode FADV_ flags by default. This way we can keep tmpfs compatible >> > with current deployments and less special comparing to rest of >> > filesystems on kernel side. >> >> I did a quick search, and I didn't find any other fs that require PMD-sized >> huge pages, so I am not sure if FADV_* is useful for filesystems other than >> tmpfs. Please correct me if I missed something. > > What do you mean by "require"? THPs are always opportunistic. > > IIUC, we don't have a way to hint kernel to use huge pages for a file on > read from backing storage. Readahead is not always the right way. > >> > If huge= is not set, tmpfs would behave the same way as the rest of >> > filesystems. >> >> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large >> folios based on the write size? If yes, that means it will change the >> default huge behavior for tmpfs. Because previously having 'huge=' is not >> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I >> mentioned: >> "Another possible choice is to make the huge pages allocation based on write >> size as the *default* behavior for tmpfs, ..." > > I am more worried about breaking existing users of huge pages. So changing > behaviour of users who don't specify huge is okay to me. 
I think moving tmpfs to allocate large folios opportunistically by default (as it was proposed initially) doesn't necessarily conflict with the default behaviour (huge=never). We just need to clarify that in the documentation. However, and IIRC, one of the requests from Hugh was to have a way to disable large folios, which is something other filesystems do not have control over as of today. Ryan sent a proposal to actually control that globally, but I think it didn't move forward. So, what are we missing to go back to implementing large folios in tmpfs in the default case, like any other fs leveraging large folios?
On 2024/10/21 21:34, Daniel Gomez wrote: > On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote: >> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote: >>> >>> >>> On 2024/10/17 19:26, Kirill A. Shutemov wrote: >>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: >>>>> + Kirill >>>>> >>>>> On 2024/10/16 22:06, Matthew Wilcox wrote: >>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: >>>>>>> Considering that tmpfs already has the 'huge=' option to control the THP >>>>>>> allocation, it is necessary to maintain compatibility with the 'huge=' >>>>>>> option, as well as considering the 'deny' and 'force' option controlled >>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >>>>>> >>>>>> No, it's not. No other filesystem honours these settings. tmpfs would >>>>>> not have had these settings if it were written today. It should simply >>>>>> ignore them, the way that NFS ignores the "intr" mount option now that >>>>>> we have a better solution to the original problem. >>>>>> >>>>>> To reiterate my position: >>>>>> >>>>>> - When using tmpfs as a filesystem, it should behave like other >>>>>> filesystems. >>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should >>>>>> behave like anonymous memory. >>>>> >>>>> I do agree with your point to some extent, but the ‘huge=’ option has >>>>> existed for nearly 8 years, and the huge orders based on write size may not >>>>> achieve the performance of PMD-sized THP in some scenarios, such as when the >>>>> write length is consistently 4K. So, I am still concerned that ignoring the >>>>> 'huge' option could lead to compatibility issues. >>>> >>>> Yeah, I don't think we are there yet to ignore the mount option. >>> >>> OK. >>> >>>> Maybe we need to get a new generic interface to request the semantics >>>> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_* >>>> handles to make kernel allocate PMD-size folio on any allocation or on >>>> allocations within i_size. I think this behaviour is useful beyond tmpfs. >>>> >>>> Then huge= implementation for tmpfs can be re-defined to set these >>>> per-inode FADV_ flags by default. This way we can keep tmpfs compatible >>>> with current deployments and less special comparing to rest of >>>> filesystems on kernel side. >>> >>> I did a quick search, and I didn't find any other fs that require PMD-sized >>> huge pages, so I am not sure if FADV_* is useful for filesystems other than >>> tmpfs. Please correct me if I missed something. >> >> What do you mean by "require"? THPs are always opportunistic. >> >> IIUC, we don't have a way to hint kernel to use huge pages for a file on >> read from backing storage. Readahead is not always the right way. >> >>>> If huge= is not set, tmpfs would behave the same way as the rest of >>>> filesystems. >>> >>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large >>> folios based on the write size? If yes, that means it will change the >>> default huge behavior for tmpfs. Because previously having 'huge=' is not >>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I >>> mentioned: >>> "Another possible choice is to make the huge pages allocation based on write >>> size as the *default* behavior for tmpfs, ..." >> >> I am more worried about breaking existing users of huge pages. So changing >> behaviour of users who don't specify huge is okay to me. 
>> So changing behaviour of users who don't specify huge is okay to me. > > I think moving tmpfs to allocate large folios opportunistically by > default (as it was proposed initially) doesn't necessary conflict with > the default behaviour (huge=never). We just need to clarify that in > the documentation. > > However, and IIRC, one of the requests from Hugh was to have a way to > disable large folios which is something other FS do not have control > of as of today. Ryan sent a proposal to actually control that globally > but I think it didn't move forward. So, what are we missing to go back > to implement large folios in tmpfs in the default case, as any other fs > leveraging large folios? IMHO, as I discussed with Kirill, we still need to maintain compatibility with the 'huge=' mount option. This means that if 'huge=never' is set for tmpfs, huge page allocation will still be prohibited (which can address Hugh's request?). However, if 'huge=' is not set, we can allocate large folios based on the write size.
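A hypothetical sketch of the compatibility rule proposed above (not the series' actual code): 'huge=never' keeps prohibiting large folios, the legacy options keep PMD-only behaviour, and an unset 'huge=' allows orders up to the write size.

#define PMD_ORDER 9	/* assumption: 4 KiB base pages */

enum huge_opt { HUGE_UNSET, HUGE_NEVER, HUGE_ALWAYS, HUGE_WITHIN_SIZE,
		HUGE_ADVISE };

/* Hypothetical: return a bitmap of folio orders allowed for a write. */
static unsigned long tmpfs_allowed_orders(enum huge_opt opt,
					  unsigned int write_order)
{
	switch (opt) {
	case HUGE_NEVER:		/* still prohibits large folios */
		return 0;
	case HUGE_ALWAYS:		/* legacy options stay PMD-only */
	case HUGE_WITHIN_SIZE:
	case HUGE_ADVISE:
		return 1UL << PMD_ORDER;
	case HUGE_UNSET:		/* new default: sized by the write */
		return (1UL << (write_order + 1)) - 1;
	}
	return 0;
}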
On 22.10.24 05:41, Baolin Wang wrote: > > > On 2024/10/21 21:34, Daniel Gomez wrote: >> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote: >>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote: >>>> >>>> >>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote: >>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: >>>>>> + Kirill >>>>>> >>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote: >>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: >>>>>>>> Considering that tmpfs already has the 'huge=' option to control the THP >>>>>>>> allocation, it is necessary to maintain compatibility with the 'huge=' >>>>>>>> option, as well as considering the 'deny' and 'force' option controlled >>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >>>>>>> >>>>>>> No, it's not. No other filesystem honours these settings. tmpfs would >>>>>>> not have had these settings if it were written today. It should simply >>>>>>> ignore them, the way that NFS ignores the "intr" mount option now that >>>>>>> we have a better solution to the original problem. >>>>>>> >>>>>>> To reiterate my position: >>>>>>> >>>>>>> - When using tmpfs as a filesystem, it should behave like other >>>>>>> filesystems. >>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should >>>>>>> behave like anonymous memory. >>>>>> >>>>>> I do agree with your point to some extent, but the ‘huge=’ option has >>>>>> existed for nearly 8 years, and the huge orders based on write size may not >>>>>> achieve the performance of PMD-sized THP in some scenarios, such as when the >>>>>> write length is consistently 4K. So, I am still concerned that ignoring the >>>>>> 'huge' option could lead to compatibility issues. >>>>> >>>>> Yeah, I don't think we are there yet to ignore the mount option. >>>> >>>> OK. >>>> >>>>> Maybe we need to get a new generic interface to request the semantics >>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_* >>>>> handles to make kernel allocate PMD-size folio on any allocation or on >>>>> allocations within i_size. I think this behaviour is useful beyond tmpfs. >>>>> >>>>> Then huge= implementation for tmpfs can be re-defined to set these >>>>> per-inode FADV_ flags by default. This way we can keep tmpfs compatible >>>>> with current deployments and less special comparing to rest of >>>>> filesystems on kernel side. >>>> >>>> I did a quick search, and I didn't find any other fs that require PMD-sized >>>> huge pages, so I am not sure if FADV_* is useful for filesystems other than >>>> tmpfs. Please correct me if I missed something. >>> >>> What do you mean by "require"? THPs are always opportunistic. >>> >>> IIUC, we don't have a way to hint kernel to use huge pages for a file on >>> read from backing storage. Readahead is not always the right way. >>> >>>>> If huge= is not set, tmpfs would behave the same way as the rest of >>>>> filesystems. >>>> >>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large >>>> folios based on the write size? If yes, that means it will change the >>>> default huge behavior for tmpfs. Because previously having 'huge=' is not >>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I >>>> mentioned: >>>> "Another possible choice is to make the huge pages allocation based on write >>>> size as the *default* behavior for tmpfs, ..." >>> >>> I am more worried about breaking existing users of huge pages. 
>>> So changing behaviour of users who don't specify huge is okay to me. >> >> I think moving tmpfs to allocate large folios opportunistically by >> default (as it was proposed initially) doesn't necessary conflict with >> the default behaviour (huge=never). We just need to clarify that in >> the documentation. >> >> However, and IIRC, one of the requests from Hugh was to have a way to >> disable large folios which is something other FS do not have control >> of as of today. Ryan sent a proposal to actually control that globally >> but I think it didn't move forward. So, what are we missing to go back >> to implement large folios in tmpfs in the default case, as any other fs >> leveraging large folios? > > IMHO, as I discussed with Kirill, we still need maintain compatibility > with the 'huge=' mount option. This means that if 'huge=never' is set > for tmpfs, huge page allocation will still be prohibited (which can > address Hugh's request?). However, if 'huge=' is not set, we can > allocate large folios based on the write size. I consider allocating large folios in shmem/tmpfs on the write path less controversial than allocating them on the page fault path -- especially as long as we stay within the size to-be-written. I think in RHEL, THP on shmem/tmpfs is disabled by default (e.g., shmem_enabled=never). Maybe because of some rather undesired side-effects (maybe some are historical?): I recall issues with VMs with THP + memory ballooning, as we cannot reclaim pages of folios if splitting fails. I assume most of these problematic use cases don't use tmpfs as an ordinary file system (write()/read()), but mmap() the whole thing. Sadly, I can't find any information about shmem/tmpfs + THP in the RHEL documentation; most documentation is only concerned about anon THP. Which makes me conclude that they are not suggested as of now. I see more issues with allocating them on the page fault path and not having a way to disable it -- compared to allocating them on the write() path. Getting Hugh's opinion on this would be very valuable. -- Cheers, David / dhildenb
On 2024/10/22 23:31, David Hildenbrand wrote: > On 22.10.24 05:41, Baolin Wang wrote: >> >> >> On 2024/10/21 21:34, Daniel Gomez wrote: >>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote: >>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote: >>>>> >>>>> >>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote: >>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: >>>>>>> + Kirill >>>>>>> >>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote: >>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: >>>>>>>>> Considering that tmpfs already has the 'huge=' option to >>>>>>>>> control the THP >>>>>>>>> allocation, it is necessary to maintain compatibility with the >>>>>>>>> 'huge=' >>>>>>>>> option, as well as considering the 'deny' and 'force' option >>>>>>>>> controlled >>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >>>>>>>> >>>>>>>> No, it's not. No other filesystem honours these settings. >>>>>>>> tmpfs would >>>>>>>> not have had these settings if it were written today. It should >>>>>>>> simply >>>>>>>> ignore them, the way that NFS ignores the "intr" mount option >>>>>>>> now that >>>>>>>> we have a better solution to the original problem. >>>>>>>> >>>>>>>> To reiterate my position: >>>>>>>> >>>>>>>> - When using tmpfs as a filesystem, it should behave like >>>>>>>> other >>>>>>>> filesystems. >>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, >>>>>>>> it should >>>>>>>> behave like anonymous memory. >>>>>>> >>>>>>> I do agree with your point to some extent, but the ‘huge=’ option >>>>>>> has >>>>>>> existed for nearly 8 years, and the huge orders based on write >>>>>>> size may not >>>>>>> achieve the performance of PMD-sized THP in some scenarios, such >>>>>>> as when the >>>>>>> write length is consistently 4K. So, I am still concerned that >>>>>>> ignoring the >>>>>>> 'huge' option could lead to compatibility issues. >>>>>> >>>>>> Yeah, I don't think we are there yet to ignore the mount option. >>>>> >>>>> OK. >>>>> >>>>>> Maybe we need to get a new generic interface to request the semantics >>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of >>>>>> FADV_* >>>>>> handles to make kernel allocate PMD-size folio on any allocation >>>>>> or on >>>>>> allocations within i_size. I think this behaviour is useful beyond >>>>>> tmpfs. >>>>>> >>>>>> Then huge= implementation for tmpfs can be re-defined to set these >>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs >>>>>> compatible >>>>>> with current deployments and less special comparing to rest of >>>>>> filesystems on kernel side. >>>>> >>>>> I did a quick search, and I didn't find any other fs that require >>>>> PMD-sized >>>>> huge pages, so I am not sure if FADV_* is useful for filesystems >>>>> other than >>>>> tmpfs. Please correct me if I missed something. >>>> >>>> What do you mean by "require"? THPs are always opportunistic. >>>> >>>> IIUC, we don't have a way to hint kernel to use huge pages for a >>>> file on >>>> read from backing storage. Readahead is not always the right way. >>>> >>>>>> If huge= is not set, tmpfs would behave the same way as the rest of >>>>>> filesystems. >>>>> >>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still >>>>> allocate large >>>>> folios based on the write size? If yes, that means it will change the >>>>> default huge behavior for tmpfs. 
>>>>> Because previously having 'huge=' is not set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I mentioned: >>>>> "Another possible choice is to make the huge pages allocation based >>>>> on write >>>>> size as the *default* behavior for tmpfs, ..." >>>> >>>> I am more worried about breaking existing users of huge pages. So >>>> changing >>>> behaviour of users who don't specify huge is okay to me. >>> >>> I think moving tmpfs to allocate large folios opportunistically by >>> default (as it was proposed initially) doesn't necessary conflict with >>> the default behaviour (huge=never). We just need to clarify that in >>> the documentation. >>> >>> However, and IIRC, one of the requests from Hugh was to have a way to >>> disable large folios which is something other FS do not have control >>> of as of today. Ryan sent a proposal to actually control that globally >>> but I think it didn't move forward. So, what are we missing to go back >>> to implement large folios in tmpfs in the default case, as any other fs >>> leveraging large folios? >> >> IMHO, as I discussed with Kirill, we still need maintain compatibility >> with the 'huge=' mount option. This means that if 'huge=never' is set >> for tmpfs, huge page allocation will still be prohibited (which can >> address Hugh's request?). However, if 'huge=' is not set, we can >> allocate large folios based on the write size. > > I consider allocating large folios in shmem/tmpfs on the write path less > controversial than allocating them on the page fault path -- especially > as long as we stay within the size to-be-written. > > I think in RHEL THP on shmem/tmpfs are disabled as default (e.g., > shmem_enabled=never). Maybe because of some rather undesired > side-effects (maybe some are historical?): I recall issues with VMs with > THP+ memory ballooning, as we cannot reclaim pages of folios if > splitting fails). I assume most of these problematic use cases don't use > tmpfs as an ordinary file system (write()/read()), but mmap() the whole > thing. > > Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL > documentation; most documentation is only concerned about anon THP. > Which makes me conclude that they are not suggested as of now. > > I see more issues with allocating them on the page fault path and not > having a way to disable it -- compared to allocating them on the write() > path. I may not understand your issues. IIUC, you can disable allocating huge pages on the page fault path by using the 'huge=never' mount option or setting shmem_enabled=deny. No?
On 23.10.24 10:04, Baolin Wang wrote: > > > On 2024/10/22 23:31, David Hildenbrand wrote: >> On 22.10.24 05:41, Baolin Wang wrote: >>> >>> >>> On 2024/10/21 21:34, Daniel Gomez wrote: >>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote: >>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote: >>>>>> >>>>>> >>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote: >>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: >>>>>>>> + Kirill >>>>>>>> >>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote: >>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: >>>>>>>>>> Considering that tmpfs already has the 'huge=' option to >>>>>>>>>> control the THP >>>>>>>>>> allocation, it is necessary to maintain compatibility with the >>>>>>>>>> 'huge=' >>>>>>>>>> option, as well as considering the 'deny' and 'force' option >>>>>>>>>> controlled >>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >>>>>>>>> >>>>>>>>> No, it's not. No other filesystem honours these settings. >>>>>>>>> tmpfs would >>>>>>>>> not have had these settings if it were written today. It should >>>>>>>>> simply >>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option >>>>>>>>> now that >>>>>>>>> we have a better solution to the original problem. >>>>>>>>> >>>>>>>>> To reiterate my position: >>>>>>>>> >>>>>>>>> - When using tmpfs as a filesystem, it should behave like >>>>>>>>> other >>>>>>>>> filesystems. >>>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, >>>>>>>>> it should >>>>>>>>> behave like anonymous memory. >>>>>>>> >>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option >>>>>>>> has >>>>>>>> existed for nearly 8 years, and the huge orders based on write >>>>>>>> size may not >>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such >>>>>>>> as when the >>>>>>>> write length is consistently 4K. So, I am still concerned that >>>>>>>> ignoring the >>>>>>>> 'huge' option could lead to compatibility issues. >>>>>>> >>>>>>> Yeah, I don't think we are there yet to ignore the mount option. >>>>>> >>>>>> OK. >>>>>> >>>>>>> Maybe we need to get a new generic interface to request the semantics >>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of >>>>>>> FADV_* >>>>>>> handles to make kernel allocate PMD-size folio on any allocation >>>>>>> or on >>>>>>> allocations within i_size. I think this behaviour is useful beyond >>>>>>> tmpfs. >>>>>>> >>>>>>> Then huge= implementation for tmpfs can be re-defined to set these >>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs >>>>>>> compatible >>>>>>> with current deployments and less special comparing to rest of >>>>>>> filesystems on kernel side. >>>>>> >>>>>> I did a quick search, and I didn't find any other fs that require >>>>>> PMD-sized >>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems >>>>>> other than >>>>>> tmpfs. Please correct me if I missed something. >>>>> >>>>> What do you mean by "require"? THPs are always opportunistic. >>>>> >>>>> IIUC, we don't have a way to hint kernel to use huge pages for a >>>>> file on >>>>> read from backing storage. Readahead is not always the right way. >>>>> >>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of >>>>>>> filesystems. >>>>>> >>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still >>>>>> allocate large >>>>>> folios based on the write size? If yes, that means it will change the >>>>>> default huge behavior for tmpfs. 
>>>>>> Because previously having 'huge=' is not set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I mentioned: >>>>>> "Another possible choice is to make the huge pages allocation based >>>>>> on write >>>>>> size as the *default* behavior for tmpfs, ..." >>>>> >>>>> I am more worried about breaking existing users of huge pages. So >>>>> changing >>>>> behaviour of users who don't specify huge is okay to me. >>>> >>>> I think moving tmpfs to allocate large folios opportunistically by >>>> default (as it was proposed initially) doesn't necessary conflict with >>>> the default behaviour (huge=never). We just need to clarify that in >>>> the documentation. >>>> >>>> However, and IIRC, one of the requests from Hugh was to have a way to >>>> disable large folios which is something other FS do not have control >>>> of as of today. Ryan sent a proposal to actually control that globally >>>> but I think it didn't move forward. So, what are we missing to go back >>>> to implement large folios in tmpfs in the default case, as any other fs >>>> leveraging large folios? >>> >>> IMHO, as I discussed with Kirill, we still need maintain compatibility >>> with the 'huge=' mount option. This means that if 'huge=never' is set >>> for tmpfs, huge page allocation will still be prohibited (which can >>> address Hugh's request?). However, if 'huge=' is not set, we can >>> allocate large folios based on the write size. >> >> I consider allocating large folios in shmem/tmpfs on the write path less >> controversial than allocating them on the page fault path -- especially >> as long as we stay within the size to-be-written. >> >> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g., >> shmem_enabled=never). Maybe because of some rather undesired >> side-effects (maybe some are historical?): I recall issues with VMs with >> THP+ memory ballooning, as we cannot reclaim pages of folios if >> splitting fails). I assume most of these problematic use cases don't use >> tmpfs as an ordinary file system (write()/read()), but mmap() the whole >> thing. >> >> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL >> documentation; most documentation is only concerned about anon THP. >> Which makes me conclude that they are not suggested as of now. >> >> I see more issues with allocating them on the page fault path and not >> having a way to disable it -- compared to allocating them on the write() >> path. > > I may not understand your issues. IIUC, you can disable allocating huge > pages on the page fault path by using the 'huge=never' mount option or > setting shmem_enabled=deny. No? That's what I am saying: if there is some way to disable it that will keep working, great. -- Cheers, David / dhildenb
On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote: > On 23.10.24 10:04, Baolin Wang wrote: > > > > > > On 2024/10/22 23:31, David Hildenbrand wrote: > >> On 22.10.24 05:41, Baolin Wang wrote: > >>> > >>> > >>> On 2024/10/21 21:34, Daniel Gomez wrote: > >>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote: > >>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote: > >>>>>> > >>>>>> > >>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote: > >>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: > >>>>>>>> + Kirill > >>>>>>>> > >>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote: > >>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: > >>>>>>>>>> Considering that tmpfs already has the 'huge=' option to > >>>>>>>>>> control the THP > >>>>>>>>>> allocation, it is necessary to maintain compatibility with the > >>>>>>>>>> 'huge=' > >>>>>>>>>> option, as well as considering the 'deny' and 'force' option > >>>>>>>>>> controlled > >>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. > >>>>>>>>> > >>>>>>>>> No, it's not. No other filesystem honours these settings. > >>>>>>>>> tmpfs would > >>>>>>>>> not have had these settings if it were written today. It should > >>>>>>>>> simply > >>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option > >>>>>>>>> now that > >>>>>>>>> we have a better solution to the original problem. > >>>>>>>>> > >>>>>>>>> To reiterate my position: > >>>>>>>>> > >>>>>>>>> - When using tmpfs as a filesystem, it should behave like > >>>>>>>>> other > >>>>>>>>> filesystems. > >>>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, > >>>>>>>>> it should > >>>>>>>>> behave like anonymous memory. > >>>>>>>> > >>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option > >>>>>>>> has > >>>>>>>> existed for nearly 8 years, and the huge orders based on write > >>>>>>>> size may not > >>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such > >>>>>>>> as when the > >>>>>>>> write length is consistently 4K. So, I am still concerned that > >>>>>>>> ignoring the > >>>>>>>> 'huge' option could lead to compatibility issues. > >>>>>>> > >>>>>>> Yeah, I don't think we are there yet to ignore the mount option. > >>>>>> > >>>>>> OK. > >>>>>> > >>>>>>> Maybe we need to get a new generic interface to request the semantics > >>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of > >>>>>>> FADV_* > >>>>>>> handles to make kernel allocate PMD-size folio on any allocation > >>>>>>> or on > >>>>>>> allocations within i_size. I think this behaviour is useful beyond > >>>>>>> tmpfs. > >>>>>>> > >>>>>>> Then huge= implementation for tmpfs can be re-defined to set these > >>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs > >>>>>>> compatible > >>>>>>> with current deployments and less special comparing to rest of > >>>>>>> filesystems on kernel side. > >>>>>> > >>>>>> I did a quick search, and I didn't find any other fs that require > >>>>>> PMD-sized > >>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems > >>>>>> other than > >>>>>> tmpfs. Please correct me if I missed something. > >>>>> > >>>>> What do you mean by "require"? THPs are always opportunistic. > >>>>> > >>>>> IIUC, we don't have a way to hint kernel to use huge pages for a > >>>>> file on > >>>>> read from backing storage. Readahead is not always the right way. 
> >>>>> > >>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of > >>>>>>> filesystems. > >>>>>> > >>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still > >>>>>> allocate large > >>>>>> folios based on the write size? If yes, that means it will change the > >>>>>> default huge behavior for tmpfs. Because previously having 'huge=' > >>>>>> is not > >>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar > >>>>>> to what I > >>>>>> mentioned: > >>>>>> "Another possible choice is to make the huge pages allocation based > >>>>>> on write > >>>>>> size as the *default* behavior for tmpfs, ..." > >>>>> > >>>>> I am more worried about breaking existing users of huge pages. So > >>>>> changing > >>>>> behaviour of users who don't specify huge is okay to me. > >>>> > >>>> I think moving tmpfs to allocate large folios opportunistically by > >>>> default (as it was proposed initially) doesn't necessary conflict with > >>>> the default behaviour (huge=never). We just need to clarify that in > >>>> the documentation. > >>>> > >>>> However, and IIRC, one of the requests from Hugh was to have a way to > >>>> disable large folios which is something other FS do not have control > >>>> of as of today. Ryan sent a proposal to actually control that globally > >>>> but I think it didn't move forward. So, what are we missing to go back > >>>> to implement large folios in tmpfs in the default case, as any other fs > >>>> leveraging large folios? > >>> > >>> IMHO, as I discussed with Kirill, we still need maintain compatibility > >>> with the 'huge=' mount option. This means that if 'huge=never' is set > >>> for tmpfs, huge page allocation will still be prohibited (which can > >>> address Hugh's request?). However, if 'huge=' is not set, we can > >>> allocate large folios based on the write size. So, in order to make tmpfs behave like other filesystems, we need to allocate large folios by default. Not setting 'huge=' is the same as setting it to 'huge=never' as per documentation. But 'huge=' is meant to control THP, not large folios, so it should not have a conflict here, or else, what case are you thinking? Perhaps when the large folio order is the same as PMD-size? > >> > >> I consider allocating large folios in shmem/tmpfs on the write path less > >> controversial than allocating them on the page fault path -- especially > >> as long as we stay within the size to-be-written. > >> > >> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g., > >> shmem_enabled=never). Maybe because of some rather undesired > >> side-effects (maybe some are historical?): I recall issues with VMs with > >> THP+ memory ballooning, as we cannot reclaim pages of folios if > >> splitting fails). I assume most of these problematic use cases don't use > >> tmpfs as an ordinary file system (write()/read()), but mmap() the whole > >> thing. > >> > >> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL > >> documentation; most documentation is only concerned about anon THP. > >> Which makes me conclude that they are not suggested as of now.
> >> > >> I see more issues with allocating them on the page fault path and not > >> having a way to disable it -- compared to allocating them on the write() > >> path. > > > > I may not understand your issues. IIUC, you can disable allocating huge > > pages on the page fault path by using the 'huge=never' mount option or > > setting shmem_enabled=deny. No? > > That's what I am saying: if there is some way to disable it that will > keep working, great. I agree. That aligns with what I recall Hugh requested. However, I believe that if that is the way to go, we shouldn't limit it to tmpfs. Otherwise, why should tmpfs be prevented from allocating large folios if other filesystems in the system are allowed to allocate them? I think if we want to disable large folios, we should make it more generic, something similar to Ryan's proposal [1] for controlling folio sizes. [1] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/ That said, there has already been disagreement on this point here [2]. [2] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/
Sorry for the late reply! >>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility >>>>> with the 'huge=' mount option. This means that if 'huge=never' is set >>>>> for tmpfs, huge page allocation will still be prohibited (which can >>>>> address Hugh's request?). However, if 'huge=' is not set, we can >>>>> allocate large folios based on the write size. > > So, in order to make tmpfs behave like other filesystems, we need to > allocate large folios by default. Not setting 'huge=' is the same as > setting it to 'huge=never' as per documentation. But 'huge=' is meant to > control THP, not large folios, so it should not have a conflict here, or > else, what case are you thinking? I think we really have to move away from "huge/thp == PMD"; that's a historical artifact. Everything else will simply be inconsistent and confusing in the future -- and I don't see any real need for that. For anonymous memory and anon shmem we managed the transition. (There is a longer writeup from me about this topic, so I won't go into detail.) I think I raised this in the past, but tmpfs/shmem is just like any other file system ... except it sometimes really isn't and behaves much more like (swappable) anonymous memory. (or mlocked files) There are many systems out there that run without swap enabled, or with extremely minimal swap (IIRC until recently kubernetes was completely incompatible with swapping). Swap can even be disabled today for shmem using a mount option. That's a big difference from all other file systems, where you are guaranteed to have backend storage where you can simply evict under memory pressure (might temporarily fail, of course). I *think* that's the reason why we have the "huge=" parameter that also controls the THP allocations during page faults (IOW, possible memory over-allocation). Maybe also because it was a new feature, and we only had a single THP size. There is, of course, also the "fallocate() might not free up memory if there is an unexpected reference on the page because splitting it will fail" problem, which even exists when not over-allocating memory in the first place ... So... I don't think tmpfs behaves like other file systems in some cases. And I don't think ignoring these points is a good idea. Fortunately I don't maintain that code :) If we don't want to go with the shmem_enabled toggles, we should probably still extend the documentation to cover "all THP sizes", like we did elsewhere:

huge=never: no THPs of any size
huge=always: THPs of any size (fault/write/etc)
huge=fadvise: like "always" but only with fadvise/madvise
huge=within_size: like "fadvise" but respect i_size

We could think about adding a "nowaste" extension and try to make it the default. For example:

"huge=always:nowaste": THPs of any size as long as we don't over-allocate memory (write)

The sysfs toggles have their beauty as well and could be useful (I'm pretty sure they will be useful :) ):

"huge=always;sysfs": THPs of any size (fault/write/etc) as configured in sysfs

Too many options here to explore, too little time I have to spend on this. Just to throw out some ideas. What I can really suggest is not making this one of the remaining interfaces where "huge" means "PMD-sized" once other sizes exist. > >>>> >>>> I consider allocating large folios in shmem/tmpfs on the write path less >>>> controversial than allocating them on the page fault path -- especially >>>> as long as we stay within the size to-be-written.
>>>> >>>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g., >>>> shmem_enabled=never). Maybe because of some rather undesired >>>> side-effects (maybe some are historical?): I recall issues with VMs with >>>> THP+ memory ballooning, as we cannot reclaim pages of folios if >>>> splitting fails). I assume most of these problematic use cases don't use >>>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole >>>> thing. >>>> >>>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL >>>> documentation; most documentation is only concerned about anon THP. >>>> Which makes me conclude that they are not suggested as of now. >>>> >>>> I see more issues with allocating them on the page fault path and not >>>> having a way to disable it -- compared to allocating them on the write() >>>> path. >>> >>> I may not understand your issues. IIUC, you can disable allocating huge >>> pages on the page fault path by using the 'huge=never' mount option or >>> setting shmem_enabled=deny. No? >> >> That's what I am saying: if there is some way to disable it that will >> keep working, great. > > I agree. That aligns with what I recall Hugh requested. However, I > believe if that is the way to go, we shouldn't limit it to tmpfs. > Otherwise, why should tmpfs be prevented from allocating large folios if > other filesystems in the system are allowed to allocate them? See above. On systems with little or no swap you might not want them for shmem/tmpfs, but would happily use them elsewhere. The "write() won't waste memory" case is really interesting; the "fallocate cannot free the memory" problem still exists. A shrinker might help. -- Cheers, David / dhildenb
On Fri Oct 25, 2024 at 10:21 PM CEST, David Hildenbrand wrote:
> Sorry for the late reply!
>
> >>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
> >>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
> >>>>> for tmpfs, huge page allocation will still be prohibited (which can
> >>>>> address Hugh's request?). However, if 'huge=' is not set, we can
> >>>>> allocate large folios based on the write size.
> >
> > So, in order to make tmpfs behave like other filesystems, we need to
> > allocate large folios by default. Not setting 'huge=' is the same as
> > setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> > control THP, not large folios, so it should not have a conflict here, or
> > else, what case are you thinking?
>
> I think we really have to move away from "huge/thp == PMD", that's a
> historical artifact. Everything else will simply be inconsistent and
> confusing in the future -- and I don't see any real need for that. For
> anonymous memory and anon shmem we managed the transition. (there is a
> longer writeup from me about this topic, so I won't go into detail).
>
> I think I raised this in the past, but tmpfs/shmem is just like any
> other file system .. except it sometimes really isn't and behaves much
> more like (swappable) anonymous memory. (or mlocked files)
>
> There are many systems out there that run without swap enabled, or with
> extremely minimal swap (IIRC until recently kubernetes was completely
> incompatible with swapping). Swap can even be disabled today for shmem
> using a mount option.
>
> That's a big difference to all other file systems where you are
> guaranteed to have backend storage where you can simply evict under
> memory pressure (might temporarily fail, of course).
>
> I *think* that's the reason why we have the "huge=" parameter that also
> controls the THP allocations during page faults (IOW possible memory
> over-allocation). Maybe also because it was a new feature, and we only
> had a single THP size.
>
> There is, of course, also the "fallocate() might not free up memory if
> there is an unexpected reference on the page because splitting it will
> fail" problem, which even exists when not over-allocating memory in the
> first place ...
>
> So ... I don't think tmpfs behaves like other file systems in some cases.
> And I don't think ignoring these points is a good idea.

Assuming a system without swap, what's the difference you are concerned
about between the current tmpfs allocation method and the large folios
implementation?

>
> Fortunately I don't maintain that code :)
>
> If we don't want to go with the shmem_enabled toggles, we should
> probably still extend the documentation to cover "all THP sizes", like
> we did elsewhere.
>
> huge=never: no THPs of any size
> huge=always: THPs of any size (fault/write/etc)
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size
>
> We could think about adding a "nowaste" extension and try to make it the
> default.
>
> For example
>
> "huge=always:nowaste": THPs of any size as long as we don't over-allocate
> memory (write)

This is the default behaviour in other fs too. I don't think it is
necessary to make it explicit.

>
> The sysfs toggles have their beauty as well and could be useful (I'm
> pretty sure they will be useful :) ):
>
> "huge=always;sysfs": THPs of any size (fault/write/etc) as configured in
> sysfs.
>
> Too many options here to explore, too little time I have to spend on
> this. Just to throw out some ideas.
>
> What I can really suggest is not making this one of the remaining
> interfaces where "huge" means "PMD-sized" once other sizes exist.
>
> >>>>
> >>>> I consider allocating large folios in shmem/tmpfs on the write path less
> >>>> controversial than allocating them on the page fault path -- especially
> >>>> as long as we stay within the size to-be-written.
> >>>>
> >>>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
> >>>> shmem_enabled=never). Maybe because of some rather undesired
> >>>> side-effects (maybe some are historical?): I recall issues with VMs with
> >>>> THP+ memory ballooning, as we cannot reclaim pages of folios if
> >>>> splitting fails). I assume most of these problematic use cases don't use
> >>>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
> >>>> thing.
> >>>>
> >>>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
> >>>> documentation; most documentation is only concerned about anon THP.
> >>>> Which makes me conclude that they are not suggested as of now.
> >>>>
> >>>> I see more issues with allocating them on the page fault path and not
> >>>> having a way to disable it -- compared to allocating them on the write()
> >>>> path.
> >>>
> >>> I may not understand your issues. IIUC, you can disable allocating huge
> >>> pages on the page fault path by using the 'huge=never' mount option or
> >>> setting shmem_enabled=deny. No?
> >>
> >> That's what I am saying: if there is some way to disable it that will
> >> keep working, great.
> >
> > I agree. That aligns with what I recall Hugh requested. However, I
> > believe if that is the way to go, we shouldn't limit it to tmpfs.
> > Otherwise, why should tmpfs be prevented from allocating large folios if
> > other filesystems in the system are allowed to allocate them?
>
> See above. On systems without/little swap you might not want them for
> shmem/tmpfs, but would happily use them elsewhere.
>
> The "write() won't waste memory" case is really interesting, the
> "fallocate cannot free the memory" still exists. A shrinker might help.

The previous large folio allocation implementation was wrong: it was
actually wasting memory by rounding up while trying to find the order.
Matthew already pointed it out [1]. So, with that fixed, we should not
end up wasting memory.

[1] https://lore.kernel.org/all/ZvVQoY8Tn_BNc79T@casper.infradead.org/
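To make the rounding point concrete: for a 5-page (20 KiB) write, rounding the order up picks an order-3 folio (8 pages) and wastes 3 pages, while taking the largest order that fits picks order 2 (4 pages) and wastes none. A small standalone sketch of the two calculations (plain C for illustration only, not the shmem code):

#include <stdio.h>

static unsigned int order_round_up(unsigned long npages)
{
        unsigned int order = 0;

        while ((1UL << order) < npages)         /* smallest order covering npages */
                order++;
        return order;
}

static unsigned int order_round_down(unsigned long npages)
{
        unsigned int order = 0;

        while ((1UL << (order + 1)) <= npages)  /* largest order <= npages */
                order++;
        return order;
}

int main(void)
{
        unsigned long npages = 5;       /* a 20KiB write with 4KiB pages */

        printf("round up:   order %u -> %lu pages (%lu wasted)\n",
               order_round_up(npages), 1UL << order_round_up(npages),
               (1UL << order_round_up(npages)) - npages);
        printf("round down: order %u -> %lu pages (0 wasted)\n",
               order_round_down(npages), 1UL << order_round_down(npages));
        return 0;
}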
On 28.10.24 22:56, Daniel Gomez wrote:
> On Fri Oct 25, 2024 at 10:21 PM CEST, David Hildenbrand wrote:
>> Sorry for the late reply!
>>
>>>>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>>>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>>>> allocate large folios based on the write size.
>>>
>>> So, in order to make tmpfs behave like other filesystems, we need to
>>> allocate large folios by default. Not setting 'huge=' is the same as
>>> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
>>> control THP, not large folios, so it should not have a conflict here, or
>>> else, what case are you thinking?
>>
>> I think we really have to move away from "huge/thp == PMD", that's a
>> historical artifact. Everything else will simply be inconsistent and
>> confusing in the future -- and I don't see any real need for that. For
>> anonymous memory and anon shmem we managed the transition. (there is a
>> longer writeup from me about this topic, so I won't go into detail).
>>
>> I think I raised this in the past, but tmpfs/shmem is just like any
>> other file system .. except it sometimes really isn't and behaves much
>> more like (swappable) anonymous memory. (or mlocked files)
>>
>> There are many systems out there that run without swap enabled, or with
>> extremely minimal swap (IIRC until recently kubernetes was completely
>> incompatible with swapping). Swap can even be disabled today for shmem
>> using a mount option.
>>
>> That's a big difference to all other file systems where you are
>> guaranteed to have backend storage where you can simply evict under
>> memory pressure (might temporarily fail, of course).
>>
>> I *think* that's the reason why we have the "huge=" parameter that also
>> controls the THP allocations during page faults (IOW possible memory
>> over-allocation). Maybe also because it was a new feature, and we only
>> had a single THP size.
>>
>> There is, of course, also the "fallocate() might not free up memory if
>> there is an unexpected reference on the page because splitting it will
>> fail" problem, which even exists when not over-allocating memory in the
>> first place ...
>>
>> So ... I don't think tmpfs behaves like other file systems in some cases.
>> And I don't think ignoring these points is a good idea.
>
> Assuming a system without swap, what's the difference you are concerned
> about between the current tmpfs allocation method and the large folios
> implementation?

As raised above, there is the interesting interaction between
fallocate(FALLOC_FL_PUNCH_HOLE) and raised refcounts, where we can fail
to reclaim memory:

shmem_fallocate()->shmem_truncate_range()->truncate_inode_pages_range()->truncate_inode_partial_folio()

It's better than it was in the past -- in the past we didn't even try
splitting, but today splitting can still fail, and we'll never try
reclaiming that memory again later.

This is very different from anonymous memory, where we have the deferred
split queue and remember which pages were zapped implicitly using the
page tables (instead of zeroing them out and not freeing up the memory).
It's one of the issues people ran into when using THP+shmem for backing
guest VMs along with memory ballooning.
For that reason, the recommendation still is to disable THP when using shmem for backing guest VMs and relying on memory overcommit optimizations such as memory balloon inflation. > >> >> Fortunately I don't maintain that code :) >> >> >> If we don't want to go with the shmem_enabled toggles, we should >> probably still extend the documentation to cover "all THP sizes", like >> we did elsewhere. >> >> huge=never: no THPs of any size >> huge=always: THPs of any size (fault/write/etc) >> huge=fadvise: like "always" but only with fadvise/madvise >> huge=within_size: like "fadvise" but respect i_size >> >> We could think about adding a "nowaste" extension and try make it the >> default. >> >> For example >> >> "huge=always:nowaste: THPs of any size as long as we don't over-allocate >> memory (write)" > > This is the default behaviour in other fs too. I don't think is > necessary to make it explicit. Please keep in mind that allocating THPs of different size during *page faults* will have to fit into the whole picture we are creating here. That's also what "huge=always" controls for shmem today IIRC. >>> >>>>>> >>>>>> I consider allocating large folios in shmem/tmpfs on the write path less >>>>>> controversial than allocating them on the page fault path -- especially >>>>>> as long as we stay within the size to-be-written. >>>>>> >>>>>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g., >>>>>> shmem_enabled=never). Maybe because of some rather undesired >>>>>> side-effects (maybe some are historical?): I recall issues with VMs with >>>>>> THP+ memory ballooning, as we cannot reclaim pages of folios if >>>>>> splitting fails). I assume most of these problematic use cases don't use >>>>>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole >>>>>> thing. >>>>>> >>>>>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL >>>>>> documentation; most documentation is only concerned about anon THP. >>>>>> Which makes me conclude that they are not suggested as of now. >>>>>> >>>>>> I see more issues with allocating them on the page fault path and not >>>>>> having a way to disable it -- compared to allocating them on the write() >>>>>> path. >>>>> >>>>> I may not understand your issues. IIUC, you can disable allocating huge >>>>> pages on the page fault path by using the 'huge=never' mount option or >>>>> setting shmem_enabled=deny. No? >>>> >>>> That's what I am saying: if there is some way to disable it that will >>>> keep working, great. >>> >>> I agree. That aligns with what I recall Hugh requested. However, I >>> believe if that is the way to go, we shouldn't limit it to tmpfs. >>> Otherwise, why should tmpfs be prevented from allocating large folios if >>> other filesystems in the system are allowed to allocate them? >> >> See above. On systems without/little swap you might not want them for >> shmem/tmpfs, but would happily use them elsewhere. >> >> The "write() won't waste memory" case is really interesting, the >> "fallocate cannot free the memory" still exists. A shrinker might help. > > The previous implementation with large folios allocation was wrong > and was actually wasting memory by rounding up while trying to find > the order. Matthew already pointed it out [1]. So, with that fixed, we > should not end up wasting memory. Again, we should have a clear path forward how we deal with page faults and how this fits into the picture. -- Cheers, David / dhildenb
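For reference, the hole-punching operation at the center of this concern is plain fallocate(2); whether the punched range actually frees memory depends on the kernel being able to split the backing large folio, which is exactly the failure mode described above. A userspace sketch (assuming /dev/shm is a tmpfs mount; error handling trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 4UL << 20;
        int fd = open("/dev/shm/punch-demo", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        memset(p, 0xaa, len);   /* populate; may be backed by large folios */

        /* Free the middle 2MiB; PUNCH_HOLE must be paired with KEEP_SIZE.
         * If a large folio spans the range and cannot be split (e.g. due
         * to a transient extra reference), that memory may stay allocated. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      1UL << 20, 2UL << 20))
                perror("fallocate");

        munmap(p, len);
        close(fd);
        unlink("/dev/shm/punch-demo");
        return 0;
}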
On 25.10.24 22:21, David Hildenbrand wrote: > Sorry for the late reply! > >>>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility >>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set >>>>>> for tmpfs, huge page allocation will still be prohibited (which can >>>>>> address Hugh's request?). However, if 'huge=' is not set, we can >>>>>> allocate large folios based on the write size. >> >> So, in order to make tmpfs behave like other filesystems, we need to >> allocate large folios by default. Not setting 'huge=' is the same as >> setting it to 'huge=never' as per documentation. But 'huge=' is meant to >> control THP, not large folios, so it should not have a conflict here, or >> else, what case are you thinking? > > I think we really have to move away from "huge/thp == PMD", that's a > historical artifact. Everything else will simply be inconsistent and > confusing in the future -- and I don't see any real need for that. For > anonymous memory and anon shmem we managed the transition. (there is a > longer writeup from me about this topic, so I won't go into detail). > > > I think I raised this in the past, but tmpfs/shmem is just like any > other file system .. except it sometimes really isn't and behaves much > more like (swappable) anonymous memory. (or mlocked files) > > There are many systems out there that run without swap enabled, or with > extremely minimal swap (IIRC until recently kubernetes was completely > incompatible with swapping). Swap can even be disabled today for shmem > using a mount option. > > That's a big difference to all other file systems where you are > guaranteed to have backend storage where you can simply evict under > memory pressure (might temporarily fail, of course). > > I *think* that's the reason why we have the "huge=" parameter that also > controls the THP allocations during page faults (IOW possible memory > over-allocation). Maybe also because it was a new feature, and we only > had a single THP size. > > There is, of course also the "fallocate() might not free up memory if > there is an unexpected reference on the page because splitting it will > fail" problem, that even exists when not over-allocating memory in the > first place ... > > > So ...I don't think tmpfs behaves like other file system in some cases. > And I don't think ignoring these points is a good idea. > > Fortunately I don't maintain that code :) > > > If we don't want to go with the shmem_enabled toggles, we should > probably still extend the documentation to cover "all THP sizes", like > we did elsewhere. > > huge=never: no THPs of any size > huge=always: THPs of any size (fault/write/etc) > huge=fadvise: like "always" but only with fadvise/madvise > huge=within_size: like "fadvise" but respect i_size Thinking some more about that over the weekend, this is likely the way to go, paired with conditionally changing the default to always/within_size. I suggest a kconfig option for that. That should probably do as a first shot; I assume people will want more control over which size to use, especially during page faults, but that can likely be added later. -- Cheers, David / dhildenb
Sorry for late reply. On 2024/10/28 17:48, David Hildenbrand wrote: > On 25.10.24 22:21, David Hildenbrand wrote: >> Sorry for the late reply! >> >>>>>>> IMHO, as I discussed with Kirill, we still need maintain >>>>>>> compatibility >>>>>>> with the 'huge=' mount option. This means that if 'huge=never' is >>>>>>> set >>>>>>> for tmpfs, huge page allocation will still be prohibited (which can >>>>>>> address Hugh's request?). However, if 'huge=' is not set, we can >>>>>>> allocate large folios based on the write size. >>> >>> So, in order to make tmpfs behave like other filesystems, we need to >>> allocate large folios by default. Not setting 'huge=' is the same as >>> setting it to 'huge=never' as per documentation. But 'huge=' is meant to >>> control THP, not large folios, so it should not have a conflict here, or >>> else, what case are you thinking? >> >> I think we really have to move away from "huge/thp == PMD", that's a >> historical artifact. Everything else will simply be inconsistent and >> confusing in the future -- and I don't see any real need for that. For >> anonymous memory and anon shmem we managed the transition. (there is a >> longer writeup from me about this topic, so I won't go into detail). >> >> >> I think I raised this in the past, but tmpfs/shmem is just like any >> other file system .. except it sometimes really isn't and behaves much >> more like (swappable) anonymous memory. (or mlocked files) >> >> There are many systems out there that run without swap enabled, or with >> extremely minimal swap (IIRC until recently kubernetes was completely >> incompatible with swapping). Swap can even be disabled today for shmem >> using a mount option. >> >> That's a big difference to all other file systems where you are >> guaranteed to have backend storage where you can simply evict under >> memory pressure (might temporarily fail, of course). >> >> I *think* that's the reason why we have the "huge=" parameter that also >> controls the THP allocations during page faults (IOW possible memory >> over-allocation). Maybe also because it was a new feature, and we only >> had a single THP size. >> >> There is, of course also the "fallocate() might not free up memory if >> there is an unexpected reference on the page because splitting it will >> fail" problem, that even exists when not over-allocating memory in the >> first place ... >> >> >> So ...I don't think tmpfs behaves like other file system in some cases. >> And I don't think ignoring these points is a good idea. >> >> Fortunately I don't maintain that code :) >> >> >> If we don't want to go with the shmem_enabled toggles, we should >> probably still extend the documentation to cover "all THP sizes", like >> we did elsewhere. >> >> huge=never: no THPs of any size >> huge=always: THPs of any size (fault/write/etc) >> huge=fadvise: like "always" but only with fadvise/madvise >> huge=within_size: like "fadvise" but respect i_size > > Thinking some more about that over the weekend, this is likely the way > to go, paired with conditionally changing the default to > always/within_size. I suggest a kconfig option for that. I am still worried about adding a new kconfig option, which might complicate the tmpfs controls further. > That should probably do as a first shot; I assume people will want more > control over which size to use, especially during page faults, but that > can likely be added later. 
After some discussions, I think the first step is to achieve two goals:
1) Try to make tmpfs use large folios like other file systems, which
means we should avoid adding more complex control options (per Matthew).
2) Still maintain compatibility with the 'huge=' mount option (per
Kirill); I also remember we have customers who use 'huge=within_size'
to allocate THPs for better performance.

Based on these considerations, my first step is to neither add a new
'huge=' option parameter nor introduce the mTHP interface controls for
tmpfs, but rather to change the default huge allocation behavior for
tmpfs. That is to say, when the 'huge=' option is not configured, we
will allow huge folio allocation based on the write size. As a result,
the behavior of huge pages for tmpfs will change as follows:

no 'huge=' set: can allocate huge folios of any size, based on the write size
huge=never: no huge folios of any size
huge=always: only PMD-sized THP allocation, as before
huge=fadvise: like "always" but only with fadvise/madvise
huge=within_size: like "fadvise" but respect i_size

The next step is to continue discussing whether to add a new Kconfig
option or FADV_* in the future. So what do you think?
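For illustration, the per-mount behaviors listed above would be selected the same way the existing 'huge=' option is, e.g. via the data string of mount(2). A hypothetical sketch (mount points are made up and must already exist; the calls need CAP_SYS_ADMIN):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* No 'huge=' given: under the proposal, large folios are sized
         * from the write length. */
        if (mount("tmpfs", "/mnt/t1", "tmpfs", 0, "size=64M"))
                perror("mount /mnt/t1");

        /* Explicit 'huge=always': PMD-sized THP allocation, as before. */
        if (mount("tmpfs", "/mnt/t2", "tmpfs", 0, "size=64M,huge=always"))
                perror("mount /mnt/t2");

        /* Opt out of huge folios of any size. */
        if (mount("tmpfs", "/mnt/t3", "tmpfs", 0, "size=64M,huge=never"))
                perror("mount /mnt/t3");

        return 0;
}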
>>>
>>> If we don't want to go with the shmem_enabled toggles, we should
>>> probably still extend the documentation to cover "all THP sizes", like
>>> we did elsewhere.
>>>
>>> huge=never: no THPs of any size
>>> huge=always: THPs of any size (fault/write/etc)
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>
>> Thinking some more about that over the weekend, this is likely the way
>> to go, paired with conditionally changing the default to
>> always/within_size. I suggest a kconfig option for that.
>
> I am still worried about adding a new kconfig option, which might
> complicate the tmpfs controls further.

Why exactly?

If we are changing a default similar to
CONFIG_TRANSPARENT_HUGEPAGE_NEVER -> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS,
it would make perfect sense to give people building a kernel control
over that.

If we want to support this feature in a distro kernel like RHEL we'll
have to leave the default unmodified. Otherwise I see no way (excluding
downstream-only hacks) to backport this into distro kernels.

>
>> That should probably do as a first shot; I assume people will want more
>> control over which size to use, especially during page faults, but that
>> can likely be added later.

I know, it puts you in a bad position because there are different
opinions floating around. But let's try to find something that is
reasonable and still acceptable. And let's hope that Hugh will voice an
opinion :D

>
> After some discussions, I think the first step is to achieve two goals:
> 1) Try to make tmpfs use large folios like other file systems, which
> means we should avoid adding more complex control options (per Matthew).
> 2) Still maintain compatibility with the 'huge=' mount option (per
> Kirill); I also remember we have customers who use 'huge=within_size'
> to allocate THPs for better performance.
>
> Based on these considerations, my first step is to neither add a new
> 'huge=' option parameter nor introduce the mTHP interface controls for
> tmpfs, but rather to change the default huge allocation behavior for
> tmpfs. That is to say, when the 'huge=' option is not configured, we
> will allow huge folio allocation based on the write size. As a result,
> the behavior of huge pages for tmpfs will change as follows:
>
> no 'huge=' set: can allocate huge folios of any size, based on the write size
> huge=never: no huge folios of any size
> huge=always: only PMD-sized THP allocation, as before
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size

I don't like that:

(a) there is no way to explicitly enable/name that new behavior.
(b) "always" etc. are only concerned about PMDs.

So again, I suggest:

huge=never: No THPs of any size
huge=always: THPs of any size
huge=fadvise: like "always" but only with fadvise/madvise
huge=within_size: like "fadvise" but respect i_size

"huge=" default depends on a Kconfig option.

With that we:

(1) Maximize the cases where we will use large folios of any size
(which Willy cares about).
(2) Have a way to disable them completely (which I care about).
(3) Allow distros to keep the default unchanged.

Likely, for now we will only try allocating PMD-sized THPs during page
faults, and allocate different sizes only during write(). So the effect
for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
completely unchanged even with "huge=always".
It will get more tricky once we change that behavior as well, but that's
likely something to figure out on a different day, if it turns out to be
a real problem :)

I really preferred using the sysfs toggles (as discussed with Hugh in
the meeting back then), but I can also understand why we at least want
to try making tmpfs behave more like other file systems. But I'm a bit
more careful to not ignore the cases where it really isn't like any
other file system.

If we start making PMD-sized THPs special in any non-configurable way,
then we are effectively *worse* off than if we allowed configuring them
properly. So if someone voices "but we want only PMD-sized ones", the
next one will say "but we only want cont-pte-sized ones", and then we
should provide an option to control the actual sizes to use differently,
in some way. But let's see if that is even required.

-- Cheers, David / dhildenb
On 2024/10/31 16:53, David Hildenbrand wrote:
>>>>
>>>> If we don't want to go with the shmem_enabled toggles, we should
>>>> probably still extend the documentation to cover "all THP sizes", like
>>>> we did elsewhere.
>>>>
>>>> huge=never: no THPs of any size
>>>> huge=always: THPs of any size (fault/write/etc)
>>>> huge=fadvise: like "always" but only with fadvise/madvise
>>>> huge=within_size: like "fadvise" but respect i_size
>>>
>>> Thinking some more about that over the weekend, this is likely the way
>>> to go, paired with conditionally changing the default to
>>> always/within_size. I suggest a kconfig option for that.
>>
>> I am still worried about adding a new kconfig option, which might
>> complicate the tmpfs controls further.
>
> Why exactly?

There will be more options to control huge page allocation for tmpfs,
which may confuse users and make life harder. Yes, we can add some
documentation, but I'm still a bit cautious about this.

> If we are changing a default similar to
> CONFIG_TRANSPARENT_HUGEPAGE_NEVER -> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS,
> it would make perfect sense to give people building a kernel control
> over that.
>
> If we want to support this feature in a distro kernel like RHEL we'll
> have to leave the default unmodified. Otherwise I see no way (excluding
> downstream-only hacks) to backport this into distro kernels.
>
>>> That should probably do as a first shot; I assume people will want more
>>> control over which size to use, especially during page faults, but that
>>> can likely be added later.
>
> I know, it puts you in a bad position because there are different
> opinions floating around. But let's try to find something that is
> reasonable and still acceptable. And let's hope that Hugh will voice an
> opinion :D

Yes, I am also waiting to see if Hugh has any inputs :)

>> After some discussions, I think the first step is to achieve two goals:
>> 1) Try to make tmpfs use large folios like other file systems, which
>> means we should avoid adding more complex control options (per Matthew).
>> 2) Still maintain compatibility with the 'huge=' mount option (per
>> Kirill); I also remember we have customers who use 'huge=within_size'
>> to allocate THPs for better performance.
>>
>> Based on these considerations, my first step is to neither add a new
>> 'huge=' option parameter nor introduce the mTHP interface controls for
>> tmpfs, but rather to change the default huge allocation behavior for
>> tmpfs. That is to say, when the 'huge=' option is not configured, we
>> will allow huge folio allocation based on the write size. As a result,
>> the behavior of huge pages for tmpfs will change as follows:
>>
>> no 'huge=' set: can allocate huge folios of any size, based on the write size
>> huge=never: no huge folios of any size
>> huge=always: only PMD-sized THP allocation, as before
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
>
> I don't like that:
>
> (a) there is no way to explicitly enable/name that new behavior.

But this is similar to other file systems that enable large folios
(setting mapping_set_large_folios()), and I haven't seen any other file
system supporting large folios require a new Kconfig. Maybe tmpfs is
a bit special?

If we all agree that tmpfs is a bit special when using huge pages, then
fine, a Kconfig option might be needed.

> (b) "always" etc. are only concerned about PMDs.

Yes, currently we maintain the same semantics as before, in case users
still expect THPs.

> So again, I suggest:
>
> huge=never: No THPs of any size
> huge=always: THPs of any size
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size
>
> "huge=" default depends on a Kconfig option.
>
> With that we:
>
> (1) Maximize the cases where we will use large folios of any size
> (which Willy cares about).
> (2) Have a way to disable them completely (which I care about).
> (3) Allow distros to keep the default unchanged.
>
> Likely, for now we will only try allocating PMD-sized THPs during page
> faults, and allocate different sizes only during write(). So the effect
> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
> completely unchanged even with "huge=always".
>
> It will get more tricky once we change that behavior as well, but that's
> likely something to figure out on a different day, if it turns out to be
> a real problem :)
>
> I really preferred using the sysfs toggles (as discussed with Hugh in
> the meeting back then), but I can also understand why we at least want
> to try making tmpfs behave more like other file systems. But I'm a bit
> more careful to not ignore the cases where it really isn't like any
> other file system.

That's also my previous thought, but Matthew is strongly against that.
Let's take it step by step.

> If we start making PMD-sized THPs special in any non-configurable way,
> then we are effectively *worse* off than if we allowed configuring them
> properly. So if someone voices "but we want only PMD-sized ones", the
> next one will say "but we only want cont-pte-sized ones", and then we
> should provide an option to control the actual sizes to use differently,
> in some way. But let's see if that is even required.

Yes, I agree. So what I am thinking is, the 'huge=' option should be
gradually deprecated in the future, and eventually tmpfs can allocate
large folios of any size by default.
>>> I am still worried about adding a new kconfig option, which might
>>> complicate the tmpfs controls further.
>>
>> Why exactly?
>
> There will be more options to control huge page allocation for tmpfs,
> which may confuse users and make life harder. Yes, we can add some
> documentation, but I'm still a bit cautious about this.

If it's just changing the default from "huge=never" to "huge=X", I don't
see a big problem here. Again, we already do that for anon THPs.

If we make more behavior depend on that (which I don't think we should
be doing), I agree that it would be more controversial.

[..]

>>>
>>>> That should probably do as a first shot; I assume people will want more
>>>> control over which size to use, especially during page faults, but that
>>>> can likely be added later.
>>
>> I know, it puts you in a bad position because there are different
>> opinions floating around. But let's try to find something that is
>> reasonable and still acceptable. And let's hope that Hugh will voice an
>> opinion :D
>
> Yes, I am also waiting to see if Hugh has any inputs :)

We keep saying that ... I have to find a way to summon him :)

>
>>> After some discussions, I think the first step is to achieve two goals:
>>> 1) Try to make tmpfs use large folios like other file systems, which
>>> means we should avoid adding more complex control options (per Matthew).
>>> 2) Still maintain compatibility with the 'huge=' mount option (per
>>> Kirill); I also remember we have customers who use 'huge=within_size'
>>> to allocate THPs for better performance.
>>>
>>> Based on these considerations, my first step is to neither add a new
>>> 'huge=' option parameter nor introduce the mTHP interface controls for
>>> tmpfs, but rather to change the default huge allocation behavior for
>>> tmpfs. That is to say, when the 'huge=' option is not configured, we
>>> will allow huge folio allocation based on the write size. As a result,
>>> the behavior of huge pages for tmpfs will change as follows:
>>>
>>> no 'huge=' set: can allocate huge folios of any size, based on the write size
>>> huge=never: no huge folios of any size
>>> huge=always: only PMD-sized THP allocation, as before
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>
>> I don't like that:
>>
>> (a) there is no way to explicitly enable/name that new behavior.
>
> But this is similar to other file systems that enable large folios
> (setting mapping_set_large_folios()), and I haven't seen any other file
> system supporting large folios require a new Kconfig. Maybe tmpfs is
> a bit special?

I'm afraid I don't have the energy to explain once more why I think
tmpfs is not just like any other file system in some cases.

And distributions are rather careful when it comes to something like
this ...

>
> If we all agree that tmpfs is a bit special when using huge pages, then
> fine, a Kconfig option might be needed.
>
>> (b) "always" etc. are only concerned about PMDs.
>
> Yes, currently we maintain the same semantics as before, in case users
> still expect THPs.

Again, I don't think it is a reasonable approach to make PMD-sized ones
special here. It will all get seriously confusing and inconsistent.

THPs are opportunistic after all, and page fault behavior will remain
unchanged (PMD-sized) for now. And even if we support other sizes during
page faults, we'd likely start with the largest size (PMD-size) first,
and it might just all work better than before.
Happy to learn where this really makes a difference. Of course, if you change the default behavior (which you are planning), it's ... a changed default. If there are reasons to have more tunables regarding the sizes to use, then it should not be limited to PMD-size. > >> So again, I suggest: >> >> huge=never: No THPs of any size >> huge=always: THPs of any size >> huge=fadvise: like "always" but only with fadvise/madvise >> huge=within_size: like "fadvise" but respect i_size >> >> "huge=" default depends on a Kconfig option. >> >> With that we: >> >> (1) Maximize the cases where we will use large folios of any sizes >> (which Willy cares about). >> (2) Have a way to disable them completely (which I care about). >> (3) Allow distros to keep the default unchanged. >> >> Likely, for now we will only try allocating PMD-sized THPs during page >> faults, and allocate different sizes only during write(). So the effect >> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be >> completely unchanged even with "huge=always". >> >> It will get more tricky once we change that behavior as well, but that's >> something to likely figure out if it is a real problem at at different >> day :) >> >> >> I really preferred using the sysfs toggles (as discussed with Hugh in >> the meeting back then), but I can also understand why we at least want >> to try making tmpfs behave more like other file systems. But I'm a bit >> more careful to not ignore the cases where it really isn't like any >> other file system. > > That's also my previous thought, but Matthew is strongly against that. > Let's step by step. Yes, I understand his view as well. But I won't blindly agree to the "tmpfs is just like any other file system" opinion :) > >> If we start making PMD-sized THPs special in any non-configurable way, >> then we are effectively off *worse* than allowing to configure them >> properly. So if someone voices "but we want only PMD-sized" ones, the >> next one will say "but we only want cont-pte sized-ones" and then we >> should provide an option to control the actual sizes to use differently, >> in some way. But let's see if that is even required. > > Yes, I agree. So what I am thinking is, the 'huge=' option should be > gradually deprecated in the future and eventually tmpfs can allocate any > size large folios as default. Let's be realistic, it won't get removed any time soon. ;) So changing "huge=always" etc. semantics to reflect our new size options, and then try changing the default (with the option for people/distros to have the old default) is a reasonable approach, at least to me. I'm trying to stay open-minded here, but the proposal I heard so far is not particularly appealing. -- Cheers, David / dhildenb
On 2024/10/31 18:46, David Hildenbrand wrote:
[snip]

>>> I don't like that:
>>>
>>> (a) there is no way to explicitly enable/name that new behavior.
>>
>> But this is similar to other file systems that enable large folios
>> (setting mapping_set_large_folios()), and I haven't seen any other file
>> system supporting large folios require a new Kconfig. Maybe tmpfs is
>> a bit special?
>
> I'm afraid I don't have the energy to explain once more why I think
> tmpfs is not just like any other file system in some cases.
>
> And distributions are rather careful when it comes to something like
> this ...
>
>> If we all agree that tmpfs is a bit special when using huge pages, then
>> fine, a Kconfig option might be needed.
>>
>>> (b) "always" etc. are only concerned about PMDs.
>>
>> Yes, currently we maintain the same semantics as before, in case users
>> still expect THPs.
>
> Again, I don't think it is a reasonable approach to make PMD-sized ones
> special here. It will all get seriously confusing and inconsistent.

I agree PMD-sized should not be special. This is all for backward
compatibility with the ‘huge=’ mount option, and adding a new kconfig is
also for this purpose.

> THPs are opportunistic after all, and page fault behavior will remain
> unchanged (PMD-sized) for now. And even if we support other sizes during
> page faults, we'd likely start with the largest size (PMD-size) first,
> and it might just all work better than before.
>
> Happy to learn where this really makes a difference.
>
> Of course, if you change the default behavior (which you are planning),
> it's ... a changed default.
>
> If there are reasons to have more tunables regarding the sizes to use,
> then it should not be limited to PMD-size.

I have tried to modify the code according to your suggestion (not tested
yet). Is this what you had in mind?

static inline unsigned int
shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
                         loff_t write_end)
{
        unsigned int order;
        size_t size;

        if (!mapping_large_folio_support(mapping) || !write_end)
                return 0;

        /* Calculate the write size based on the write_end */
        size = write_end - (index << PAGE_SHIFT);
        order = filemap_get_order(size);
        if (!order)
                return 0;

        /* If we're not aligned, allocate a smaller folio */
        if (index & ((1UL << order) - 1))
                order = __ffs(index);

        order = min_t(size_t, order, MAX_PAGECACHE_ORDER);

        /* Return a bitmap of all allowable orders up to 'order' */
        return order > 0 ? BIT(order + 1) - 1 : 0;
}

static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
                                              loff_t write_end, bool shmem_huge_force,
                                              unsigned long vm_flags)
{
        bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
        unsigned long within_size_orders;
        unsigned int order;
        loff_t i_size;

        if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
                return 0;
        if (!S_ISREG(inode->i_mode))
                return 0;
        if (shmem_huge == SHMEM_HUGE_DENY)
                return 0;
        if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
                return BIT(HPAGE_PMD_ORDER);

        switch (SHMEM_SB(inode->i_sb)->huge) {
        case SHMEM_HUGE_NEVER:
                return 0;
        case SHMEM_HUGE_ALWAYS:
                if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                        return BIT(HPAGE_PMD_ORDER);

                return shmem_mapping_size_order(inode->i_mapping, index,
                                                write_end);
        case SHMEM_HUGE_WITHIN_SIZE:
                if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                        within_size_orders = BIT(HPAGE_PMD_ORDER);
                else
                        within_size_orders =
                                shmem_mapping_size_order(inode->i_mapping,
                                                         index, write_end);

                order = highest_order(within_size_orders);
                while (within_size_orders) {
                        index = round_up(index + 1, 1 << order);
                        i_size = max(write_end, i_size_read(inode));
                        i_size = round_up(i_size, PAGE_SIZE);
                        if (i_size >> PAGE_SHIFT >= index)
                                return within_size_orders;

                        order = next_order(&within_size_orders, order);
                }
                fallthrough;
        case SHMEM_HUGE_ADVISE:
                if (vm_flags & VM_HUGEPAGE) {
                        if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                                return BIT(HPAGE_PMD_ORDER);

                        return shmem_mapping_size_order(inode->i_mapping,
                                                        index, write_end);
                }
                fallthrough;
        default:
                return 0;
        }
}

1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep ‘huge=’
mount option compatibility.
2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
get the possible huge orders based on the write size.
3) For tmpfs mmap() faults, always use the PMD-sized huge order.
4) For shmem, ignore the write size logic and always use PMD-sized THP
to check whether global huge is enabled.

However, in case 2), if 'huge=always' is set and the write size is less
than 4K, we will allocate small pages -- doesn't that break the 'huge'
semantics? Maybe it's not something to worry too much about.

>>> huge=never: No THPs of any size
>>> huge=always: THPs of any size
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>>
>>> "huge=" default depends on a Kconfig option.
>>>
>>> With that we:
>>>
>>> (1) Maximize the cases where we will use large folios of any size
>>> (which Willy cares about).
>>> (2) Have a way to disable them completely (which I care about).
>>> (3) Allow distros to keep the default unchanged.
>>>
>>> Likely, for now we will only try allocating PMD-sized THPs during page
>>> faults, and allocate different sizes only during write(). So the effect
>>> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
>>> completely unchanged even with "huge=always".
>>>
>>> It will get more tricky once we change that behavior as well, but that's
>>> likely something to figure out on a different day, if it turns out to be
>>> a real problem :)
>>>
>>> I really preferred using the sysfs toggles (as discussed with Hugh in
>>> the meeting back then), but I can also understand why we at least want
>>> to try making tmpfs behave more like other file systems. But I'm a bit
>>> more careful to not ignore the cases where it really isn't like any
>>> other file system.
>>
>> That's also my previous thought, but Matthew is strongly against that.
>> Let's take it step by step.
>
> Yes, I understand his view as well.
>
> But I won't blindly agree to the "tmpfs is just like any other file
> system" opinion :)
>
>>> If we start making PMD-sized THPs special in any non-configurable way,
>>> then we are effectively *worse* off than if we allowed configuring them
>>> properly. So if someone voices "but we want only PMD-sized ones", the
>>> next one will say "but we only want cont-pte-sized ones", and then we
>>> should provide an option to control the actual sizes to use differently,
>>> in some way. But let's see if that is even required.
>>
>> Yes, I agree. So what I am thinking is, the 'huge=' option should be
>> gradually deprecated in the future, and eventually tmpfs can allocate
>> large folios of any size by default.
>
> Let's be realistic, it won't get removed any time soon. ;)
>
> So changing "huge=always" etc. semantics to reflect our new size
> options, and then try changing the default (with the option for
> people/distros to have the old default) is a reasonable approach, at
> least to me.
>
> I'm trying to stay open-minded here, but the proposal I heard so far is
> not particularly appealing.
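As a side note on the draft's return-value convention: shmem_mapping_size_order() returns a bitmap of allowable orders, BIT(order + 1) - 1, not a single order. The standalone mirror below (filemap_get_order(), __ffs() and the other kernel helpers re-derived in plain C, purely for illustration) shows the effect of the alignment clamp on which orders a write enables:

#include <stdio.h>

#define PAGE_SHIFT 12

/* Largest order whose folio still fits in 'size' (floor(log2)), which is
 * how filemap_get_order() in the series is assumed to behave. */
static unsigned int fit_order(size_t size)
{
        unsigned int order = 0;

        while ((1UL << (order + 1)) <= (size >> PAGE_SHIFT))
                order++;
        return order;
}

static unsigned long size_orders(unsigned long index, long long write_end)
{
        size_t size = write_end - (index << PAGE_SHIFT);
        unsigned int order = fit_order(size);

        if (!order)
                return 0;
        if (index & ((1UL << order) - 1))       /* unaligned index: clamp */
                order = __builtin_ctzl(index);  /* __ffs() equivalent */
        return order > 0 ? (1UL << (order + 1)) - 1 : 0;
}

int main(void)
{
        /* aligned 64KiB write at index 0: orders 0..4 allowed (0x1f) */
        printf("%#lx\n", size_orders(0, 16LL << PAGE_SHIFT));
        /* same size at index 4: alignment clamps to order 2 (0x7) */
        printf("%#lx\n", size_orders(4, 20LL << PAGE_SHIFT));
        return 0;
}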
On 05.11.24 13:45, Baolin Wang wrote: > > > On 2024/10/31 18:46, David Hildenbrand wrote: > [snip] > >>>> I don't like that: >>>> >>>> (a) there is no way to explicitly enable/name that new behavior. >>> >>> But this is similar to other file systems that enable large folios >>> (setting mapping_set_large_folios()), and I haven't seen any other file >>> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is >>> a bit special? >> >> I'm afraid I don't have the energy to explain once more why I think >> tmpfs is not just like any other file system in some cases. >> >> And distributions are rather careful when it comes to something like >> this ... >> >>> >>> If we all agree that tmpfs is a bit special when using huge pages, then >>> fine, a Kconfig option might be needed. >>> >>>> (b) "always" etc. are only concerned about PMDs. >>> >>> Yes, currently maintain the same semantics as before, in case users >>> still expect THPs. >> >> Again, I don't think that is a reasonable approach to make PMD-sized >> ones special here. It will all get seriously confusing and inconsistent. > > I agree PMD-sized should not be special. This is all for backward > compatibility with the ‘huge=’ mount option, and adding a new kconfig is > also for this purpose. > >> THPs are opportunistic after all, and page fault behavior will remain >> unchanged (PMD-sized) for now. And even if we support other sizes during >> page faults, we'd like start with the largest size (PMD-size) first, and >> it likely might just all work better than before. >> >> Happy to learn where this really makes a difference. >> >> Of course, if you change the default behavior (which you are planning), >> it's ... a changed default. >> >> If there are reasons to have more tunables regarding the sizes to use, >> then it should not be limited to PMD-size. > > I have tried to modify the code according to your suggestion (not tested > yet). These are what you had in mind? > > static inline unsigned int > shmem_mapping_size_order(struct address_space *mapping, pgoff_t index, > loff_t write_end) > { > unsigned int order; > size_t size; > > if (!mapping_large_folio_support(mapping) || !write_end) > return 0; > > /* Calculate the write size based on the write_end */ > size = write_end - (index << PAGE_SHIFT); > order = filemap_get_order(size); > if (!order) > return 0; > > /* If we're not aligned, allocate a smaller folio */ > if (index & ((1UL << order) - 1)) > order = __ffs(index); > > order = min_t(size_t, order, MAX_PAGECACHE_ORDER); > return order > 0 ? 
> BIT(order + 1) - 1 : 0;
> }
>
> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>                                               loff_t write_end, bool shmem_huge_force,
>                                               unsigned long vm_flags)
> {
>         bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
>         unsigned long within_size_orders;
>         unsigned int order;
>         loff_t i_size;
>
>         if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
>                 return 0;
>         if (!S_ISREG(inode->i_mode))
>                 return 0;
>         if (shmem_huge == SHMEM_HUGE_DENY)
>                 return 0;
>         if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>                 return BIT(HPAGE_PMD_ORDER);
>
>         switch (SHMEM_SB(inode->i_sb)->huge) {
>         case SHMEM_HUGE_NEVER:
>                 return 0;
>         case SHMEM_HUGE_ALWAYS:
>                 if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>                         return BIT(HPAGE_PMD_ORDER);
>
>                 return shmem_mapping_size_order(inode->i_mapping, index,
>                                                 write_end);
>         case SHMEM_HUGE_WITHIN_SIZE:
>                 if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>                         within_size_orders = BIT(HPAGE_PMD_ORDER);
>                 else
>                         within_size_orders =
>                                 shmem_mapping_size_order(inode->i_mapping,
>                                                          index, write_end);
>
>                 order = highest_order(within_size_orders);
>                 while (within_size_orders) {
>                         index = round_up(index + 1, 1 << order);
>                         i_size = max(write_end, i_size_read(inode));
>                         i_size = round_up(i_size, PAGE_SIZE);
>                         if (i_size >> PAGE_SHIFT >= index)
>                                 return within_size_orders;
>
>                         order = next_order(&within_size_orders, order);
>                 }
>                 fallthrough;
>         case SHMEM_HUGE_ADVISE:
>                 if (vm_flags & VM_HUGEPAGE) {
>                         if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>                                 return BIT(HPAGE_PMD_ORDER);
>
>                         return shmem_mapping_size_order(inode->i_mapping,
>                                                         index, write_end);
>                 }
>                 fallthrough;
>         default:
>                 return 0;
>         }
> }
>
> 1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep ‘huge=’
> mount option compatibility.
> 2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
> get the possible huge orders based on the write size.
> 3) For tmpfs mmap() faults, always use the PMD-sized huge order.
> 4) For shmem, ignore the write size logic and always use PMD-sized THP
> to check whether global huge is enabled.
>
> However, in case 2), if 'huge=always' is set and the write size is less
> than 4K, we will allocate small pages -- doesn't that break the 'huge'
> semantics? Maybe it's not something to worry too much about.

Probably I didn't express clearly what I think we should do, because this
is not quite what I had in mind.

I would use the CONFIG_USE_ONLY_THP_FOR_TMPFS way of doing it only if
really required. As raised, if someone needs finer control, providing
that only for a single size is rather limiting.

This is what I hope we can do (doc update to show what I mean):

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5034915f4e8e8..d7d1a9acdbfc5 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -349,11 +349,24 @@ user, the PMD_ORDER hugepage policy will be overridden. If the policy for
 PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
 default to ``never``.
 
-Hugepages in tmpfs/shmem
-========================
+tmpfs/shmem
+===========
 
-You can control hugepage allocation policy in tmpfs with mount option
-``huge=``. It can have following values:
+Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
+it also supports smaller sizes just like anonymous memory, often referred
+to as "multi-size THP" (mTHP). Huge pages of any size are commonly
+represented in the kernel as "large folios".
+
+While there is fine control over the huge page sizes to use for the internal
+shmem mount (see below), ordinary tmpfs mounts will make use of all
+available huge page sizes without any control over the exact sizes,
+behaving more like other file systems.
+
+tmpfs mounts
+------------
+
+The THP allocation policy for tmpfs mounts can be adjusted using the mount
+option: ``huge=``. It can have following values:
 
 always
     Attempt to allocate huge pages every time we need a new page;
@@ -368,19 +381,20 @@ within_size
 advise
     Only allocate huge pages if requested with fadvise()/madvise();
 
-The default policy is ``never``.
+Remember that the kernel may use huge pages of all available sizes, and
+that no fine control as for the internal tmpfs mount is available.
+
+The default policy in the past was ``never``, but it can now be adjusted
+using the CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_ALWAYS,
+CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_NEVER etc.
 
 ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
 ``huge=never`` will not attempt to break up huge pages at all, just stop more
 from being allocated.
 
-There's also sysfs knob to control hugepage allocation policy for internal
-shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
-is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
-MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
-
-In addition to policies listed above, shmem_enabled allows two further
-values:
+In addition to policies listed above, the sysfs knob
+/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
+allocation policy of tmpfs mounts, when set to the following values:
 
 deny
     For use in emergencies, to force the huge option off from
@@ -388,13 +402,26 @@ deny
 force
     Force the huge option on for all - very useful for testing;
 
-Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
-control mTHP allocation:
-'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
-and its value for each mTHP is essentially consistent with the global
-setting. An 'inherit' option is added to ensure compatibility with these
-global settings. Conversely, the options 'force' and 'deny' are dropped,
-which are rather testing artifacts from the old ages.
+
+shmem / internal tmpfs
+----------------------
+
+The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
+mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+To control the THP allocation policy for this internal tmpfs mount, the
+sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
+per THP size in
+'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
+can be used.
+
+The global knob has the same semantics as the ``huge=`` mount options
+for tmpfs mounts, except that the different huge page sizes can be controlled
+individually, and will only use the setting of the global knob when the
+per-size knob is set to 'inherit'.
+
+The options 'force' and 'deny' are dropped for the individual sizes, which
+are rather testing artifacts from the old ages.
 
 always
     Attempt to allocate <size> huge pages every time we need a new page;
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 56a26c843dbe9..10de8f706d07b 100644

There is this question of "do we need the old way of doing it and only
allocate PMDs". For that, likely a config similar to the one you propose
might make sense, but I would want to see if there is real demand for
that.
In particular: for whom the smaller sizes are a problem when bigger (PMD) sizes were enabled in the past. -- Cheers, David / dhildenb
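[As an aside for readers following the thread, the SHMEM_HUGE_WITHIN_SIZE walk above is compact enough to model in user space. The sketch below reimplements highest_order()/next_order() with compiler builtins, passes i_size and write_end as plain parameters, and computes the order-aligned index into a local variable each iteration; it is an illustration of the intended logic under those assumptions, not kernel code.]

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Index of the highest set bit, i.e. the largest candidate order. */
static unsigned int highest_order(unsigned long orders)
{
	return 8 * sizeof(orders) - 1 - __builtin_clzl(orders);
}

/* Clear the failed order and move on to the next-highest one. */
static unsigned int next_order(unsigned long *orders, unsigned int prev)
{
	*orders &= ~(1UL << prev);
	return *orders ? highest_order(*orders) : 0;
}

/* Round 'v' up to a multiple of the power-of-two 'align'. */
static unsigned long round_up_to(unsigned long v, unsigned long align)
{
	return (v + align - 1) & ~(align - 1);
}

/*
 * Walk the candidate orders from highest to lowest; once an order-aligned
 * folio at 'index' would lie fully within max(i_size, write_end), that
 * order and all smaller ones are allowed.
 */
static unsigned long within_size_filter(unsigned long index, long long i_size,
					long long write_end, unsigned long orders)
{
	unsigned int order = highest_order(orders);

	while (orders) {
		unsigned long aligned = round_up_to(index + 1, 1UL << order);
		long long size = write_end > i_size ? write_end : i_size;

		size = round_up_to(size, PAGE_SIZE);
		if ((unsigned long)(size >> PAGE_SHIFT) >= aligned)
			return orders;
		order = next_order(&orders, order);
	}
	return 0;
}

int main(void)
{
	/* 2MB file: every order up to 9 fits, the full mask survives. */
	printf("%#lx\n", within_size_filter(0, 2LL << 20, 0, 0x3ff)); /* 0x3ff */
	/* 1MB file: order 9 (2MB) is filtered out, orders 0-8 remain. */
	printf("%#lx\n", within_size_filter(0, 1LL << 20, 0, 0x3ff)); /* 0x1ff */
	return 0;
}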
On 2024/11/5 22:56, David Hildenbrand wrote:
> On 05.11.24 13:45, Baolin Wang wrote:
>>
>>
>> On 2024/10/31 18:46, David Hildenbrand wrote:
>> [snip]
>>
>>>>> I don't like that:
>>>>>
>>>>> (a) there is no way to explicitly enable/name that new behavior.
>>>>
>>>> But this is similar to other file systems that enable large folios
>>>> (setting mapping_set_large_folios()), and I haven't seen any other file
>>>> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
>>>> a bit special?
>>>
>>> I'm afraid I don't have the energy to explain once more why I think
>>> tmpfs is not just like any other file system in some cases.
>>>
>>> And distributions are rather careful when it comes to something like
>>> this ...
>>>
>>>>
>>>> If we all agree that tmpfs is a bit special when using huge pages, then
>>>> fine, a Kconfig option might be needed.
>>>>
>>>>> (b) "always" etc. are only concerned about PMDs.
>>>>
>>>> Yes, currently maintain the same semantics as before, in case users
>>>> still expect THPs.
>>>
>>> Again, I don't think that is a reasonable approach to make PMD-sized
>>> ones special here. It will all get seriously confusing and inconsistent.
>>
>> I agree PMD-sized should not be special. This is all for backward
>> compatibility with the 'huge=' mount option, and adding a new kconfig is
>> also for this purpose.
>>
>>> THPs are opportunistic after all, and page fault behavior will remain
>>> unchanged (PMD-sized) for now. And even if we support other sizes during
>>> page faults, we'd like start with the largest size (PMD-size) first, and
>>> it likely might just all work better than before.
>>>
>>> Happy to learn where this really makes a difference.
>>>
>>> Of course, if you change the default behavior (which you are planning),
>>> it's ... a changed default.
>>>
>>> If there are reasons to have more tunables regarding the sizes to use,
>>> then it should not be limited to PMD-size.
>>
>> I have tried to modify the code according to your suggestion (not tested
>> yet). Is this what you had in mind?
>>
>> static inline unsigned int
>> shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
>>                          loff_t write_end)
>> {
>>         unsigned int order;
>>         size_t size;
>>
>>         if (!mapping_large_folio_support(mapping) || !write_end)
>>                 return 0;
>>
>>         /* Calculate the write size based on the write_end */
>>         size = write_end - (index << PAGE_SHIFT);
>>         order = filemap_get_order(size);
>>         if (!order)
>>                 return 0;
>>
>>         /* If we're not aligned, allocate a smaller folio */
>>         if (index & ((1UL << order) - 1))
>>                 order = __ffs(index);
>>
>>         order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
>>         return order > 0 ? BIT(order + 1) - 1 : 0;
>> }
>>
>> static unsigned int shmem_huge_global_enabled(struct inode *inode,
>>                                               pgoff_t index, loff_t write_end,
>>                                               bool shmem_huge_force,
>>                                               unsigned long vm_flags)
>> {
>>         bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
>>         unsigned long within_size_orders;
>>         unsigned int order;
>>         loff_t i_size;
>>
>>         if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
>>                 return 0;
>>         if (!S_ISREG(inode->i_mode))
>>                 return 0;
>>         if (shmem_huge == SHMEM_HUGE_DENY)
>>                 return 0;
>>         if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>>                 return BIT(HPAGE_PMD_ORDER);
>>
>>         switch (SHMEM_SB(inode->i_sb)->huge) {
>>         case SHMEM_HUGE_NEVER:
>>                 return 0;
>>         case SHMEM_HUGE_ALWAYS:
>>                 if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>>                         return BIT(HPAGE_PMD_ORDER);
>>
>>                 return shmem_mapping_size_order(inode->i_mapping,
>>                                                 index, write_end);
>>         case SHMEM_HUGE_WITHIN_SIZE:
>>                 if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>>                         within_size_orders = BIT(HPAGE_PMD_ORDER);
>>                 else
>>                         within_size_orders =
>>                                 shmem_mapping_size_order(inode->i_mapping,
>>                                                          index, write_end);
>>
>>                 order = highest_order(within_size_orders);
>>                 while (within_size_orders) {
>>                         index = round_up(index + 1, 1 << order);
>>                         i_size = max(write_end, i_size_read(inode));
>>                         i_size = round_up(i_size, PAGE_SIZE);
>>                         if (i_size >> PAGE_SHIFT >= index)
>>                                 return within_size_orders;
>>
>>                         order = next_order(&within_size_orders, order);
>>                 }
>>                 fallthrough;
>>         case SHMEM_HUGE_ADVISE:
>>                 if (vm_flags & VM_HUGEPAGE) {
>>                         if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>>                                 return BIT(HPAGE_PMD_ORDER);
>>
>>                         return shmem_mapping_size_order(inode->i_mapping,
>>                                                         index, write_end);
>>                 }
>>                 fallthrough;
>>         default:
>>                 return 0;
>>         }
>> }
>>
>> 1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep 'huge='
>>    mount option compatibility.
>> 2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
>>    then it will get the possible huge orders based on the write size.
>> 3) For tmpfs mmap() fault, always use a PMD-sized huge order.
>> 4) For shmem, ignore the write size logic and always use PMD-sized THP
>>    to check if the global huge is enabled.
>>
>> However, in case 2), if 'huge=always' and the write size is less than 4K,
>> we will allocate small pages; will that break the 'huge' semantics?
>> Maybe it's not something to worry too much about.
>
> Probably I didn't express clearly what I think we should do, because this
> is not quite what I had in mind.
>
> I would use the CONFIG_USE_ONLY_THP_FOR_TMPFS way of doing it only if
> really required. As raised, if someone needs finer control, providing that
> only for a single size is rather limiting.

OK. I misunderstood your points.

> This is what I hope we can do (doc update to show what I mean):

Thanks for updating the doc. I'd like to include these updates in the next
version.

> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 5034915f4e8e8..d7d1a9acdbfc5 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -349,11 +349,24 @@ user, the PMD_ORDER hugepage policy will be overridden. If the policy for
>  PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
>  default to ``never``.
>  
> -Hugepages in tmpfs/shmem
> -========================
> +tmpfs/shmem
> +===========
>  
> -You can control hugepage allocation policy in tmpfs with mount option
> -``huge=``. It can have following values:
> +Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
> +it also supports smaller sizes just like anonymous memory, often referred
> +to as "multi-size THP" (mTHP). Huge pages of any size are commonly
> +represented in the kernel as "large folios".
> +
> +While there is fine control over the huge page sizes to use for the internal
> +shmem mount (see below), ordinary tmpfs mounts will make use of all
> +available huge page sizes without any control over the exact sizes,
> +behaving more like other file systems.
> +
> +tmpfs mounts
> +------------
> +
> +The THP allocation policy for tmpfs mounts can be adjusted using the mount
> +option: ``huge=``. It can have following values:
>  
>  always
>      Attempt to allocate huge pages every time we need a new page;
> @@ -368,19 +381,20 @@ within_size
>  advise
>      Only allocate huge pages if requested with fadvise()/madvise();
>  
> -The default policy is ``never``.
> +Remember that the kernel may use huge pages of all available sizes, and
> +that no fine control as for the internal tmpfs mount is available.
> +
> +The default policy in the past was ``never``, but it can now be adjusted
> +using the CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_ALWAYS,
> +CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_NEVER etc.
>  
>  ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
>  ``huge=never`` will not attempt to break up huge pages at all, just stop more
>  from being allocated.
>  
> -There's also sysfs knob to control hugepage allocation policy for internal
> -shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
> -is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
> -MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
> -
> -In addition to policies listed above, shmem_enabled allows two further
> -values:
> +In addition to policies listed above, the sysfs knob
> +/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
> +allocation policy of tmpfs mounts, when set to the following values:
>  
>  deny
>      For use in emergencies, to force the huge option off from
> @@ -388,13 +402,26 @@ deny
>  force
>      Force the huge option on for all - very useful for testing;
>  
> -Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
> -control mTHP allocation:
> -'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
> -and its value for each mTHP is essentially consistent with the global
> -setting. An 'inherit' option is added to ensure compatibility with these
> -global settings. Conversely, the options 'force' and 'deny' are dropped,
> -which are rather testing artifacts from the old ages.
> +
> +shmem / internal tmpfs
> +----------------------
> +
> +The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
> +mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
> +
> +To control the THP allocation policy for this internal tmpfs mount, the
> +sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
> +per THP size in
> +'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
> +can be used.
> +
> +The global knob has the same semantics as the ``huge=`` mount options
> +for tmpfs mounts, except that the different huge page sizes can be controlled
> +individually, and will only use the setting of the global knob when the
> +per-size knob is set to 'inherit'.
> +
> +The options 'force' and 'deny' are dropped for the individual sizes, which
> +are rather testing artifacts from the old ages.
>  
>  always
>      Attempt to allocate <size> huge pages every time we need a new page;
> diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
> index 56a26c843dbe9..10de8f706d07b 100644
>
>
>
> There is still the question of "do we need the old way of doing it and only
> allocate PMDs". For that, a config similar to the one you propose might
> make sense, but I would want to see whether there is real demand for it.
> In particular: for whom are the smaller sizes a problem when bigger (PMD)
> sizes were enabled in the past?

I am also not sure if such a case exists. I can remove this kconfig for
now, and we can consider it again if someone really complains about this
in the future.
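[As a companion to the helper quoted in full above, here is a user-space sketch of the write-size-to-orders calculation in shmem_mapping_size_order(). filemap_get_order() and __ffs() are replaced with local stand-ins, the mapping_large_folio_support() check is dropped, and MAX_PAGECACHE_ORDER is assumed to be 9 (a 4K-page configuration); treat the names and numbers outside the quoted code as illustrative.]

#include <stdio.h>
#include <stddef.h>

#define PAGE_SHIFT          12
#define PAGE_SIZE           (1UL << PAGE_SHIFT)
#define MAX_PAGECACHE_ORDER 9	/* assumed: PMD order with 4K pages */

/* Local stand-in for filemap_get_order(): largest order fitting 'size'. */
static unsigned int get_order_for_size(size_t size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << (order + 1)) <= size)
		order++;
	return order;
}

/* Sketch of shmem_mapping_size_order(): bitmap of orders for this write. */
static unsigned long write_size_orders(unsigned long index, long long write_end)
{
	size_t size;
	unsigned int order;

	if (!write_end)
		return 0;

	/* Calculate the write size based on write_end. */
	size = write_end - (index << PAGE_SHIFT);
	order = get_order_for_size(size);
	if (!order)
		return 0;

	/* If the start index is unaligned, fall back to its alignment. */
	if (index & ((1UL << order) - 1))
		order = __builtin_ctzl(index);	/* __ffs() stand-in */

	if (order > MAX_PAGECACHE_ORDER)
		order = MAX_PAGECACHE_ORDER;
	return order > 0 ? (1UL << (order + 1)) - 1 : 0;
}

int main(void)
{
	/* 2MB write at index 0: all orders up to 9 are candidates. */
	printf("%#lx\n", write_size_orders(0, 2LL << 20));	    /* 0x3ff */
	/* 64KB write starting at page index 4: capped at order 2. */
	printf("%#lx\n", write_size_orders(4, 20LL << PAGE_SHIFT)); /* 0x7 */
	/* A single 4KB write yields no large orders at all. */
	printf("%#lx\n", write_size_orders(0, 1LL << PAGE_SHIFT));  /* 0 */
	return 0;
}

[The last case is exactly the concern raised above: under a purely write-size-based policy, a stream of 4K writes never qualifies for a large folio, regardless of the mount options.]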
On 2024/10/24 18:49, Daniel Gomez wrote: > On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote: >> On 23.10.24 10:04, Baolin Wang wrote: >>> >>> >>> On 2024/10/22 23:31, David Hildenbrand wrote: >>>> On 22.10.24 05:41, Baolin Wang wrote: >>>>> >>>>> >>>>> On 2024/10/21 21:34, Daniel Gomez wrote: >>>>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote: >>>>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote: >>>>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: >>>>>>>>>> + Kirill >>>>>>>>>> >>>>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote: >>>>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: >>>>>>>>>>>> Considering that tmpfs already has the 'huge=' option to >>>>>>>>>>>> control the THP >>>>>>>>>>>> allocation, it is necessary to maintain compatibility with the >>>>>>>>>>>> 'huge=' >>>>>>>>>>>> option, as well as considering the 'deny' and 'force' option >>>>>>>>>>>> controlled >>>>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >>>>>>>>>>> >>>>>>>>>>> No, it's not. No other filesystem honours these settings. >>>>>>>>>>> tmpfs would >>>>>>>>>>> not have had these settings if it were written today. It should >>>>>>>>>>> simply >>>>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option >>>>>>>>>>> now that >>>>>>>>>>> we have a better solution to the original problem. >>>>>>>>>>> >>>>>>>>>>> To reiterate my position: >>>>>>>>>>> >>>>>>>>>>> - When using tmpfs as a filesystem, it should behave like >>>>>>>>>>> other >>>>>>>>>>> filesystems. >>>>>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, >>>>>>>>>>> it should >>>>>>>>>>> behave like anonymous memory. >>>>>>>>>> >>>>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option >>>>>>>>>> has >>>>>>>>>> existed for nearly 8 years, and the huge orders based on write >>>>>>>>>> size may not >>>>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such >>>>>>>>>> as when the >>>>>>>>>> write length is consistently 4K. So, I am still concerned that >>>>>>>>>> ignoring the >>>>>>>>>> 'huge' option could lead to compatibility issues. >>>>>>>>> >>>>>>>>> Yeah, I don't think we are there yet to ignore the mount option. >>>>>>>> >>>>>>>> OK. >>>>>>>> >>>>>>>>> Maybe we need to get a new generic interface to request the semantics >>>>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of >>>>>>>>> FADV_* >>>>>>>>> handles to make kernel allocate PMD-size folio on any allocation >>>>>>>>> or on >>>>>>>>> allocations within i_size. I think this behaviour is useful beyond >>>>>>>>> tmpfs. >>>>>>>>> >>>>>>>>> Then huge= implementation for tmpfs can be re-defined to set these >>>>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs >>>>>>>>> compatible >>>>>>>>> with current deployments and less special comparing to rest of >>>>>>>>> filesystems on kernel side. >>>>>>>> >>>>>>>> I did a quick search, and I didn't find any other fs that require >>>>>>>> PMD-sized >>>>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems >>>>>>>> other than >>>>>>>> tmpfs. Please correct me if I missed something. >>>>>>> >>>>>>> What do you mean by "require"? THPs are always opportunistic. >>>>>>> >>>>>>> IIUC, we don't have a way to hint kernel to use huge pages for a >>>>>>> file on >>>>>>> read from backing storage. Readahead is not always the right way. 
>>>>>>>
>>>>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>>>>>>> filesystems.
>>>>>>>>
>>>>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
>>>>>>>> allocate large folios based on the write size? If yes, that means it
>>>>>>>> will change the default huge behavior for tmpfs. Because previously
>>>>>>>> having 'huge=' is not set means the huge option is 'SHMEM_HUGE_NEVER',
>>>>>>>> which is similar to what I mentioned:
>>>>>>>> "Another possible choice is to make the huge pages allocation based
>>>>>>>> on write size as the *default* behavior for tmpfs, ..."
>>>>>>>
>>>>>>> I am more worried about breaking existing users of huge pages. So
>>>>>>> changing behaviour of users who don't specify huge is okay to me.
>>>>>>
>>>>>> I think moving tmpfs to allocate large folios opportunistically by
>>>>>> default (as it was proposed initially) doesn't necessary conflict with
>>>>>> the default behaviour (huge=never). We just need to clarify that in
>>>>>> the documentation.
>>>>>>
>>>>>> However, and IIRC, one of the requests from Hugh was to have a way to
>>>>>> disable large folios which is something other FS do not have control
>>>>>> of as of today. Ryan sent a proposal to actually control that globally
>>>>>> but I think it didn't move forward. So, what are we missing to go back
>>>>>> to implement large folios in tmpfs in the default case, as any other fs
>>>>>> leveraging large folios?
>>>>>
>>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>> allocate large folios based on the write size.
>
> So, in order to make tmpfs behave like other filesystems, we need to
> allocate large folios by default. Not setting 'huge=' is the same as
> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> control THP, not large folios, so it should not have a conflict here, or
> else, what case are you thinking?
>
> So, to make tmpfs behave like other filesystems, we need to allocate
> large folios by default. According to the documentation, not setting

Right.

> 'huge=' is the same as setting 'huge=never.' However, 'huge=' is

I will update the documentation in the next version. That means that if the
'huge=' option is not set, we can still allocate large folios based on the
write size (which is not the same as setting 'huge=never').

> intended to control THP, not large folios, so there shouldn't be
> a conflict in this case. Can you clarify what specific scenario or

Yes, we should still keep the same semantics for the
'huge=always/within_size/advise' settings, which only control THP
allocations.

> conflict you're considering here? Perhaps when large folios order is the
> same as PMD-size?
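[To make the semantics being converged on here concrete, the distinction could look roughly like this from user space. This is a sketch of the *proposed* behaviour discussed in this thread, not current mainline; the mount points are made up, and mounting requires CAP_SYS_ADMIN.]

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* 'huge=never' keeps its meaning: no huge pages, no large folios. */
	if (mount("tmpfs", "/mnt/tmpfs-never", "tmpfs", 0, "huge=never"))
		perror("huge=never");

	/*
	 * No 'huge=' option: under the proposal, large folios would be
	 * sized by the write length, while PMD-sized THP stays off.
	 */
	if (mount("tmpfs", "/mnt/tmpfs-default", "tmpfs", 0, NULL))
		perror("default");

	/* 'huge=always' keeps its historical PMD-sized THP meaning. */
	if (mount("tmpfs", "/mnt/tmpfs-always", "tmpfs", 0, "huge=always"))
		perror("huge=always");
	return 0;
}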
On Thu Oct 24, 2024 at 12:49 PM CEST, Daniel Gomez wrote: > On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote: > > On 23.10.24 10:04, Baolin Wang wrote: > > > > > > > > > On 2024/10/22 23:31, David Hildenbrand wrote: > > >> On 22.10.24 05:41, Baolin Wang wrote: > > >>> > > >>> > > >>> On 2024/10/21 21:34, Daniel Gomez wrote: > > >>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote: > > >>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote: > > >>>>>> > > >>>>>> > > >>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote: > > >>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote: > > >>>>>>>> + Kirill > > >>>>>>>> > > >>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote: > > >>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote: > > >>>>>>>>>> Considering that tmpfs already has the 'huge=' option to > > >>>>>>>>>> control the THP > > >>>>>>>>>> allocation, it is necessary to maintain compatibility with the > > >>>>>>>>>> 'huge=' > > >>>>>>>>>> option, as well as considering the 'deny' and 'force' option > > >>>>>>>>>> controlled > > >>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. > > >>>>>>>>> > > >>>>>>>>> No, it's not. No other filesystem honours these settings. > > >>>>>>>>> tmpfs would > > >>>>>>>>> not have had these settings if it were written today. It should > > >>>>>>>>> simply > > >>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option > > >>>>>>>>> now that > > >>>>>>>>> we have a better solution to the original problem. > > >>>>>>>>> > > >>>>>>>>> To reiterate my position: > > >>>>>>>>> > > >>>>>>>>> - When using tmpfs as a filesystem, it should behave like > > >>>>>>>>> other > > >>>>>>>>> filesystems. > > >>>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, > > >>>>>>>>> it should > > >>>>>>>>> behave like anonymous memory. > > >>>>>>>> > > >>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option > > >>>>>>>> has > > >>>>>>>> existed for nearly 8 years, and the huge orders based on write > > >>>>>>>> size may not > > >>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such > > >>>>>>>> as when the > > >>>>>>>> write length is consistently 4K. So, I am still concerned that > > >>>>>>>> ignoring the > > >>>>>>>> 'huge' option could lead to compatibility issues. > > >>>>>>> > > >>>>>>> Yeah, I don't think we are there yet to ignore the mount option. > > >>>>>> > > >>>>>> OK. > > >>>>>> > > >>>>>>> Maybe we need to get a new generic interface to request the semantics > > >>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of > > >>>>>>> FADV_* > > >>>>>>> handles to make kernel allocate PMD-size folio on any allocation > > >>>>>>> or on > > >>>>>>> allocations within i_size. I think this behaviour is useful beyond > > >>>>>>> tmpfs. > > >>>>>>> > > >>>>>>> Then huge= implementation for tmpfs can be re-defined to set these > > >>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs > > >>>>>>> compatible > > >>>>>>> with current deployments and less special comparing to rest of > > >>>>>>> filesystems on kernel side. > > >>>>>> > > >>>>>> I did a quick search, and I didn't find any other fs that require > > >>>>>> PMD-sized > > >>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems > > >>>>>> other than > > >>>>>> tmpfs. Please correct me if I missed something. > > >>>>> > > >>>>> What do you mean by "require"? THPs are always opportunistic. 
> > >>>>> > > >>>>> IIUC, we don't have a way to hint kernel to use huge pages for a > > >>>>> file on > > >>>>> read from backing storage. Readahead is not always the right way. > > >>>>> > > >>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of > > >>>>>>> filesystems. > > >>>>>> > > >>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still > > >>>>>> allocate large > > >>>>>> folios based on the write size? If yes, that means it will change the > > >>>>>> default huge behavior for tmpfs. Because previously having 'huge=' > > >>>>>> is not > > >>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar > > >>>>>> to what I > > >>>>>> mentioned: > > >>>>>> "Another possible choice is to make the huge pages allocation based > > >>>>>> on write > > >>>>>> size as the *default* behavior for tmpfs, ..." > > >>>>> > > >>>>> I am more worried about breaking existing users of huge pages. So > > >>>>> changing > > >>>>> behaviour of users who don't specify huge is okay to me. > > >>>> > > >>>> I think moving tmpfs to allocate large folios opportunistically by > > >>>> default (as it was proposed initially) doesn't necessary conflict with > > >>>> the default behaviour (huge=never). We just need to clarify that in > > >>>> the documentation. > > >>>> > > >>>> However, and IIRC, one of the requests from Hugh was to have a way to > > >>>> disable large folios which is something other FS do not have control > > >>>> of as of today. Ryan sent a proposal to actually control that globally > > >>>> but I think it didn't move forward. So, what are we missing to go back > > >>>> to implement large folios in tmpfs in the default case, as any other fs > > >>>> leveraging large folios? > > >>> > > >>> IMHO, as I discussed with Kirill, we still need maintain compatibility > > >>> with the 'huge=' mount option. This means that if 'huge=never' is set > > >>> for tmpfs, huge page allocation will still be prohibited (which can > > >>> address Hugh's request?). However, if 'huge=' is not set, we can > > >>> allocate large folios based on the write size. > > So, in order to make tmpfs behave like other filesystems, we need to > allocate large folios by default. Not setting 'huge=' is the same as > setting it to 'huge=never' as per documentation. But 'huge=' is meant to > control THP, not large folios, so it should not have a conflict here, or > else, what case are you thinking? > > So, to make tmpfs behave like other filesystems, we need to allocate > large folios by default. According to the documentation, not setting > 'huge=' is the same as setting 'huge=never.' However, 'huge=' is > intended to control THP, not large folios, so there shouldn't be > a conflict in this case. Can you clarify what specific scenario or > conflict you're considering here? Perhaps when large folios order is the > same as PMD-size? Sorry for duplicate paragraph. > > > >> > > >> I consider allocating large folios in shmem/tmpfs on the write path less > > >> controversial than allocating them on the page fault path -- especially > > >> as long as we stay within the size to-be-written. > > >> > > >> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g., > > >> shmem_enabled=never). Maybe because of some rather undesired > > >> side-effects (maybe some are historical?): I recall issues with VMs with > > >> THP+ memory ballooning, as we cannot reclaim pages of folios if > > >> splitting fails). 
I assume most of these problematic use cases don't use > > >> tmpfs as an ordinary file system (write()/read()), but mmap() the whole > > >> thing. > > >> > > >> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL > > >> documentation; most documentation is only concerned about anon THP. > > >> Which makes me conclude that they are not suggested as of now. > > >> > > >> I see more issues with allocating them on the page fault path and not > > >> having a way to disable it -- compared to allocating them on the write() > > >> path. > > > > > > I may not understand your issues. IIUC, you can disable allocating huge > > > pages on the page fault path by using the 'huge=never' mount option or > > > setting shmem_enabled=deny. No? > > > > That's what I am saying: if there is some way to disable it that will > > keep working, great. > > I agree. That aligns with what I recall Hugh requested. However, I > believe if that is the way to go, we shouldn't limit it to tmpfs. > Otherwise, why should tmpfs be prevented from allocating large folios if > other filesystems in the system are allowed to allocate them? I think, > if we want to disable large folios we should make it more generic, > something similar to Ryan's proposal [1] for controlling folio sizes. > > [1] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/ > > That said, there has already been disagreement on this point here [2]. > > [2] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/
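[For reference, the knobs being discussed can be inspected with a few lines of C (or a plain `cat`). The hugepages-2048kB directory is only an example; which per-size directories exist depends on the architecture and page size.]

#include <stdio.h>

/* Print the current value of one sysfs policy knob. */
static void dump_knob(const char *path)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	if (fgets(line, sizeof(line), f))
		printf("%s: %s", path, line);
	fclose(f);
}

int main(void)
{
	/* Global policy knob for the internal shmem mount (and, per this
	 * thread, it also affects tmpfs mounts via 'deny'/'force'). */
	dump_knob("/sys/kernel/mm/transparent_hugepage/shmem_enabled");
	/* Per-size knob; 2048kB is just an example size. */
	dump_knob("/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/shmem_enabled");
	return 0;
}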
On 2024/10/10 17:58, Baolin Wang wrote: > Hi, > > This RFC patch series attempts to support large folios for tmpfs. > > Considering that tmpfs already has the 'huge=' option to control the THP > allocation, it is necessary to maintain compatibility with the 'huge=' > option, as well as considering the 'deny' and 'force' option controlled > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. > > Add a new huge option 'write_size' to support large folio allocation based > on the write size for tmpfs write and fallocate paths. So the huge pages > allocation strategy for tmpfs is that, if the 'huge=' option > (huge=always/within_size/advise) is enabled or the 'shmem_enabled' option > is 'force', it need just allow PMD sized THP to keep backward compatibility > for tmpfs. While 'huge=' option is disabled (huge=never) or the 'shmem_enabled' > option is 'deny', it will still disable any large folio allocations. Only > when the 'huge=' option is 'write_size', it will allow allocating large > folios based on the write size. > > And I think the 'huge=write_size' option should be the default behavior > for tmpfs in future. Could we avoid new huge= option for tmpfs, maybe support other orders for both read/write/fallocate if mount with huge? > > Any comments and suggestions are appreciated. Thanks. > > Changes from RFC v2: > - Drop mTHP interfaces to control huge page allocation, per Matthew. > - Add a new helper to calculate the order, suggested by Matthew. > - Add a new huge=write_size option to allocate large folios based on > the write size. > - Add a new patch to update the documentation. > > Changes from RFC v1: > - Drop patch 1. > - Use 'write_end' to calculate the length in shmem_allowable_huge_orders(). > - Update shmem_mapping_size_order() per Daniel. > > Baolin Wang (4): > mm: factor out the order calculation into a new helper > mm: shmem: change shmem_huge_global_enabled() to return huge order > bitmap > mm: shmem: add large folio support to the write and fallocate paths > for tmpfs > docs: tmpfs: add documention for 'write_size' huge option > > Documentation/filesystems/tmpfs.rst | 7 +- > include/linux/pagemap.h | 16 ++++- > mm/shmem.c | 105 ++++++++++++++++++++-------- > 3 files changed, 94 insertions(+), 34 deletions(-) >
On 2024/10/16 15:49, Kefeng Wang wrote: > > > On 2024/10/10 17:58, Baolin Wang wrote: >> Hi, >> >> This RFC patch series attempts to support large folios for tmpfs. >> >> Considering that tmpfs already has the 'huge=' option to control the THP >> allocation, it is necessary to maintain compatibility with the 'huge=' >> option, as well as considering the 'deny' and 'force' option controlled >> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >> >> Add a new huge option 'write_size' to support large folio allocation >> based >> on the write size for tmpfs write and fallocate paths. So the huge pages >> allocation strategy for tmpfs is that, if the 'huge=' option >> (huge=always/within_size/advise) is enabled or the 'shmem_enabled' option >> is 'force', it need just allow PMD sized THP to keep backward >> compatibility >> for tmpfs. While 'huge=' option is disabled (huge=never) or the >> 'shmem_enabled' >> option is 'deny', it will still disable any large folio allocations. Only >> when the 'huge=' option is 'write_size', it will allow allocating large >> folios based on the write size. >> >> And I think the 'huge=write_size' option should be the default behavior >> for tmpfs in future. > > Could we avoid new huge= option for tmpfs, maybe support other orders > for both read/write/fallocate if mount with huge? Um, I am afraid not, as that would break the 'huge=' compatibility. That is to say, users still want PMD-sized huge pages if 'huge=always'.
On 2024/10/16 17:29, Baolin Wang wrote: > > > On 2024/10/16 15:49, Kefeng Wang wrote: >> >> >> On 2024/10/10 17:58, Baolin Wang wrote: >>> Hi, >>> >>> This RFC patch series attempts to support large folios for tmpfs. >>> >>> Considering that tmpfs already has the 'huge=' option to control the THP >>> allocation, it is necessary to maintain compatibility with the 'huge=' >>> option, as well as considering the 'deny' and 'force' option controlled >>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >>> >>> Add a new huge option 'write_size' to support large folio allocation >>> based >>> on the write size for tmpfs write and fallocate paths. So the huge pages >>> allocation strategy for tmpfs is that, if the 'huge=' option >>> (huge=always/within_size/advise) is enabled or the 'shmem_enabled' >>> option >>> is 'force', it need just allow PMD sized THP to keep backward >>> compatibility >>> for tmpfs. While 'huge=' option is disabled (huge=never) or the >>> 'shmem_enabled' >>> option is 'deny', it will still disable any large folio allocations. >>> Only >>> when the 'huge=' option is 'write_size', it will allow allocating large >>> folios based on the write size. >>> >>> And I think the 'huge=write_size' option should be the default behavior >>> for tmpfs in future. >> >> Could we avoid new huge= option for tmpfs, maybe support other orders >> for both read/write/fallocate if mount with huge? > > Um, I am afraid not, as that would break the 'huge=' compatibility. That > is to say, users still want PMD-sized huge pages if 'huge=always'. Yes, compatibility maybe an issue, but only write/fallocate side support large folio is a little strange, maybe a new mode to support both read/ write/fallocate?
On 2024/10/16 21:45, Kefeng Wang wrote: > > > On 2024/10/16 17:29, Baolin Wang wrote: >> >> >> On 2024/10/16 15:49, Kefeng Wang wrote: >>> >>> >>> On 2024/10/10 17:58, Baolin Wang wrote: >>>> Hi, >>>> >>>> This RFC patch series attempts to support large folios for tmpfs. >>>> >>>> Considering that tmpfs already has the 'huge=' option to control the >>>> THP >>>> allocation, it is necessary to maintain compatibility with the 'huge=' >>>> option, as well as considering the 'deny' and 'force' option controlled >>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'. >>>> >>>> Add a new huge option 'write_size' to support large folio allocation >>>> based >>>> on the write size for tmpfs write and fallocate paths. So the huge >>>> pages >>>> allocation strategy for tmpfs is that, if the 'huge=' option >>>> (huge=always/within_size/advise) is enabled or the 'shmem_enabled' >>>> option >>>> is 'force', it need just allow PMD sized THP to keep backward >>>> compatibility >>>> for tmpfs. While 'huge=' option is disabled (huge=never) or the >>>> 'shmem_enabled' >>>> option is 'deny', it will still disable any large folio allocations. >>>> Only >>>> when the 'huge=' option is 'write_size', it will allow allocating large >>>> folios based on the write size. >>>> >>>> And I think the 'huge=write_size' option should be the default behavior >>>> for tmpfs in future. >>> >>> Could we avoid new huge= option for tmpfs, maybe support other orders >>> for both read/write/fallocate if mount with huge? >> >> Um, I am afraid not, as that would break the 'huge=' compatibility. >> That is to say, users still want PMD-sized huge pages if 'huge=always'. > > Yes, compatibility maybe an issue, but only write/fallocate side support > large folio is a little strange, maybe a new mode to support both read/ > write/fallocate? Because tmpfs read() will not allocate folios for tmpfs holes, and will use ZERO_PAGE instead. If the shmem folios are swapped out, and now we will always swapin base page, which is another story... For tmpfs mmap() read, we do not have a length to indicate how large the folio should be allocated. Moreover, we have decided against adding any mTHP interfaces for tmpfs in the previous discussion[1]. [1] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/
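[Baolin's point about the available hints can be seen in miniature with a memfd, which is backed by the internal shmem mount: the write path receives an explicit length up front, while a page fault names only a single address. A small sketch:]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;			/* one PMD-sized area: 2MB */
	int fd = memfd_create("demo", 0);
	char *buf = malloc(len);
	char *map;

	if (fd < 0 || !buf)
		return 1;
	memset(buf, 'x', len);

	/*
	 * write() carries an explicit length: the kernel sees a 2MB write
	 * up front, so a write-size-based policy can pick a matching
	 * large folio before copying any data.
	 */
	if (write(fd, buf, len) != (ssize_t)len)
		return 1;

	/*
	 * A fault through mmap() has no equivalent hint: it names a single
	 * address, so any folio sizing there must come from policy (e.g.
	 * the PMD-sized behaviour discussed above), not from an I/O length.
	 */
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	map[0] = 'y';				/* faults in one address */

	munmap(map, len);
	close(fd);
	free(buf);
	return 0;
}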