Hi All,

This series is an RFC that adds sysfs and kernel cmdline controls to configure
the set of allowed large folio sizes that can be used when allocating
file-memory for the page cache. As part of the control mechanism, it provides
for a special-case "preferred folio size for executable mappings" marker.

I'm trying to solve 2 separate problems with this series:

1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
approach for the change at [1]. Instead of hardcoding the preferred executable
folio size into the arch, user space can now select it. This decouples the arch
code and also makes the mechanism more generic; it can be bypassed (the default)
or any folio size can be set. For my use case, 64K is preferred, but I've also
heard from Willy of a use case where putting all text into 2M PMD-sized folios
is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
therefore faulting in all text ahead of time) to achieve that.

2. Reduce memory fragmentation in systems under high memory pressure (e.g.
Android): The theory goes that if all folios are 64K, then failure to allocate a
64K folio should become unlikely. But if the page cache is allocating lots of
different orders, with most allocations having an order below 64K (as is the
case today) then the ability to allocate 64K folios diminishes. By providing
control over the allowed set of folio sizes, we can tune to avoid crucial 64K
folio allocation failure. Additionally I've heard (second hand) of the need to
disable large folios in the page cache entirely due to latency concerns in some
settings. These controls allow all of this without kernel changes.

The value of (1) is clear and the performance improvements are documented in
patch 2. I don't yet have any data demonstrating the theory for (2) since I
can't reproduce the setup that Barry had at [2]. But my view is that by adding
these controls we will enable the community to explore further, in the same way
that the anon mTHP controls helped harden the understanding for anonymous
memory.

---
This series depends on the "mTHP allocation stats for file-backed memory"
series at [3], which itself applies on top of yesterday's mm-unstable
(650b6752c8a3). All mm selftests have been run; no regressions were observed.

[1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
[2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
[3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/

Thanks,
Ryan

Ryan Roberts (4):
  mm: mTHP user controls to configure pagecache large folio sizes
  mm: Introduce "always+exec" for mTHP file_enabled control
  mm: Override mTHP "enabled" defaults at kernel cmdline
  mm: Override mTHP "file_enabled" defaults at kernel cmdline

 .../admin-guide/kernel-parameters.txt      |  16 ++
 Documentation/admin-guide/mm/transhuge.rst |  66 +++++++-
 include/linux/huge_mm.h                    |  61 ++++---
 mm/filemap.c                               |  26 ++-
 mm/huge_memory.c                           | 158 +++++++++++++++++-
 mm/readahead.c                             |  43 ++++-
 6 files changed, 329 insertions(+), 41 deletions(-)

--
2.43.0
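
For illustration, a minimal userspace sketch of driving the proposed knobs
follows. The hugepages-<size>kB/file_enabled paths and the "always+exec" value
are the interface proposed by this RFC; they do not exist in mainline kernels,
so treat them as assumptions:

  #include <stdio.h>

  /* Write a value to a sysfs knob; returns 0 on success. */
  static int set_knob(const char *path, const char *val)
  {
  	FILE *f = fopen(path, "w");

  	if (!f)
  		return -1;
  	if (fprintf(f, "%s\n", val) < 0) {
  		fclose(f);
  		return -1;
  	}
  	return fclose(f);
  }

  int main(void)
  {
  	/* Allow 64K file folios and prefer them for executable mappings. */
  	if (set_knob("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/file_enabled",
  		     "always+exec"))
  		perror("hugepages-64kB/file_enabled");

  	/* Disable 16K file folios entirely. */
  	if (set_knob("/sys/kernel/mm/transparent_hugepage/hugepages-16kB/file_enabled",
  		     "never"))
  		perror("hugepages-16kB/file_enabled");

  	return 0;
  }
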
Ryan Roberts <ryan.roberts@arm.com> writes:

> Hi All,
>
> This series is an RFC that adds sysfs and kernel cmdline controls to configure
> the set of allowed large folio sizes that can be used when allocating
> file-memory for the page cache. As part of the control mechanism, it provides
> for a special-case "preferred folio size for executable mappings" marker.
>
> I'm trying to solve 2 separate problems with this series:

What happened to this patchkit? I was looking into how to efficiently
get larger pages for text transparently, and there doesn't seem
to be anything better than such a heuristic?

Thanks,
-Andi
On Thu, Dec 18, 2025 at 3:21 AM Andi Kleen <ak@kernel.org> wrote:
>
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
> > Hi All,
> >
> > This series is an RFC that adds sysfs and kernel cmdline controls to configure
> > the set of allowed large folio sizes that can be used when allocating
> > file-memory for the page cache. As part of the control mechanism, it provides
> > for a special-case "preferred folio size for executable mappings" marker.
> >
> > I'm trying to solve 2 separate problems with this series:
>
> What happened to this patchkit? I was looking into how to efficiently
> get larger pages for text transparently, and there doesn't seem
> to be anything better than such a heuristic?

I do feel that we need some controls over file folio sizes. It could
also be implemented using an eBPF-based approach, such as [1]?

[1] https://lore.kernel.org/linux-mm/20251026100159.6103-1-laoar.shao@gmail.com/

Thanks
Barry
On 17/07/2024 08:12, Ryan Roberts wrote:
> Hi All,
>
> This series is an RFC that adds sysfs and kernel cmdline controls to configure
> the set of allowed large folio sizes that can be used when allocating
> file-memory for the page cache. As part of the control mechanism, it provides
> for a special-case "preferred folio size for executable mappings" marker.
>
> I'm trying to solve 2 separate problems with this series:
>
> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> approach for the change at [1]. Instead of hardcoding the preferred executable
> folio size into the arch, user space can now select it. This decouples the arch
> code and also makes the mechanism more generic; it can be bypassed (the default)
> or any folio size can be set. For my use case, 64K is preferred, but I've also
> heard from Willy of a use case where putting all text into 2M PMD-sized folios
> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
> therefore faulting in all text ahead of time) to achieve that.

Just a polite bump on this; I'd really like to get something like this merged to
help reduce iTLB pressure. We had a discussion at the THP Cabal meeting a few
weeks back without solid conclusion. I haven't heard any concrete objections
yet, but also only a luke-warm reception. How can I move this forwards?

Thanks,
Ryan

>
> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> Android): The theory goes that if all folios are 64K, then failure to allocate a
> 64K folio should become unlikely. But if the page cache is allocating lots of
> different orders, with most allocations having an order below 64K (as is the
> case today) then the ability to allocate 64K folios diminishes. By providing control
> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> allocation failure. Additionally I've heard (second hand) of the need to disable
> large folios in the page cache entirely due to latency concerns in some
> settings. These controls allow all of this without kernel changes.
>
> The value of (1) is clear and the performance improvements are documented in
> patch 2. I don't yet have any data demonstrating the theory for (2) since I
> can't reproduce the setup that Barry had at [2]. But my view is that by adding
> these controls we will enable the community to explore further, in the same way
> that the anon mTHP controls helped harden the understanding for anonymous
> memory.
>
> ---
> This series depends on the "mTHP allocation stats for file-backed memory" series
> at [3], which itself applies on top of yesterday's mm-unstable (650b6752c8a3). All
> mm selftests have been run; no regressions were observed.
>
> [1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
> [2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
> [3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan
>
> Ryan Roberts (4):
>   mm: mTHP user controls to configure pagecache large folio sizes
>   mm: Introduce "always+exec" for mTHP file_enabled control
>   mm: Override mTHP "enabled" defaults at kernel cmdline
>   mm: Override mTHP "file_enabled" defaults at kernel cmdline
>
>  .../admin-guide/kernel-parameters.txt      |  16 ++
>  Documentation/admin-guide/mm/transhuge.rst |  66 +++++++-
>  include/linux/huge_mm.h                    |  61 ++++---
>  mm/filemap.c                               |  26 ++-
>  mm/huge_memory.c                           | 158 +++++++++++++++++-
>  mm/readahead.c                             |  43 ++++-
>  6 files changed, 329 insertions(+), 41 deletions(-)
>
> --
> 2.43.0
On Thu, Aug 8, 2024 at 10:27 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 17/07/2024 08:12, Ryan Roberts wrote:
> > Hi All,
> >
> > This series is an RFC that adds sysfs and kernel cmdline controls to configure
> > the set of allowed large folio sizes that can be used when allocating
> > file-memory for the page cache. As part of the control mechanism, it provides
> > for a special-case "preferred folio size for executable mappings" marker.
> >
> > I'm trying to solve 2 separate problems with this series:
> >
> > 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> > approach for the change at [1]. Instead of hardcoding the preferred executable
> > folio size into the arch, user space can now select it. This decouples the arch
> > code and also makes the mechanism more generic; it can be bypassed (the default)
> > or any folio size can be set. For my use case, 64K is preferred, but I've also
> > heard from Willy of a use case where putting all text into 2M PMD-sized folios
> > is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
> > therefore faulting in all text ahead of time) to achieve that.
>
> Just a polite bump on this; I'd really like to get something like this merged to
> help reduce iTLB pressure. We had a discussion at the THP Cabal meeting a few
> weeks back without solid conclusion. I haven't heard any concrete objections
> yet, but also only a luke-warm reception. How can I move this forwards?

Hi Ryan,

These requirements seem to apply to anon, swap, pagecache, and shmem to
some extent. While the swapin_enabled knob was rejected, the shmem_enabled
option is already in place.

I wonder if it's possible to use the existing 'enabled' setting across
all cases, as from an architectural perspective with cont-pte, pagecache
may not differ from anon. The demand for reducing page faults, LRU
overhead, etc., also seems quite similar.

I imagine that once Android's file systems support mTHP, we’ll uniformly enable
64KB for anon, swap, shmem, and page cache. It should then be sufficient to
enable all of them using a single knob:
'/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled'.

Is there anything that makes pagecache and shmem significantly different
from anon? In my Android case, they all seem the same. However, I assume
there might be other use cases where differentiating them is necessary?

>
> Thanks,
> Ryan
>
>
> >
> > 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> > Android): The theory goes that if all folios are 64K, then failure to allocate a
> > 64K folio should become unlikely. But if the page cache is allocating lots of
> > different orders, with most allocations having an order below 64K (as is the
> > case today) then the ability to allocate 64K folios diminishes. By providing control
> > over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> > allocation failure. Additionally I've heard (second hand) of the need to disable
> > large folios in the page cache entirely due to latency concerns in some
> > settings. These controls allow all of this without kernel changes.
> >
> > The value of (1) is clear and the performance improvements are documented in
> > patch 2. I don't yet have any data demonstrating the theory for (2) since I
> > can't reproduce the setup that Barry had at [2]. But my view is that by adding
> > these controls we will enable the community to explore further, in the same way
> > that the anon mTHP controls helped harden the understanding for anonymous
> > memory.
> >
> > ---
> > This series depends on the "mTHP allocation stats for file-backed memory" series
> > at [3], which itself applies on top of yesterday's mm-unstable (650b6752c8a3). All
> > mm selftests have been run; no regressions were observed.
> >
> > [1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
> > [2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
> > [3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/
> >
> > Thanks,
> > Ryan
> >
> > Ryan Roberts (4):
> >   mm: mTHP user controls to configure pagecache large folio sizes
> >   mm: Introduce "always+exec" for mTHP file_enabled control
> >   mm: Override mTHP "enabled" defaults at kernel cmdline
> >   mm: Override mTHP "file_enabled" defaults at kernel cmdline
> >
> >  .../admin-guide/kernel-parameters.txt      |  16 ++
> >  Documentation/admin-guide/mm/transhuge.rst |  66 +++++++-
> >  include/linux/huge_mm.h                    |  61 ++++---
> >  mm/filemap.c                               |  26 ++-
> >  mm/huge_memory.c                           | 158 +++++++++++++++++-
> >  mm/readahead.c                             |  43 ++++-
> >  6 files changed, 329 insertions(+), 41 deletions(-)
> >
> > --
> > 2.43.0
> >

Thanks
Barry
It's unusual that many emails sent days ago are resurfacing on LKML.
Please ignore them.
By the way, does anyone know what happened?

On Fri, Dec 6, 2024 at 5:12 AM Barry Song <baohua@kernel.org> wrote:
>
> On Thu, Aug 8, 2024 at 10:27 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 17/07/2024 08:12, Ryan Roberts wrote:
> > > Hi All,
> > >
> > > This series is an RFC that adds sysfs and kernel cmdline controls to configure
> > > the set of allowed large folio sizes that can be used when allocating
> > > file-memory for the page cache. As part of the control mechanism, it provides
> > > for a special-case "preferred folio size for executable mappings" marker.
> > >
> > > I'm trying to solve 2 separate problems with this series:
> > >
> > > 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> > > approach for the change at [1]. Instead of hardcoding the preferred executable
> > > folio size into the arch, user space can now select it. This decouples the arch
> > > code and also makes the mechanism more generic; it can be bypassed (the default)
> > > or any folio size can be set. For my use case, 64K is preferred, but I've also
> > > heard from Willy of a use case where putting all text into 2M PMD-sized folios
> > > is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
> > > therefore faulting in all text ahead of time) to achieve that.
> >
> > Just a polite bump on this; I'd really like to get something like this merged to
> > help reduce iTLB pressure. We had a discussion at the THP Cabal meeting a few
> > weeks back without solid conclusion. I haven't heard any concrete objections
> > yet, but also only a luke-warm reception. How can I move this forwards?
>
> Hi Ryan,
>
> These requirements seem to apply to anon, swap, pagecache, and shmem to
> some extent. While the swapin_enabled knob was rejected, the shmem_enabled
> option is already in place.
>
> I wonder if it's possible to use the existing 'enabled' setting across
> all cases, as from an architectural perspective with cont-pte, pagecache
> may not differ from anon. The demand for reducing page faults, LRU
> overhead, etc., also seems quite similar.
>
> I imagine that once Android's file systems support mTHP, we’ll uniformly enable
> 64KB for anon, swap, shmem, and page cache. It should then be sufficient to
> enable all of them using a single knob:
> '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled'.
>
> Is there anything that makes pagecache and shmem significantly different
> from anon? In my Android case, they all seem the same. However, I assume
> there might be other use cases where differentiating them is necessary?
>
> >
> > Thanks,
> > Ryan
> >
> >
> > >
> > > 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> > > Android): The theory goes that if all folios are 64K, then failure to allocate a
> > > 64K folio should become unlikely. But if the page cache is allocating lots of
> > > different orders, with most allocations having an order below 64K (as is the
> > > case today) then the ability to allocate 64K folios diminishes. By providing control
> > > over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> > > allocation failure. Additionally I've heard (second hand) of the need to disable
> > > large folios in the page cache entirely due to latency concerns in some
> > > settings. These controls allow all of this without kernel changes.
> > >
> > > The value of (1) is clear and the performance improvements are documented in
> > > patch 2. I don't yet have any data demonstrating the theory for (2) since I
> > > can't reproduce the setup that Barry had at [2]. But my view is that by adding
> > > these controls we will enable the community to explore further, in the same way
> > > that the anon mTHP controls helped harden the understanding for anonymous
> > > memory.
> > >
> > > ---
> > > This series depends on the "mTHP allocation stats for file-backed memory" series
> > > at [3], which itself applies on top of yesterday's mm-unstable (650b6752c8a3). All
> > > mm selftests have been run; no regressions were observed.
> > >
> > > [1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
> > > [2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
> > > [3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/
> > >
> > > Thanks,
> > > Ryan
> > >
> > > Ryan Roberts (4):
> > >   mm: mTHP user controls to configure pagecache large folio sizes
> > >   mm: Introduce "always+exec" for mTHP file_enabled control
> > >   mm: Override mTHP "enabled" defaults at kernel cmdline
> > >   mm: Override mTHP "file_enabled" defaults at kernel cmdline
> > >
> > >  .../admin-guide/kernel-parameters.txt      |  16 ++
> > >  Documentation/admin-guide/mm/transhuge.rst |  66 +++++++-
> > >  include/linux/huge_mm.h                    |  61 ++++---
> > >  mm/filemap.c                               |  26 ++-
> > >  mm/huge_memory.c                           | 158 +++++++++++++++++-
> > >  mm/readahead.c                             |  43 ++++-
> > >  6 files changed, 329 insertions(+), 41 deletions(-)
> > >
> > > --
> > > 2.43.0
> > >
> >
>
> Thanks
> Barry
>
On 2024/12/6 13:09, Barry Song wrote:
> It's unusual that many emails sent days ago are resurfacing on LKML.
> Please ignore them.
> By the way, does anyone know what happened?

I also received many previous emails, and it seems that a change to the
mailing-list spam filtering rules by the owner of linux-mm@kvack.org caused
this. See:

https://lore.kernel.org/all/20241205154213.GA5247@kvack.org/
On 19/09/2024 09:20, Barry Song wrote:
> On Thu, Aug 8, 2024 at 10:27 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 17/07/2024 08:12, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> This series is an RFC that adds sysfs and kernel cmdline controls to configure
>>> the set of allowed large folio sizes that can be used when allocating
>>> file-memory for the page cache. As part of the control mechanism, it provides
>>> for a special-case "preferred folio size for executable mappings" marker.
>>>
>>> I'm trying to solve 2 separate problems with this series:
>>>
>>> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
>>> approach for the change at [1]. Instead of hardcoding the preferred executable
>>> folio size into the arch, user space can now select it. This decouples the arch
>>> code and also makes the mechanism more generic; it can be bypassed (the default)
>>> or any folio size can be set. For my use case, 64K is preferred, but I've also
>>> heard from Willy of a use case where putting all text into 2M PMD-sized folios
>>> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
>>> therefore faulting in all text ahead of time) to achieve that.
>>
>> Just a polite bump on this; I'd really like to get something like this merged to
>> help reduce iTLB pressure. We had a discussion at the THP Cabal meeting a few
>> weeks back without solid conclusion. I haven't heard any concrete objections
>> yet, but also only a luke-warm reception. How can I move this forwards?
>
> Hi Ryan,
>
> These requirements seem to apply to anon, swap, pagecache, and shmem to
> some extent. While the swapin_enabled knob was rejected, the shmem_enabled
> option is already in place.
>
> I wonder if it's possible to use the existing 'enabled' setting across
> all cases, as
> from an architectural perspective with cont-pte, pagecache may not differ from
> anon. The demand for reducing page faults, LRU overhead, etc., also seems
> quite similar.
>
> I imagine that once Android's file systems support mTHP, we’ll uniformly enable
> 64KB for anon, swap, shmem, and page cache. It should then be sufficient to
> enable all of them using a single knob:
> '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled'.
>
> Is there anything that makes pagecache and shmem significantly different
> from anon? In my Android case, they all seem the same. However, I assume
> there might be other use cases where differentiating them is necessary?

For anon vs shmem, we were just following the precedent set by the legacy PMD
controls, which separated them. I vaguely recall David explaining why there are
separate controls but don't recall the exact reason; I believe there was some
use case where anon THP made sense, but shmem THP was problematic for some
reason. Note too, that the controls expose different options; anon has {always,
never, madvise}, shmem has {always, never, advise (no m; it applies to fadvise
too), within_size, force, deny}. So I guess if the extra shmem options are
important then it makes sense to have a separate control.

For pagecache vs anon, I'm not sure it makes sense to tie these to the same
control. We have readahead information to help us make an educated guess at the
folio size we should use (currently we start at order-2 and increase by 2 orders
every time we hit the readahead marker) and it's much easier to drop pagecache
folios under memory pressure. So by default, I think most/all orders would be
enabled for pagecache. But for anon, things are harder. In the common case,
likely we only want 2M when madvised, and 64K always (and possibly 16K always).
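
To make that ramp concrete, here is a simplified userspace model; the order-9
cap and the clamping of the ramp to an enabled-order mask are illustrative
assumptions for this sketch, not the actual mm/readahead.c logic:

  #include <stdio.h>

  #define MAX_PAGECACHE_ORDER	9	/* assumed cap, e.g. PMD order with 4K pages */

  /* Step the readahead order up by 2, clamp it, then fall back to the
   * largest enabled order at or below the target. */
  static unsigned int next_ra_order(unsigned int cur, unsigned int enabled_mask)
  {
  	unsigned int order = cur + 2;

  	if (order > MAX_PAGECACHE_ORDER)
  		order = MAX_PAGECACHE_ORDER;
  	while (order > 0 && !(enabled_mask & (1u << order)))
  		order--;
  	return order;
  }

  int main(void)
  {
  	/* Pretend only 4K, 16K and 64K are enabled (orders 0, 2, 4). */
  	unsigned int enabled = (1u << 0) | (1u << 2) | (1u << 4);
  	unsigned int order = 2;		/* initial order for new readahead */
  	int hit;

  	for (hit = 0; hit < 4; hit++) {
  		printf("marker hit %d -> allocate order-%u folios\n", hit, order);
  		order = next_ra_order(order, enabled);
  	}
  	return 0;
  }
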
Talking with Willy today, his preference is to not expose any controls for
pagecache at all, and let the architecture hint the preferred folio size for
code - basically how I did it at [1] - linked in the original post. This is very
simple and exposes no user controls so could be easily modified over time as we
get more data.
Trouble is nobody seemed willing to R-b the first approach. So perhaps we're
stuck waiting for Android's FSs to support large folios so we can start
benchmarking the real-world gains?
Thanks,
Ryan
>
>>
>> Thanks,
>> Ryan
>>
>>
>>>
>>> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
>>> Android): The theory goes that if all folios are 64K, then failure to allocate a
>>> 64K folio should become unlikely. But if the page cache is allocating lots of
>>> different orders, with most allocations having an order below 64K (as is the
>>> case today) then the ability to allocate 64K folios diminishes. By providing control
>>> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
>>> allocation failure. Additionally I've heard (second hand) of the need to disable
>>> large folios in the page cache entirely due to latency concerns in some
>>> settings. These controls allow all of this without kernel changes.
>>>
>>> The value of (1) is clear and the performance improvements are documented in
>>> patch 2. I don't yet have any data demonstrating the theory for (2) since I
>>> can't reproduce the setup that Barry had at [2]. But my view is that by adding
>>> these controls we will enable the community to explore further, in the same way
>>> that the anon mTHP controls helped harden the understanding for anonymous
>>> memory.
>>>
>>> ---
>>> This series depends on the "mTHP allocation stats for file-backed memory" series
>>> at [3], which itself applies on top of yesterday's mm-unstable (650b6752c8a3). All
>>> mm selftests have been run; no regressions were observed.
>>>
>>> [1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
>>> [2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
>>> [3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/
>>>
>>> Thanks,
>>> Ryan
>>>
>>> Ryan Roberts (4):
>>> mm: mTHP user controls to configure pagecache large folio sizes
>>> mm: Introduce "always+exec" for mTHP file_enabled control
>>> mm: Override mTHP "enabled" defaults at kernel cmdline
>>> mm: Override mTHP "file_enabled" defaults at kernel cmdline
>>>
>>> .../admin-guide/kernel-parameters.txt | 16 ++
>>> Documentation/admin-guide/mm/transhuge.rst | 66 +++++++-
>>> include/linux/huge_mm.h | 61 ++++---
>>> mm/filemap.c | 26 ++-
>>> mm/huge_memory.c | 158 +++++++++++++++++-
>>> mm/readahead.c | 43 ++++-
>>> 6 files changed, 329 insertions(+), 41 deletions(-)
>>>
>>> --
>>> 2.43.0
>>>
>>
>
> Thanks
> Barry
On 17.07.24 09:12, Ryan Roberts wrote:
> Hi All,
>
> This series is an RFC that adds sysfs and kernel cmdline controls to configure
> the set of allowed large folio sizes that can be used when allocating
> file-memory for the page cache. As part of the control mechanism, it provides
> for a special-case "preferred folio size for executable mappings" marker.
>
> I'm trying to solve 2 separate problems with this series:
>
> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> approach for the change at [1]. Instead of hardcoding the preferred executable
> folio size into the arch, user space can now select it. This decouples the arch
> code and also makes the mechanism more generic; it can be bypassed (the default)
> or any folio size can be set. For my use case, 64K is preferred, but I've also
> heard from Willy of a use case where putting all text into 2M PMD-sized folios
> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
> therefore faulting in all text ahead of time) to achieve that.
>
> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> Android): The theory goes that if all folios are 64K, then failure to allocate a
> 64K folio should become unlikely. But if the page cache is allocating lots of
> different orders, with most allocations having an order below 64K (as is the
> case today) then the ability to allocate 64K folios diminishes. By providing control
> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> allocation failure. Additionally I've heard (second hand) of the need to disable
> large folios in the page cache entirely due to latency concerns in some
> settings. These controls allow all of this without kernel changes.
>
> The value of (1) is clear and the performance improvements are documented in
> patch 2. I don't yet have any data demonstrating the theory for (2) since I
> can't reproduce the setup that Barry had at [2]. But my view is that by adding
> these controls we will enable the community to explore further, in the same way
> that the anon mTHP controls helped harden the understanding for anonymous
> memory.
>
> ---

How would this interact with other requirements we get from the filesystem
(for example, because of the device) [1]?

Assuming a device has a filesystem with a min order of X, but we disable
anything >= X, how would we combine that configuration/information?

[1] https://lore.kernel.org/all/20240715094457.452836-2-kernel@pankajraghav.com/T/#u

--
Cheers,

David / dhildenb
On 17/07/2024 11:31, David Hildenbrand wrote:
> On 17.07.24 09:12, Ryan Roberts wrote:
>> Hi All,
>>
>> This series is an RFC that adds sysfs and kernel cmdline controls to configure
>> the set of allowed large folio sizes that can be used when allocating
>> file-memory for the page cache. As part of the control mechanism, it provides
>> for a special-case "preferred folio size for executable mappings" marker.
>>
>> I'm trying to solve 2 separate problems with this series:
>>
>> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
>> approach for the change at [1]. Instead of hardcoding the preferred executable
>> folio size into the arch, user space can now select it. This decouples the arch
>> code and also makes the mechanism more generic; it can be bypassed (the default)
>> or any folio size can be set. For my use case, 64K is preferred, but I've also
>> heard from Willy of a use case where putting all text into 2M PMD-sized folios
>> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
>> therefore faulting in all text ahead of time) to achieve that.
>>
>> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
>> Android): The theory goes that if all folios are 64K, then failure to allocate a
>> 64K folio should become unlikely. But if the page cache is allocating lots of
>> different orders, with most allocations having an order below 64K (as is the
>> case today) then the ability to allocate 64K folios diminishes. By providing control
>> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
>> allocation failure. Additionally I've heard (second hand) of the need to disable
>> large folios in the page cache entirely due to latency concerns in some
>> settings. These controls allow all of this without kernel changes.
>>
>> The value of (1) is clear and the performance improvements are documented in
>> patch 2. I don't yet have any data demonstrating the theory for (2) since I
>> can't reproduce the setup that Barry had at [2]. But my view is that by adding
>> these controls we will enable the community to explore further, in the same way
>> that the anon mTHP controls helped harden the understanding for anonymous
>> memory.
>>
>> ---
>
> How would this interact with other requirements we get from the filesystem
> (for example, because of the device) [1]?
>
> Assuming a device has a filesystem with a min order of X, but we disable
> anything >= X, how would we combine that configuration/information?

Currently order-0 is implicitly the "always-on" fallback order. My thinking was
that with [1], the specified min order just becomes that "always-on" fallback
order.

Today:

  orders = file_orders_always() | BIT(0);

Tomorrow:

  orders = (file_orders_always() & ~(BIT(min_order) - 1)) | BIT(min_order);

That does mean that in this case, a user-disabled order could still be used. So
the controls are really hints rather than definitive commands.

>
> [1]
> https://lore.kernel.org/all/20240715094457.452836-2-kernel@pankajraghav.com/T/#u
>
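
To see what those two masks evaluate to, here is a small self-contained sketch;
file_orders_always() is stubbed with a hypothetical value rather than reading
real per-size sysfs state:

  #include <stdio.h>

  #define BIT(n)	(1ul << (n))

  /* Hypothetical stub: pretend the user enabled 16K and 64K file folios
   * (orders 2 and 4 with 4K base pages). */
  static unsigned long file_orders_always(void)
  {
  	return BIT(2) | BIT(4);
  }

  int main(void)
  {
  	unsigned int min_order = 2;	/* e.g. imposed by an LBS filesystem */

  	/* Today: order-0 is the implicit always-on fallback. */
  	unsigned long today = file_orders_always() | BIT(0);

  	/* Tomorrow: orders below min_order are masked off and min_order
  	 * becomes the always-on fallback, even if the user disabled it. */
  	unsigned long tomorrow = (file_orders_always() & ~(BIT(min_order) - 1)) |
  				 BIT(min_order);

  	printf("today:    %#lx (orders 0, 2, 4)\n", today);	/* 0x15 */
  	printf("tomorrow: %#lx (orders 2, 4)\n", tomorrow);	/* 0x14 */
  	return 0;
  }
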
On Wed, Jul 17, 2024 at 11:45:48AM GMT, Ryan Roberts wrote:
> On 17/07/2024 11:31, David Hildenbrand wrote:
> > On 17.07.24 09:12, Ryan Roberts wrote:
> >> Hi All,
> >>
> >> This series is an RFC that adds sysfs and kernel cmdline controls to configure
> >> the set of allowed large folio sizes that can be used when allocating
> >> file-memory for the page cache. As part of the control mechanism, it provides
> >> for a special-case "preferred folio size for executable mappings" marker.
> >>
> >> I'm trying to solve 2 separate problems with this series:
> >>
> >> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> >> approach for the change at [1]. Instead of hardcoding the preferred executable
> >> folio size into the arch, user space can now select it. This decouples the arch
> >> code and also makes the mechanism more generic; it can be bypassed (the default)
> >> or any folio size can be set. For my use case, 64K is preferred, but I've also
> >> heard from Willy of a use case where putting all text into 2M PMD-sized folios
> >> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
> >> therefore faulting in all text ahead of time) to achieve that.
> >>
> >> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> >> Android): The theory goes that if all folios are 64K, then failure to allocate a
> >> 64K folio should become unlikely. But if the page cache is allocating lots of
> >> different orders, with most allocations having an order below 64K (as is the
> >> case today) then the ability to allocate 64K folios diminishes. By providing control
> >> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> >> allocation failure. Additionally I've heard (second hand) of the need to disable
> >> large folios in the page cache entirely due to latency concerns in some
> >> settings. These controls allow all of this without kernel changes.
> >>
> >> The value of (1) is clear and the performance improvements are documented in
> >> patch 2. I don't yet have any data demonstrating the theory for (2) since I
> >> can't reproduce the setup that Barry had at [2]. But my view is that by adding
> >> these controls we will enable the community to explore further, in the same way
> >> that the anon mTHP controls helped harden the understanding for anonymous
> >> memory.
> >>
> >> ---
> >
> > How would this interact with other requirements we get from the filesystem
> > (for example, because of the device) [1]?
> >
> > Assuming a device has a filesystem with a min order of X, but we disable
> > anything >= X, how would we combine that configuration/information?
>
> Currently order-0 is implicitly the "always-on" fallback order. My thinking was
> that with [1], the specified min order just becomes that "always-on" fallback
> order.
>
> Today:
>
>   orders = file_orders_always() | BIT(0);
>
> Tomorrow:
>
>   orders = (file_orders_always() & ~(BIT(min_order) - 1)) | BIT(min_order);
>
> That does mean that in this case, a user-disabled order could still be used. So
> the controls are really hints rather than definitive commands.

In the scenario where a min order is not enabled in
hugepages-<size>kB/file_enabled, will the user still be allowed to
automatically mkfs/mount with blocksize=min_order, and will sysfs reflect
this? Or, since it's a hint, will it remain hidden but still allow mkfs/mount
to proceed?

>
> >
> > [1]
> > https://lore.kernel.org/all/20240715094457.452836-2-kernel@pankajraghav.com/T/#u
> >
On 22/07/2024 10:35, Daniel Gomez wrote:
> On Wed, Jul 17, 2024 at 11:45:48AM GMT, Ryan Roberts wrote:
>> On 17/07/2024 11:31, David Hildenbrand wrote:
>>> On 17.07.24 09:12, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> This series is an RFC that adds sysfs and kernel cmdline controls to configure
>>>> the set of allowed large folio sizes that can be used when allocating
>>>> file-memory for the page cache. As part of the control mechanism, it provides
>>>> for a special-case "preferred folio size for executable mappings" marker.
>>>>
>>>> I'm trying to solve 2 separate problems with this series:
>>>>
>>>> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
>>>> approach for the change at [1]. Instead of hardcoding the preferred executable
>>>> folio size into the arch, user space can now select it. This decouples the arch
>>>> code and also makes the mechanism more generic; it can be bypassed (the default)
>>>> or any folio size can be set. For my use case, 64K is preferred, but I've also
>>>> heard from Willy of a use case where putting all text into 2M PMD-sized folios
>>>> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
>>>> therefore faulting in all text ahead of time) to achieve that.
>>>>
>>>> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
>>>> Android): The theory goes that if all folios are 64K, then failure to allocate a
>>>> 64K folio should become unlikely. But if the page cache is allocating lots of
>>>> different orders, with most allocations having an order below 64K (as is the
>>>> case today) then the ability to allocate 64K folios diminishes. By providing control
>>>> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
>>>> allocation failure. Additionally I've heard (second hand) of the need to disable
>>>> large folios in the page cache entirely due to latency concerns in some
>>>> settings. These controls allow all of this without kernel changes.
>>>>
>>>> The value of (1) is clear and the performance improvements are documented in
>>>> patch 2. I don't yet have any data demonstrating the theory for (2) since I
>>>> can't reproduce the setup that Barry had at [2]. But my view is that by adding
>>>> these controls we will enable the community to explore further, in the same way
>>>> that the anon mTHP controls helped harden the understanding for anonymous
>>>> memory.
>>>>
>>>> ---
>>>
>>> How would this interact with other requirements we get from the filesystem
>>> (for example, because of the device) [1]?
>>>
>>> Assuming a device has a filesystem with a min order of X, but we disable
>>> anything >= X, how would we combine that configuration/information?
>>
>> Currently order-0 is implicitly the "always-on" fallback order. My thinking was
>> that with [1], the specified min order just becomes that "always-on" fallback
>> order.
>>
>> Today:
>>
>>   orders = file_orders_always() | BIT(0);
>>
>> Tomorrow:
>>
>>   orders = (file_orders_always() & ~(BIT(min_order) - 1)) | BIT(min_order);
>>
>> That does mean that in this case, a user-disabled order could still be used. So
>> the controls are really hints rather than definitive commands.
>
> In the scenario where a min order is not enabled in
> hugepages-<size>kB/file_enabled, will the user still be allowed to
> automatically mkfs/mount with blocksize=min_order, and will sysfs reflect
> this? Or, since it's a hint, will it remain hidden but still allow mkfs/mount
> to proceed?

My proposal is that the controls are hints, and they would not block mounting a
file system.

As an example, the user may set
`/sys/kernel/mm/transparent_hugepage/hugepages-16kB/file_enable` to `never`. In
this case the kernel would never pick a 16K folio to back a file whose minimum
folio size is not 16K. If the file's minimum folio size is 16K then it would
still allocate that folio size in the fallback case, after trying any
appropriate bigger folio sizes that are set to `always`.

Thanks,
Ryan

>
>>
>>> [1]
>>> https://lore.kernel.org/all/20240715094457.452836-2-kernel@pankajraghav.com/T/#u
>>>
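
A minimal sketch of that hint-plus-fallback behaviour, assuming orders are
tried from the largest enabled downwards and the filesystem's min order is
always permitted (the helper is illustrative, not the kernel implementation):

  #include <stdio.h>

  #define BIT(n)	(1u << (n))

  /* Walk enabled orders from largest to smallest; min_order is the
   * always-on fallback, allocated even if its knob says "never". */
  static int pick_order(unsigned int enabled, unsigned int max_order,
  		      unsigned int min_order)
  {
  	int order;

  	for (order = max_order; order > (int)min_order; order--)
  		if (enabled & BIT(order))
  			return order;	/* try to allocate this order */
  	return min_order;
  }

  int main(void)
  {
  	/* User enabled only 64K (order 4); 16K (order 2) is set to "never",
  	 * but the filesystem's block size imposes a 16K min order. */
  	unsigned int enabled = BIT(4);
  	unsigned int fs_min_order = 2;

  	printf("first try: order-%d\n", pick_order(enabled, 9, fs_min_order));
  	/* If that allocation failed, the fallback is the min order: */
  	printf("fallback:  order-%d\n",
  	       pick_order(enabled & ~BIT(4), 9, fs_min_order));
  	return 0;
  }
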
On 17.07.24 12:45, Ryan Roberts wrote:
> On 17/07/2024 11:31, David Hildenbrand wrote:
>> On 17.07.24 09:12, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> This series is an RFC that adds sysfs and kernel cmdline controls to configure
>>> the set of allowed large folio sizes that can be used when allocating
>>> file-memory for the page cache. As part of the control mechanism, it provides
>>> for a special-case "preferred folio size for executable mappings" marker.
>>>
>>> I'm trying to solve 2 separate problems with this series:
>>>
>>> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
>>> approach for the change at [1]. Instead of hardcoding the preferred executable
>>> folio size into the arch, user space can now select it. This decouples the arch
>>> code and also makes the mechanism more generic; it can be bypassed (the default)
>>> or any folio size can be set. For my use case, 64K is preferred, but I've also
>>> heard from Willy of a use case where putting all text into 2M PMD-sized folios
>>> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
>>> therefore faulting in all text ahead of time) to achieve that.
>>>
>>> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
>>> Android): The theory goes that if all folios are 64K, then failure to allocate a
>>> 64K folio should become unlikely. But if the page cache is allocating lots of
>>> different orders, with most allocations having an order below 64K (as is the
>>> case today) then the ability to allocate 64K folios diminishes. By providing control
>>> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
>>> allocation failure. Additionally I've heard (second hand) of the need to disable
>>> large folios in the page cache entirely due to latency concerns in some
>>> settings. These controls allow all of this without kernel changes.
>>>
>>> The value of (1) is clear and the performance improvements are documented in
>>> patch 2. I don't yet have any data demonstrating the theory for (2) since I
>>> can't reproduce the setup that Barry had at [2]. But my view is that by adding
>>> these controls we will enable the community to explore further, in the same way
>>> that the anon mTHP controls helped harden the understanding for anonymous
>>> memory.
>>>
>>> ---
>>
>> How would this interact with other requirements we get from the filesystem
>> (for example, because of the device) [1]?
>>
>> Assuming a device has a filesystem with a min order of X, but we disable
>> anything >= X, how would we combine that configuration/information?
>
> Currently order-0 is implicitly the "always-on" fallback order. My thinking was
> that with [1], the specified min order just becomes that "always-on" fallback
> order.
>
> Today:
>
>   orders = file_orders_always() | BIT(0);
>
> Tomorrow:
>
>   orders = (file_orders_always() & ~(BIT(min_order) - 1)) | BIT(min_order);
>
> That does mean that in this case, a user-disabled order could still be used. So
> the controls are really hints rather than definitive commands.

Okay, because that's a difference from order-0, which is -- as you note --
always-on (not even a toggle).

Staring at patch #1, you use the name "file_enable". That might indeed cause
some confusion. Thinking out loud, I wonder if a different terminology could
better express the semantics. Hm ... but maybe it only would have to be
documented.

Thanks for the details.

--
Cheers,

David / dhildenb