The following series provides khugepaged with the capability to collapse
anonymous memory regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then, during the PMD scan, we use a bitmap to track chunks of
pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the
PMD scan is done, we do binary recursion on the bitmap to find the optimal
mTHP sizes for the PMD range. The restriction on max_ptes_none is removed
during the scan, to make sure we account for the whole PMD range. When no
mTHP size is enabled, the legacy behavior of khugepaged is maintained.
max_ptes_none will be scaled by the attempted collapse order to determine
how full a mTHP must be to be eligible for the collapse to occur (an
illustrative sketch of this scaling follows the cover letter). If a mTHP
collapse is attempted but the range contains swapped-out or shared pages,
we don't perform the collapse. It is now also possible to collapse to
mTHPs without requiring the PMD THP size to be enabled.

With the default max_ptes_none=511, the code should keep most of its
original behavior. When enabling multiple adjacent (m)THP sizes we need to
set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
experience collapse "creep" and constantly promote mTHPs to the next
available size. This is due to the fact that a collapse will introduce at
least 2x the number of pages, and on a future scan will satisfy the
promotion condition once again.

Patch 1: Refactor/rename hpage_collapse
Patch 2: Some refactoring to combine madvise_collapse and khugepaged
Patch 3-5: Generalize khugepaged functions for arbitrary orders
Patch 6-8: The mTHP patches
Patch 9-10: Allow khugepaged to operate without PMD enabled
Patch 11-12: Tracing/stats
Patch 13: Documentation

--------- Testing ---------
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- I created a test script that I used to push khugepaged to its limits
  while monitoring a number of stats and tracepoints. The code is available
  here[1] (run in legacy mode for these changes and set mTHP sizes to
  inherit). The summary from my testing was that there was no significant
  regression noticed through this test. In some cases my changes had better
  collapse latencies, and were able to scan more pages in the same amount
  of time/work, but for the most part the results were consistent.
- redis testing. I tested these changes along with my defer changes (see
  the follow-up post [4] for more details). We've decided to get the mTHP
  changes merged first before attempting the defer series.
- Some basic testing on 64k page size.
- Lots of general use.

V10 Changes:
- Fixed bug where bitmap tracking was off by one, leading to weird behavior
  in some test cases.
- Track mTHP stats for PMD order too (Baolin)
- Indentation cleanup (David)
- Add review/ack tags
- Improve the control flow, readability, and result handling in
  collapse_scan_bitmap (Baolin)
- Indentation nits/cleanup (David)
- Converted u8 orders to unsigned int to be consistent with other folio
  callers (David)
- Handled conflicts with Dev's work on PTE batching
- Changed SWAP/SHARED restriction comments to a TODO comment (David)
- Squashed the main mTHP patch and the introduce-bitmap patch (David)
- Other small nits

V9 Changes: [3]
- Drop madvise_collapse support [2]. Further discussion needed.
- Add documentation entries for new stats (Baolin)
- Fix missing stat update (MTHP_STAT_COLLAPSE_EXCEED_SWAP) that was
  accidentally dropped in v7 (Baolin)
- Fix mishandled conflict noted in v8 (merged into wrong commit)
- Change rename from khugepaged to collapse (Dev)

V8 Changes:
- Fix mishandled conflict with shmem config changes (Baolin)
- Add Baolin's patches for allowing collapse without PMD enabled
- Add additional patch for allowing madvise_collapse without PMD enabled
- Documentation nits (Randy)
- Simplify SCAN_ANY_PROCESS lock jumbling (Liam)
- Add a BUG_ON to the mTHP collapse similar to PMD (Dev)
- Remove doc comment about khugepaged PMD-only limitation (Dev)
- Change revalidation function to accept multiple orders
- Handled conflicts introduced by Lorenzo's madvise changes

V7 (RESEND)

V6 Changes:
- Don't release the anon_vma_lock early (like in the PMD case), as not all
  pages are isolated.
- Define the PTE as null to avoid an uninitialized condition
- Minor nits and newline cleanup
- Make sure to unmap and unlock the pte for the swapin case
- Change the revalidation to always check the PMD order (as this will make
  sure that no other VMA spans it)

V5 Changes:
- Switched the order of patches 1 and 2
- Fixed some edge cases in the unified madvise_collapse and khugepaged
- Explained the "creep" some more in the docs
- Fix EXCEED_SHARED vs EXCEED_SWAP accounting issue
- Fix potential highmem issue caused by an early unmap of the PTE

V4 Changes:
- Rebased onto mm-unstable
- Small changes to Documentation

V3 Changes:
- Corrected legacy behavior for khugepaged and madvise_collapse
- Added proper mTHP stat tracking
- Minor changes to prevent a nested lock on non-split-lock arches
- Took Dev's version of alloc_charge_folio as it has the proper stats
- Skip cases where trying to collapse to a lower order would still fail
- Fixed cases where the bitmap was not being updated properly
- Moved Documentation update to this series instead of the defer set
- Minor bugs discovered during testing and review
- Minor "nit" cleanup

V2 Changes:
- Minor bug fixes discovered during review and testing
- Removed dynamic allocations for bitmaps, and made them stack based
- Adjusted bitmap offset from u8 to u16 to support 64k page size
- Updated trace events to include collapsing order info
- Scaled max_ptes_none by order rather than scaling to a 0-100 scale
- No longer require a chunk to be fully utilized before setting the bit.
  Use the same max_ptes_none scaling principle to achieve this.
- Skip mTHP collapse that requires swapin or shared handling. This helps
  prevent some of the "creep" that was discovered in v1.

A big thanks to everyone that has reviewed, tested, and participated in the
development process. It's been a great experience working with all of you
on this long endeavour.
[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/all/23b8ad10-cd1f-45df-a25c-78d01c8af44f@redhat.com/
[3] - https://lore.kernel.org/lkml/20250714003207.113275-1-npache@redhat.com/
[4] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

Baolin Wang (2):
  khugepaged: enable collapsing mTHPs even when PMD THPs are disabled
  khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs

Dev Jain (1):
  khugepaged: generalize alloc_charge_folio()

Nico Pache (10):
  khugepaged: rename hpage_collapse_* to collapse_*
  introduce collapse_single_pmd to unify khugepaged and madvise_collapse
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: add mTHP support
  khugepaged: skip collapsing mTHP to smaller orders
  khugepaged: avoid unnecessary mTHP collapse attempts
  khugepaged: improve tracepoints for mTHP orders
  khugepaged: add per-order mTHP khugepaged stats
  Documentation: mm: update the admin guide for mTHP collapse

 Documentation/admin-guide/mm/transhuge.rst |  44 +-
 include/linux/huge_mm.h                    |   5 +
 include/linux/khugepaged.h                 |   4 +
 include/trace/events/huge_memory.h         |  34 +-
 mm/huge_memory.c                           |  11 +
 mm/khugepaged.c                            | 552 +++++++++++++++------
 6 files changed, 468 insertions(+), 182 deletions(-)

-- 
2.50.1
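A minimal userspace sketch of the order-based scaling described in the cover
letter, assuming a shift-based formula; the helper name scaled_max_ptes_none(),
the HPAGE_PMD_ORDER define, and the printed table are illustrative only and
may differ from whatever helper mm/khugepaged.c actually grows.

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* 512 PTEs per PMD with 4K pages */

/* Assumed scaling: shift the PMD-wide limit down by the order gap. */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	unsigned int settings[] = { 511, 255, 0 };

	for (unsigned int i = 0; i < 3; i++) {
		printf("max_ptes_none=%u\n", settings[i]);
		for (unsigned int order = 2; order <= HPAGE_PMD_ORDER; order++) {
			unsigned int nr_ptes = 1u << order;
			unsigned int none_ok = scaled_max_ptes_none(settings[i], order);

			printf("  order %u (%3u PTEs): up to %3u none -> need %3u utilized\n",
			       order, nr_ptes, none_ok, nr_ptes - none_ok);
		}
	}
	return 0;
}

With 511, every order tolerates all-but-one none PTE, so every enabled size is
trivially eligible; with 255, each sub-PMD order must be more than half
utilized, which is why the cover letter recommends max_ptes_none <= 255 when
multiple adjacent sizes are enabled.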
On Tue, 19 Aug 2025 07:41:52 -0600 Nico Pache <npache@redhat.com> wrote:

> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.
>
> ...
>
> - I created a test script that I used to push khugepaged to its limits
>   while monitoring a number of stats and tracepoints. The code is
>   available here[1] (Run in legacy mode for these changes and set mthp
>   sizes to inherit)

Could this be turned into something in tools/testing/selftests/mm/?

> V10 Changes:

I'll add this to mm-new, thanks. I'll suppress the usual emails.
On Tue, Aug 19, 2025 at 3:55 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Tue, 19 Aug 2025 07:41:52 -0600 Nico Pache <npache@redhat.com> wrote: > > > The following series provides khugepaged with the capability to collapse > > anonymous memory regions to mTHPs. > > > > ... > > > > - I created a test script that I used to push khugepaged to its limits > > while monitoring a number of stats and tracepoints. The code is > > available here[1] (Run in legacy mode for these changes and set mthp > > sizes to inherit) > > Could this be turned into something in tools/testing/selftests/mm/? Yep! I was actually working on some selftests for this before I hit a weird bug during some of my testing, which took precedence over that. I was planning on sending a separate series for the testing. One of the pain points was that selftests helpers were set up for PMDs, but I think Baolin just cleaned that up in his khugepaged mTHP shmem series (still need to review). So it should be a lot easier to implement now. > > > V10 Changes: > > I'll add this to mm-new, thanks. I'll suppress the usual emails. Thank you :) -- Nico >
On 19/08/25 7:11 pm, Nico Pache wrote: > The following series provides khugepaged with the capability to collapse > anonymous memory regions to mTHPs. > > To achieve this we generalize the khugepaged functions to no longer depend > on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of > pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the > PMD scan is done, we do binary recursion on the bitmap to find the optimal > mTHP sizes for the PMD range. The restriction on max_ptes_none is removed > during the scan, to make sure we account for the whole PMD range. When no > mTHP size is enabled, the legacy behavior of khugepaged is maintained. > max_ptes_none will be scaled by the attempted collapse order to determine > how full a mTHP must be to be eligible for the collapse to occur. If a > mTHP collapse is attempted, but contains swapped out, or shared pages, we > don't perform the collapse. It is now also possible to collapse to mTHPs > without requiring the PMD THP size to be enabled. > > With the default max_ptes_none=511, the code should keep its most of its > original behavior. When enabling multiple adjacent (m)THP sizes we need to > set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will > experience collapse "creep" and constantly promote mTHPs to the next > available size. This is due the fact that a collapse will introduce at > least 2x the number of pages, and on a future scan will satisfy the > promotion condition once again. > > Patch 1: Refactor/rename hpage_collapse > Patch 2: Some refactoring to combine madvise_collapse and khugepaged > Patch 3-5: Generalize khugepaged functions for arbitrary orders > Patch 6-8: The mTHP patches > Patch 9-10: Allow khugepaged to operate without PMD enabled > Patch 11-12: Tracing/stats > Patch 13: Documentation > > For the next version, it will be really great if you can mention the lore links referencing important ideas guiding the evolution of the algorithm - say, a policy decision we make. (I frequently do that albeit I think I over-do it :)) I am asking because I am completely lost on the current discussion going on around the max_ptes_* scaling (been busy with other stuff).
On 19.08.25 15:41, Nico Pache wrote: > The following series provides khugepaged with the capability to collapse > anonymous memory regions to mTHPs. > > To achieve this we generalize the khugepaged functions to no longer depend > on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of > pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the > PMD scan is done, we do binary recursion on the bitmap to find the optimal > mTHP sizes for the PMD range. The restriction on max_ptes_none is removed > during the scan, to make sure we account for the whole PMD range. When no > mTHP size is enabled, the legacy behavior of khugepaged is maintained. > max_ptes_none will be scaled by the attempted collapse order to determine > how full a mTHP must be to be eligible for the collapse to occur. If a > mTHP collapse is attempted, but contains swapped out, or shared pages, we > don't perform the collapse. It is now also possible to collapse to mTHPs > without requiring the PMD THP size to be enabled. > > With the default max_ptes_none=511, the code should keep its most of its > original behavior. When enabling multiple adjacent (m)THP sizes we need to > set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will > experience collapse "creep" and constantly promote mTHPs to the next > available size. This is due the fact that a collapse will introduce at > least 2x the number of pages, and on a future scan will satisfy the > promotion condition once again. > > Patch 1: Refactor/rename hpage_collapse > Patch 2: Some refactoring to combine madvise_collapse and khugepaged > Patch 3-5: Generalize khugepaged functions for arbitrary orders > Patch 6-8: The mTHP patches > Patch 9-10: Allow khugepaged to operate without PMD enabled > Patch 11-12: Tracing/stats > Patch 13: Documentation Would it be feasible to start with simply not supporting the max_pte_none parameter in the first version, just like we won't support max_pte_swapped/max_pte_shared in the first version? That gives us more time to think about how to use/modify the old interface. For example, I could envision a ratio-based interface, or as discussed with Lorenzo a simple boolean. We could make the existing max_ptes* interface backwards compatible then. That also gives us the opportunity to think about the creep problem separately. I'm sure initial mTHP collapse will be valuable even without support for that weird set of parameters. Would there be implementation-wise a problem? But let me think further about the creep problem ... :/ -- Cheers David / dhildenb
On 01.09.25 18:21, David Hildenbrand wrote: > On 19.08.25 15:41, Nico Pache wrote: >> The following series provides khugepaged with the capability to collapse >> anonymous memory regions to mTHPs. >> >> To achieve this we generalize the khugepaged functions to no longer depend >> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of >> pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the >> PMD scan is done, we do binary recursion on the bitmap to find the optimal >> mTHP sizes for the PMD range. The restriction on max_ptes_none is removed >> during the scan, to make sure we account for the whole PMD range. When no >> mTHP size is enabled, the legacy behavior of khugepaged is maintained. >> max_ptes_none will be scaled by the attempted collapse order to determine >> how full a mTHP must be to be eligible for the collapse to occur. If a >> mTHP collapse is attempted, but contains swapped out, or shared pages, we >> don't perform the collapse. It is now also possible to collapse to mTHPs >> without requiring the PMD THP size to be enabled. >> >> With the default max_ptes_none=511, the code should keep its most of its >> original behavior. When enabling multiple adjacent (m)THP sizes we need to >> set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will >> experience collapse "creep" and constantly promote mTHPs to the next >> available size. This is due the fact that a collapse will introduce at >> least 2x the number of pages, and on a future scan will satisfy the >> promotion condition once again. >> >> Patch 1: Refactor/rename hpage_collapse >> Patch 2: Some refactoring to combine madvise_collapse and khugepaged >> Patch 3-5: Generalize khugepaged functions for arbitrary orders >> Patch 6-8: The mTHP patches >> Patch 9-10: Allow khugepaged to operate without PMD enabled >> Patch 11-12: Tracing/stats >> Patch 13: Documentation > > Would it be feasible to start with simply not supporting the > max_pte_none parameter in the first version, just like we won't support > max_pte_swapped/max_pte_shared in the first version? > > That gives us more time to think about how to use/modify the old interface. > > For example, I could envision a ratio-based interface, or as discussed > with Lorenzo a simple boolean. We could make the existing max_ptes* > interface backwards compatible then. > > That also gives us the opportunity to think about the creep problem > separately. > > I'm sure initial mTHP collapse will be valuable even without support for > that weird set of parameters. > > Would there be implementation-wise a problem? > > But let me think further about the creep problem ... :/ FWIW, I just looked around and there is documented usage of setting max_ptes_none to 0 [1, 2, 3]. In essence, I think it can make sense to set it to 0 when an application wants to manage THP on its own (MADV_COLLAPSE), and avoid khugepaged interfering. Now, using a system-wide toggle for such a use case is rather questionable, but it's all we have. I did not find anything only recommending to set values different to 0 or 511 -- so far. So *likely* focusing on 0 vs. 511 initially would cover most use cases out there. Ignoring the parameter initially (require all to be !none) could of course also work. 
[1] https://www.mongodb.com/docs/manual/administration/tcmalloc-performance/
[2] https://google.github.io/tcmalloc/tuning.html
[3] https://support.yugabyte.com/hc/en-us/articles/36558155921165-Mitigating-Excessive-RSS-Memory-Usage-Due-to-THP-Transparent-Huge-Pages

-- 
Cheers

David / dhildenb
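For context on how such an application-managed setup could look, here is a
small userspace sketch under the assumption that the admin has set
khugepaged/max_ptes_none to 0 and the program requests huge pages itself via
MADV_COLLAPSE (available since Linux 6.1); the allocation size, the memset
population, and the error handling are illustrative only.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* defined since Linux 6.1 */
#endif

int main(void)
{
	size_t len = 2ul << 20;			/* one 2M PMD-sized region */
	char *buf = aligned_alloc(len, len);	/* PMD-aligned allocation */

	if (!buf)
		return 1;

	memset(buf, 1, len);			/* populate every page first */

	/* Ask for a huge page for this hot region only; khugepaged stays
	 * out of sparsely used memory because max_ptes_none is 0. */
	if (madvise(buf, len, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");

	free(buf);
	return 0;
}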
OK so I noticed in patch 13/13 (!) where you change the documentation that you
essentially state that the whole method used to determine the ratio of PTEs to
collapse to mTHP is broken:

	khugepaged uses max_ptes_none scaled to the order of the enabled
	mTHP size to determine collapses. When using mTHPs it's recommended
	to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
	on 4k page size). This will prevent undesired "creep" behavior that
	leads to continuously collapsing to the largest mTHP size; when we
	collapse, we are bringing in new non-zero pages that will, on a
	subsequent scan, cause the max_ptes_none check of the +1 order to
	always be satisfied. By limiting this to less than half the current
	order, we make sure we don't cause this feedback loop.
	max_ptes_shared and max_ptes_swap have no effect when collapsing to
	a mTHP, and mTHP collapse will fail on shared or swapped out pages.

This seems to me to suggest that using
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
of establishing a 'ratio' to do this calculation is fundamentally flawed.

So surely we ought to introduce a new sysfs tunable for this? Perhaps

/sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio

Or something like this?

It's already questionable that we are taking a value that is expressed
essentially in terms of PTE entries per PMD and then use it implicitly to
determine the ratio for mTHP, but to then say 'oh but the default value is
known-broken' is just a blocker for the series in my opinion.

This really has to be done a different way I think.

Cheers, Lorenzo
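A toy simulation of the feedback loop quoted above, assuming the shift-based
scaling of max_ptes_none; the chosen numbers (4 initially utilized PTEs,
max_ptes_none=300) and helper names are illustrative only.

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* 512 PTEs per PMD with 4K pages */

static unsigned int scaled_limit(unsigned int max_ptes_none, unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	unsigned int max_ptes_none = 300;	/* anything above 255 shows the creep */
	unsigned int utilized = 4;		/* PTEs the application actually touched */
	unsigned int order = 2;			/* region already collapsed to order 2 */

	while (order < HPAGE_PMD_ORDER) {
		unsigned int next = order + 1;
		unsigned int none = (1u << next) - utilized;

		if (none > scaled_limit(max_ptes_none, next))
			break;			/* threshold not met: promotion stops */

		printf("scan: %3u of %3u PTEs populated -> collapse to order %u\n",
		       utilized, 1u << next, next);
		utilized = 1u << next;		/* the collapse fills the region */
		order = next;
	}
	printf("settled at order %u (%u PTEs) although only 4 PTEs were ever touched\n",
	       order, 1u << order);
	return 0;
}

Re-running this with max_ptes_none at 255 or below makes the very first
promotion fail, which is the behaviour the documentation text is trying to
steer admins toward.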
On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > OK so I noticed in patch 13/13 (!) where you change the documentation that you > essentially state that the whole method used to determine the ratio of PTEs to > collapse to mTHP is broken: > > khugepaged uses max_ptes_none scaled to the order of the enabled > mTHP size to determine collapses. When using mTHPs it's recommended > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > on 4k page size). This will prevent undesired "creep" behavior that > leads to continuously collapsing to the largest mTHP size; when we > collapse, we are bringing in new non-zero pages that will, on a > subsequent scan, cause the max_ptes_none check of the +1 order to > always be satisfied. By limiting this to less than half the current > order, we make sure we don't cause this feedback > loop. max_ptes_shared and max_ptes_swap have no effect when > collapsing to a mTHP, and mTHP collapse will fail on shared or > swapped out pages. > > This seems to me to suggest that using > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > Or something like this? > > It's already questionable that we are taking a value that is expressed > essentially in terms of PTE entries per PMD and then use it implicitly to > determine the ratio for mTHP, but to then say 'oh but the default value is > known-broken' is just a blocker for the series in my opinion. > > This really has to be done a different way I think. > > Cheers, Lorenzo FWIW this was my version of the documentation patch: https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ The discussion about the creep problem started here: https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ and the discussion continuing here: https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ ending with a summary I gave here: https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ This should help you with the context.
* Dev Jain <dev.jain@arm.com> [250821 11:14]: > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > essentially state that the whole method used to determine the ratio of PTEs to > > collapse to mTHP is broken: > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > mTHP size to determine collapses. When using mTHPs it's recommended > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > on 4k page size). This will prevent undesired "creep" behavior that > > leads to continuously collapsing to the largest mTHP size; when we > > collapse, we are bringing in new non-zero pages that will, on a > > subsequent scan, cause the max_ptes_none check of the +1 order to > > always be satisfied. By limiting this to less than half the current > > order, we make sure we don't cause this feedback > > loop. max_ptes_shared and max_ptes_swap have no effect when > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > swapped out pages. > > > > This seems to me to suggest that using > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > Or something like this? > > > > It's already questionable that we are taking a value that is expressed > > essentially in terms of PTE entries per PMD and then use it implicitly to > > determine the ratio for mTHP, but to then say 'oh but the default value is > > known-broken' is just a blocker for the series in my opinion. > > > > This really has to be done a different way I think. > > > > Cheers, Lorenzo > > FWIW this was my version of the documentation patch: > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > The discussion about the creep problem started here: > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > and the discussion continuing here: > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > ending with a summary I gave here: > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > This should help you with the context. Thanks for hunting this down, the context should be referenced in the change log so we can find it easier in the future (and now). Or at least in the cover letter. The way the change log in the cover letter is written makes it exceedingly long. Could you switch to listing the changes from v9 and links to v1-8 (+RFCs if there are any)? Well, I guess include v10 changes and v1-9 urls.. At the length it is now, it's most likely a tl;dr for most. If you're starting to review this at v10, then you'd probably appreciate not rehashing discussions and if you're going from v9 then you already have an idea of what v10 should have changed. Said another way, the changelog is more useful with context and context is difficult to find without a lore link. I am having issues tracking down the contexts of many items of what has been generated here and it'll only get worse as time moves on. We do our best to keep change logs with the necessary details, but having bread crumbs to follow is extremely helpful for review and in the long run. Thanks, Liam
On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote: > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > essentially state that the whole method used to determine the ratio of PTEs to > > collapse to mTHP is broken: > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > mTHP size to determine collapses. When using mTHPs it's recommended > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > on 4k page size). This will prevent undesired "creep" behavior that > > leads to continuously collapsing to the largest mTHP size; when we > > collapse, we are bringing in new non-zero pages that will, on a > > subsequent scan, cause the max_ptes_none check of the +1 order to > > always be satisfied. By limiting this to less than half the current > > order, we make sure we don't cause this feedback > > loop. max_ptes_shared and max_ptes_swap have no effect when > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > swapped out pages. > > > > This seems to me to suggest that using > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > Or something like this? > > > > It's already questionable that we are taking a value that is expressed > > essentially in terms of PTE entries per PMD and then use it implicitly to > > determine the ratio for mTHP, but to then say 'oh but the default value is > > known-broken' is just a blocker for the series in my opinion. > > > > This really has to be done a different way I think. > > > > Cheers, Lorenzo > > FWIW this was my version of the documentation patch: > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > The discussion about the creep problem started here: > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > and the discussion continuing here: > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > ending with a summary I gave here: > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > This should help you with the context. > > Thanks and I"ll have a look, but this series is unmergeable with a broken default in /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio sorry. We need to have a new tunable as far as I can tell. I also find the use of this PMD-specific value as an arbitrary way of expressing a ratio pretty gross. Thanks, Lorenzo
On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote: > > > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > > essentially state that the whole method used to determine the ratio of PTEs to > > > collapse to mTHP is broken: > > > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > > mTHP size to determine collapses. When using mTHPs it's recommended > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > > on 4k page size). This will prevent undesired "creep" behavior that > > > leads to continuously collapsing to the largest mTHP size; when we > > > collapse, we are bringing in new non-zero pages that will, on a > > > subsequent scan, cause the max_ptes_none check of the +1 order to > > > always be satisfied. By limiting this to less than half the current > > > order, we make sure we don't cause this feedback > > > loop. max_ptes_shared and max_ptes_swap have no effect when > > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > > swapped out pages. > > > > > > This seems to me to suggest that using > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > > Or something like this? > > > > > > It's already questionable that we are taking a value that is expressed > > > essentially in terms of PTE entries per PMD and then use it implicitly to > > > determine the ratio for mTHP, but to then say 'oh but the default value is > > > known-broken' is just a blocker for the series in my opinion. > > > > > > This really has to be done a different way I think. > > > > > > Cheers, Lorenzo > > > > FWIW this was my version of the documentation patch: > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > > > The discussion about the creep problem started here: > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > > > and the discussion continuing here: > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > > > ending with a summary I gave here: > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > > > This should help you with the context. > > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > default in > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > sorry. > > We need to have a new tunable as far as I can tell. I also find the use of > this PMD-specific value as an arbitrary way of expressing a ratio pretty > gross. The first thing that comes to mind is that we can pin max_ptes_none to 255 if it exceeds 255. It's worth noting that the issue occurs only for adjacently enabled mTHP sizes. ie) if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 temp_max_ptes_none = 255; > > Thanks, Lorenzo >
On Thu, Aug 21, 2025 at 9:25 AM Nico Pache <npache@redhat.com> wrote: > > On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > > > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote: > > > > > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > > > essentially state that the whole method used to determine the ratio of PTEs to > > > > collapse to mTHP is broken: > > > > > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > > > mTHP size to determine collapses. When using mTHPs it's recommended > > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > > > on 4k page size). This will prevent undesired "creep" behavior that > > > > leads to continuously collapsing to the largest mTHP size; when we > > > > collapse, we are bringing in new non-zero pages that will, on a > > > > subsequent scan, cause the max_ptes_none check of the +1 order to > > > > always be satisfied. By limiting this to less than half the current > > > > order, we make sure we don't cause this feedback > > > > loop. max_ptes_shared and max_ptes_swap have no effect when > > > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > > > swapped out pages. > > > > > > > > This seems to me to suggest that using > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > > > > Or something like this? > > > > > > > > It's already questionable that we are taking a value that is expressed > > > > essentially in terms of PTE entries per PMD and then use it implicitly to > > > > determine the ratio for mTHP, but to then say 'oh but the default value is > > > > known-broken' is just a blocker for the series in my opinion. > > > > > > > > This really has to be done a different way I think. > > > > > > > > Cheers, Lorenzo > > > > > > FWIW this was my version of the documentation patch: > > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > > > > > The discussion about the creep problem started here: > > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > > > > > and the discussion continuing here: > > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > > > > > ending with a summary I gave here: > > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > > > > > This should help you with the context. > > > > > > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > > default in > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > sorry. > > > > We need to have a new tunable as far as I can tell. I also find the use of > > this PMD-specific value as an arbitrary way of expressing a ratio pretty > > gross. > The first thing that comes to mind is that we can pin max_ptes_none to > 255 if it exceeds 255. It's worth noting that the issue occurs only > for adjacently enabled mTHP sizes. 
>
> ie)
> if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255
>     temp_max_ptes_none = 255;

Oh and my second point, introducing a new tunable to control mTHP
collapse may become exceedingly complex from a tuning and code
management standpoint.

> >
> > Thanks, Lorenzo
> >
On Thu, Aug 21, 2025 at 09:27:19AM -0600, Nico Pache wrote: > On Thu, Aug 21, 2025 at 9:25 AM Nico Pache <npache@redhat.com> wrote: > > > > On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote: > > > > > > > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > > > > essentially state that the whole method used to determine the ratio of PTEs to > > > > > collapse to mTHP is broken: > > > > > > > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > > > > mTHP size to determine collapses. When using mTHPs it's recommended > > > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > > > > on 4k page size). This will prevent undesired "creep" behavior that > > > > > leads to continuously collapsing to the largest mTHP size; when we > > > > > collapse, we are bringing in new non-zero pages that will, on a > > > > > subsequent scan, cause the max_ptes_none check of the +1 order to > > > > > always be satisfied. By limiting this to less than half the current > > > > > order, we make sure we don't cause this feedback > > > > > loop. max_ptes_shared and max_ptes_swap have no effect when > > > > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > > > > swapped out pages. > > > > > > > > > > This seems to me to suggest that using > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > > > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > > > > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > > > > > > Or something like this? > > > > > > > > > > It's already questionable that we are taking a value that is expressed > > > > > essentially in terms of PTE entries per PMD and then use it implicitly to > > > > > determine the ratio for mTHP, but to then say 'oh but the default value is > > > > > known-broken' is just a blocker for the series in my opinion. > > > > > > > > > > This really has to be done a different way I think. > > > > > > > > > > Cheers, Lorenzo > > > > > > > > FWIW this was my version of the documentation patch: > > > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > > > > > > > The discussion about the creep problem started here: > > > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > > > > > > > and the discussion continuing here: > > > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > > > > > > > ending with a summary I gave here: > > > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > > > > > > > This should help you with the context. > > > > > > > > > > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > > > default in > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > sorry. > > > > > > We need to have a new tunable as far as I can tell. I also find the use of > > > this PMD-specific value as an arbitrary way of expressing a ratio pretty > > > gross. > > The first thing that comes to mind is that we can pin max_ptes_none to > > 255 if it exceeds 255. It's worth noting that the issue occurs only > > for adjacently enabled mTHP sizes. No! 
Presumably the default of 511 (for PMDs with 512 entries) is set for a
reason; arbitrarily changing this to suit a specific case seems crazy, no?

> >
> > ie)
> > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255
> >     temp_max_ptes_none = 255;
> Oh and my second point, introducing a new tunable to control mTHP
> collapse may become exceedingly complex from a tuning and code
> management standpoint.

Umm right now you have a ratio expressed in PTEs per mTHP * ((PTEs per PMD) /
PMD), 'except please don't set it to the usual default when using mTHP', and
it's currently default-broken.

I'm really not sure how that is simpler than a separate tunable that can be
expressed as a ratio (e.g. percentage) that actually makes some kind of sense?

And we can make anything workable from a code management point of view by
refactoring/developing appropriately.

And given you're now proposing changing the default for even THP pages with a
cap, or perhaps having mTHP being used silently change the cap - that is
clearly _far_ worse from a tuning standpoint.

With a new tunable you can just set a sensible default and people don't even
necessarily have to think about it.
On Thu, Aug 21, 2025 at 9:40 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Thu, Aug 21, 2025 at 09:27:19AM -0600, Nico Pache wrote: > > On Thu, Aug 21, 2025 at 9:25 AM Nico Pache <npache@redhat.com> wrote: > > > > > > On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes > > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > > > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote: > > > > > > > > > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > > > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > > > > > essentially state that the whole method used to determine the ratio of PTEs to > > > > > > collapse to mTHP is broken: > > > > > > > > > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > > > > > mTHP size to determine collapses. When using mTHPs it's recommended > > > > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > > > > > on 4k page size). This will prevent undesired "creep" behavior that > > > > > > leads to continuously collapsing to the largest mTHP size; when we > > > > > > collapse, we are bringing in new non-zero pages that will, on a > > > > > > subsequent scan, cause the max_ptes_none check of the +1 order to > > > > > > always be satisfied. By limiting this to less than half the current > > > > > > order, we make sure we don't cause this feedback > > > > > > loop. max_ptes_shared and max_ptes_swap have no effect when > > > > > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > > > > > swapped out pages. > > > > > > > > > > > > This seems to me to suggest that using > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > > > > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > > > > > > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > > > > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > > > > > > > > Or something like this? > > > > > > > > > > > > It's already questionable that we are taking a value that is expressed > > > > > > essentially in terms of PTE entries per PMD and then use it implicitly to > > > > > > determine the ratio for mTHP, but to then say 'oh but the default value is > > > > > > known-broken' is just a blocker for the series in my opinion. > > > > > > > > > > > > This really has to be done a different way I think. > > > > > > > > > > > > Cheers, Lorenzo > > > > > > > > > > FWIW this was my version of the documentation patch: > > > > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > > > > > > > > > The discussion about the creep problem started here: > > > > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > > > > > > > > > and the discussion continuing here: > > > > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > > > > > > > > > ending with a summary I gave here: > > > > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > > > > > > > > > This should help you with the context. > > > > > > > > > > > > > > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > > > > default in > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > sorry. > > > > > > > > We need to have a new tunable as far as I can tell. 
I also find the use of > > > > this PMD-specific value as an arbitrary way of expressing a ratio pretty > > > > gross. > > > The first thing that comes to mind is that we can pin max_ptes_none to > > > 255 if it exceeds 255. It's worth noting that the issue occurs only > > > for adjacently enabled mTHP sizes. > > No! Presumably the default of 511 (for PMDs with 512 entries) is set for a > reason, arbitrarily changing this to suit a specific case seems crazy no? We wouldn't be changing it for PMD collapse, just for the new behavior. At 511, no mTHP collapses would ever occur anyways, unless you have 2MB disabled and other mTHP sizes enabled. Technically at 511 only the highest enabled order always gets collapsed. Ive also argued in the past that 511 is a terrible default for anything other than thp.enabled=always, but that's a whole other can of worms we dont need to discuss now. with this cap of 255, the PMD scan/collapse would work as intended, then in mTHP collapses we would never introduce this undesired behavior. We've discussed before that this would be a hard problem to solve without introducing some expensive way of tracking what has already been through a collapse, and that doesnt even consider what happens if things change or are unmapped, and rescanning that section would be helpful. So having a strictly enforced limit of 255 actually seems like a good idea to me, as it completely avoids the undesired behavior and does not require the admins to be aware of such an issue. Another thought similar to what (IIRC) Dev has mentioned before, if we have max_ptes_none > 255 then we only consider collapses to the largest enabled order, that way no creep to the largest enabled order would occur in the first place, and we would get there straight away. To me one of these two solutions seem sane in the context of what we are dealing with. > > > > > > > ie) > > > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 > > > temp_max_ptes_none = 255; > > Oh and my second point, introducing a new tunable to control mTHP > > collapse may become exceedingly complex from a tuning and code > > management standpoint. > > Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) / > PMD) 'except please don't set to the usual default when using mTHP' and it's > currently default-broken. > > I'm really not sure how that is simpler than a seprate tunable that can be > expressed as a ratio (e.g. percentage) that actually makes some kind of sense? I agree that the current tunable wasn't designed for this, but we tried to come up with something that leverages the tunable we have to avoid new tunables and added complexity. > > And we can make anything workable from a code management point of view by > refactoring/developing appropriately. What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte (ideally the max number)? seems like we would be saying we want no new none pages, but also to allow new none pages. To me that seems equally broken and more confusing than just taking a scale of the current number (now with a cap). -- Nico > > And given you're now proposing changing the default for even THP pages with a > cap or perhaps having mTHP being used silently change the cap - that is clearly > _far_ worse from a tuning standpoint. > > With a new tunable you can just set a sensible default and people don't even > necessarily have to think about it. >
On Thu, Aug 21, 2025 at 10:46:18AM -0600, Nico Pache wrote: > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > > > > > default in > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > sorry. > > > > > > > > > > We need to have a new tunable as far as I can tell. I also find the use of > > > > > this PMD-specific value as an arbitrary way of expressing a ratio pretty > > > > > gross. > > > > The first thing that comes to mind is that we can pin max_ptes_none to > > > > 255 if it exceeds 255. It's worth noting that the issue occurs only > > > > for adjacently enabled mTHP sizes. > > > > No! Presumably the default of 511 (for PMDs with 512 entries) is set for a > > reason, arbitrarily changing this to suit a specific case seems crazy no? > We wouldn't be changing it for PMD collapse, just for the new > behavior. At 511, no mTHP collapses would ever occur anyways, unless > you have 2MB disabled and other mTHP sizes enabled. Technically at 511 > only the highest enabled order always gets collapsed. > > Ive also argued in the past that 511 is a terrible default for > anything other than thp.enabled=always, but that's a whole other can > of worms we dont need to discuss now. > > with this cap of 255, the PMD scan/collapse would work as intended, > then in mTHP collapses we would never introduce this undesired > behavior. We've discussed before that this would be a hard problem to > solve without introducing some expensive way of tracking what has > already been through a collapse, and that doesnt even consider what > happens if things change or are unmapped, and rescanning that section > would be helpful. So having a strictly enforced limit of 255 actually > seems like a good idea to me, as it completely avoids the undesired > behavior and does not require the admins to be aware of such an issue. > > Another thought similar to what (IIRC) Dev has mentioned before, if we > have max_ptes_none > 255 then we only consider collapses to the > largest enabled order, that way no creep to the largest enabled order > would occur in the first place, and we would get there straight away. > > To me one of these two solutions seem sane in the context of what we > are dealing with. > > > > > > > > > > ie) > > > > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 > > > > temp_max_ptes_none = 255; > > > Oh and my second point, introducing a new tunable to control mTHP > > > collapse may become exceedingly complex from a tuning and code > > > management standpoint. > > > > Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) / > > PMD) 'except please don't set to the usual default when using mTHP' and it's > > currently default-broken. > > > > I'm really not sure how that is simpler than a seprate tunable that can be > > expressed as a ratio (e.g. percentage) that actually makes some kind of sense? > I agree that the current tunable wasn't designed for this, but we > tried to come up with something that leverages the tunable we have to > avoid new tunables and added complexity. > > > > And we can make anything workable from a code management point of view by > > refactoring/developing appropriately. > What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte > (ideally the max number)? seems like we would be saying we want no new > none pages, but also to allow new none pages. To me that seems equally > broken and more confusing than just taking a scale of the current > number (now with a cap). 
>
>

The one thing we absolutely cannot have is a default that causes this
'creeping' behaviour. This feels like shipping something that is broken and
alluding to it in the documentation.

I spoke to David off-list and he gave some insight into this and perhaps
some reasonable means of avoiding an additional tunable.

I don't want to rehash what he said as I think it's more productive for him
to reply when he has time but broadly I think how we handle this needs
careful consideration.

To me it's clear that some sense of ratio is just immediately very very
confusing, but then again this interface is already confusing, as with much
of THP.

Anyway I'll let David respond here so we don't loop around before he has a
chance to add his input.

Cheers, Lorenzo
On Thu, Aug 21, 2025 at 10:55 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Thu, Aug 21, 2025 at 10:46:18AM -0600, Nico Pache wrote: > > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > > > > > > default in > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > > sorry. > > > > > > > > > > > > We need to have a new tunable as far as I can tell. I also find the use of > > > > > > this PMD-specific value as an arbitrary way of expressing a ratio pretty > > > > > > gross. > > > > > The first thing that comes to mind is that we can pin max_ptes_none to > > > > > 255 if it exceeds 255. It's worth noting that the issue occurs only > > > > > for adjacently enabled mTHP sizes. > > > > > > No! Presumably the default of 511 (for PMDs with 512 entries) is set for a > > > reason, arbitrarily changing this to suit a specific case seems crazy no? > > We wouldn't be changing it for PMD collapse, just for the new > > behavior. At 511, no mTHP collapses would ever occur anyways, unless > > you have 2MB disabled and other mTHP sizes enabled. Technically at 511 > > only the highest enabled order always gets collapsed. > > > > Ive also argued in the past that 511 is a terrible default for > > anything other than thp.enabled=always, but that's a whole other can > > of worms we dont need to discuss now. > > > > with this cap of 255, the PMD scan/collapse would work as intended, > > then in mTHP collapses we would never introduce this undesired > > behavior. We've discussed before that this would be a hard problem to > > solve without introducing some expensive way of tracking what has > > already been through a collapse, and that doesnt even consider what > > happens if things change or are unmapped, and rescanning that section > > would be helpful. So having a strictly enforced limit of 255 actually > > seems like a good idea to me, as it completely avoids the undesired > > behavior and does not require the admins to be aware of such an issue. > > > > Another thought similar to what (IIRC) Dev has mentioned before, if we > > have max_ptes_none > 255 then we only consider collapses to the > > largest enabled order, that way no creep to the largest enabled order > > would occur in the first place, and we would get there straight away. > > > > To me one of these two solutions seem sane in the context of what we > > are dealing with. > > > > > > > > > > > > > ie) > > > > > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 > > > > > temp_max_ptes_none = 255; > > > > Oh and my second point, introducing a new tunable to control mTHP > > > > collapse may become exceedingly complex from a tuning and code > > > > management standpoint. > > > > > > Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) / > > > PMD) 'except please don't set to the usual default when using mTHP' and it's > > > currently default-broken. > > > > > > I'm really not sure how that is simpler than a seprate tunable that can be > > > expressed as a ratio (e.g. percentage) that actually makes some kind of sense? > > I agree that the current tunable wasn't designed for this, but we > > tried to come up with something that leverages the tunable we have to > > avoid new tunables and added complexity. > > > > > > And we can make anything workable from a code management point of view by > > > refactoring/developing appropriately. > > What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte > > (ideally the max number)? 
seems like we would be saying we want no new
> > none pages, but also to allow new none pages. To me that seems equally
> > broken and more confusing than just taking a scale of the current
> > number (now with a cap).
> >
>
> The one thing we absolutely cannot have is a default that causes this
> 'creeping' behaviour. This feels like shipping something that is broken and
> alluding to it in the documentation.

Ok I've put a lot of thought and time into this and came up with a solution.

Here is what I currently have tested and would like to propose:

- Expand the bitmap to HPAGE_PMD_NR (512)*. This increases the accuracy of
  the max_ptes_none handling, and removes a lot of inaccuracies caused by
  the compression into 128 bits that was being done. This also makes the
  code a lot easier to understand.

- When attempting mTHP-level collapses, cap max_ptes_none to 255 to prevent
  the creep issue (a small sketch of this cap follows this mail).

I've tested this and found it performs better than my previous version,
allows for more granular control via max_ptes_none, and prevents the creep
issue without any admin knowledge needed. I think this is a good middle
ground between completely disabling the fine-tune control and doing a better
job at mitigating misconfiguration.

* Baolin actually also expands the bitmap to 512 in his khugepaged file mTHP
  collapse support patchset.

Does this sound reasonable to you?
-- Nico

>
> I spoke to David off-list and he gave some insight into this and perhaps
> some reasonable means of avoiding an additional tunable.
>
> I don't want to rehash what he said as I think it's more productive for him
> to reply when he has time but broadly I think how we handle this needs
> careful consideration.
>
> To me it's clear that some sense of ratio is just immediately very very
> confusing, but then again this interface is already confusing, as with much
> of THP.
>
> Anyway I'll let David respond here so we don't loop around before he has a
> chance to add his input.
>
> Cheers, Lorenzo
>
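A compact sketch of the proposed cap, assuming the same shift-based scaling
as in the earlier sketch; collapse_max_ptes_none() and the constants are
illustrative stand-ins for whatever helper the next revision actually adds.

#include <stdio.h>

#define HPAGE_PMD_ORDER	9
#define HPAGE_PMD_NR	(1u << HPAGE_PMD_ORDER)

/* Assumed behavior: PMD-sized collapse keeps the admin's value untouched,
 * while sub-PMD (mTHP) attempts are silently capped at 255 before scaling. */
static unsigned int collapse_max_ptes_none(unsigned int max_ptes_none,
					   unsigned int order)
{
	if (order != HPAGE_PMD_ORDER && max_ptes_none > HPAGE_PMD_NR / 2 - 1)
		max_ptes_none = HPAGE_PMD_NR / 2 - 1;	/* 255 with 4K pages */

	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	/* default 511: PMD keeps 511, but an order-4 (64K) attempt sees 7 */
	printf("order 9: %u\n", collapse_max_ptes_none(511, 9));
	printf("order 4: %u\n", collapse_max_ptes_none(511, 4));
	return 0;
}

With such a cap, every sub-PMD promotion step requires the region to already
be more than half populated, so the creep loop from the earlier simulation
can no longer trigger.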
On 04.09.25 04:44, Nico Pache wrote: > On Thu, Aug 21, 2025 at 10:55 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: >> >> On Thu, Aug 21, 2025 at 10:46:18AM -0600, Nico Pache wrote: >>>>>>> Thanks and I"ll have a look, but this series is unmergeable with a broken >>>>>>> default in >>>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio >>>>>>> sorry. >>>>>>> >>>>>>> We need to have a new tunable as far as I can tell. I also find the use of >>>>>>> this PMD-specific value as an arbitrary way of expressing a ratio pretty >>>>>>> gross. >>>>>> The first thing that comes to mind is that we can pin max_ptes_none to >>>>>> 255 if it exceeds 255. It's worth noting that the issue occurs only >>>>>> for adjacently enabled mTHP sizes. >>>> >>>> No! Presumably the default of 511 (for PMDs with 512 entries) is set for a >>>> reason, arbitrarily changing this to suit a specific case seems crazy no? >>> We wouldn't be changing it for PMD collapse, just for the new >>> behavior. At 511, no mTHP collapses would ever occur anyways, unless >>> you have 2MB disabled and other mTHP sizes enabled. Technically at 511 >>> only the highest enabled order always gets collapsed. >>> >>> Ive also argued in the past that 511 is a terrible default for >>> anything other than thp.enabled=always, but that's a whole other can >>> of worms we dont need to discuss now. >>> >>> with this cap of 255, the PMD scan/collapse would work as intended, >>> then in mTHP collapses we would never introduce this undesired >>> behavior. We've discussed before that this would be a hard problem to >>> solve without introducing some expensive way of tracking what has >>> already been through a collapse, and that doesnt even consider what >>> happens if things change or are unmapped, and rescanning that section >>> would be helpful. So having a strictly enforced limit of 255 actually >>> seems like a good idea to me, as it completely avoids the undesired >>> behavior and does not require the admins to be aware of such an issue. >>> >>> Another thought similar to what (IIRC) Dev has mentioned before, if we >>> have max_ptes_none > 255 then we only consider collapses to the >>> largest enabled order, that way no creep to the largest enabled order >>> would occur in the first place, and we would get there straight away. >>> >>> To me one of these two solutions seem sane in the context of what we >>> are dealing with. >>>> >>>>>> >>>>>> ie) >>>>>> if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 >>>>>> temp_max_ptes_none = 255; >>>>> Oh and my second point, introducing a new tunable to control mTHP >>>>> collapse may become exceedingly complex from a tuning and code >>>>> management standpoint. >>>> >>>> Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) / >>>> PMD) 'except please don't set to the usual default when using mTHP' and it's >>>> currently default-broken. >>>> >>>> I'm really not sure how that is simpler than a seprate tunable that can be >>>> expressed as a ratio (e.g. percentage) that actually makes some kind of sense? >>> I agree that the current tunable wasn't designed for this, but we >>> tried to come up with something that leverages the tunable we have to >>> avoid new tunables and added complexity. >>>> >>>> And we can make anything workable from a code management point of view by >>>> refactoring/developing appropriately. >>> What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte >>> (ideally the max number)? 
seems like we would be saying we want no new >>> none pages, but also to allow new none pages. To me that seems equally >>> broken and more confusing than just taking a scale of the current >>> number (now with a cap). >>> >>> >> >> The one thing we absolutely cannot have is a default that causes this >> 'creeping' behaviour. This feels like shipping something that is broken and >> alluding to it in the documentation. > Ok I've put a lot of thought and time into this and came up with a solution. > > Here is what I currently have tested and would like to proposing: > > - Expand bitmap to HPAGE_PMD_NR (512)*, this increases the accuracy of > the max_pte_none handling, and removes a lot of inaccuracies caused by > the compression into 128 bits that was being done. This also makes the > code a lot easier to understand. That sounds good to me. Should make the code easier as well. > > - When attempting mTHP level collapses cap max_ptes_none to 255 to > prevent the creep issue I guess the documentation would then state something like * When collapsing smaller THPs, "max_ptes_none" is scaled proportionally to the THP size. * When collapsing smaller THPs, "max_ptes_none" may be internally capped at 255 if it exceeds 255 but is not set to the default (511). Not 100% a fan of all of that, but maybe the only option when wanting to avoid other toggles. The only alternative would really be respecting only 0/511 for mTHP, and not doing any scaling. That would obviously make the documentation easier and would allow us to revisit that later. The documentation would be: * When collapsing smaller THPs, "max_ptes_none" may be interpreted as "0" when set to a value different to the default (511). This behavior might change in the future. > > Ive tested this and found this performs better than my previous > version, allows for more granular control via max_ptes_none, and > prevents the creep issue without any admin knowledge needed. How would this interact with the shrinker once extended to mTHP? Would your RFC patch be sufficient for that or would we actually also want to cap? I haven't fully thought this through yet. I'd assume we would not want to cap here. Which makes the doc weird as well, lol. -- Cheers David / dhildenb
> > The one thing we absolutely cannot have is a default that causes this > 'creeping' behaviour. This feels like shipping something that is broken and > alluding to it in the documentation. > > I spoke to David off-list and he gave some insight into this and perhaps > some reasonable means of avoiding an additional tunable. > > I don't want to rehash what he said as I think it's more productive for him > to reply when he has time but broadly I think how we handle this needs > careful consideration. > > To me it's clear that some sense of ratio is just immediately very very > confusing, but then again this interface is already confusing, as with much > of THP. > > Anyway I'll let David respond here so we don't loop around before he has a > chance to add his input. > > Cheers, Lorenzo > [Resending because Thunderbird decided to use the wrong smtp server] I've been summoned. As raised in the past, I would initially only support specific values here like 0 : Never collapse with any pte_none/zeropage 511 (HPAGE_PMD_NR - 1) / default : Always collapse, ignoring pte_none/zeropage One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1), but not sure if we have to add that for now. Because, as raised in the past, I'm afraid nobody on this earth has a clue how to set this parameter to values different to 0 (don't waste memory with khugepaged) and 511 (page fault behavior). If any other value is set, essentially pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); for now and just disable it. -- Cheers David / dhildenb
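A rough sketch of what that suggestion could look like (the function name and warning string are illustrative; only the 0/511 values and the idea of warning are taken from the mail above):

static bool max_ptes_none_supported(unsigned int order)
{
        /* PMD collapse keeps the existing semantics untouched */
        if (order == HPAGE_PMD_ORDER)
                return true;

        /* for mTHP, only honour the two well-defined values */
        if (khugepaged_max_ptes_none == 0 ||
            khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
                return true;

        pr_warn_once("khugepaged: unsupported 'max_ptes_none' value for mTHP collapse\n");
        return false;
}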
On Thu, Aug 21, 2025 at 10:43:35PM +0200, David Hildenbrand wrote: > > > > The one thing we absolutely cannot have is a default that causes this > > 'creeping' behaviour. This feels like shipping something that is broken and > > alluding to it in the documentation. > > > > I spoke to David off-list and he gave some insight into this and perhaps > > some reasonable means of avoiding an additional tunable. > > > > I don't want to rehash what he said as I think it's more productive for him > > to reply when he has time but broadly I think how we handle this needs > > careful consideration. > > > > To me it's clear that some sense of ratio is just immediately very very > > confusing, but then again this interface is already confusing, as with much > > of THP. > > > > Anyway I'll let David respond here so we don't loop around before he has a > > chance to add his input. > > > > Cheers, Lorenzo > > > > [Resending because Thunderbird decided to use the wrong smtp server] > > I've been summoned. Welcome :) > > As raised in the past, I would initially only support specific values here like > > 0 : Never collapse with any pte_none/zeropage > 511 (HPAGE_PMD_NR - 1) / default : Always collapse, ignoring pte_none/zeropage > OK, so if we had effectively an off/on (I guess we have to keep this as it is for legacy purposes) and it is forced to one or the other of these values, then fine (as long as we don't have uAPI worries). > Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure > if we have to add that for now. Yeah, not so sure about this; this is a 'just have to know' too, and yes you might add it to the docs, but people are going to be mightily confused, especially if it's a calculated value. I don't see any other way around having a separate tunable if we don't just have something VERY simple like on/off. Also, the mentioned issue honestly sounds like something that needs to be fixed elsewhere, in the algorithm used to figure out mTHP ranges (I may be wrong - and happy to stand corrected if this is somehow inherent, but it really feels that way). > > Because, as raised in the past, I'm afraid nobody on this earth has a clue how > to set this parameter to values different to 0 (don't waste memory with khugepaged) > and 511 (page fault behavior). Yup > > > If any other value is set, essentially > pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); > > for now and just disable it. Hmm, but under what circumstances? I would just say "unsupported value" and not mention mTHP, or people who don't use mTHP might find that confusing. > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo
>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure >> if we have to add that for now. > > Yeah not so sure about this, this is a 'just have to know' too, and yes you > might add it to the docs, but people are going to be mightily confused, esp if > it's a calculated value. > > I don't see any other way around having a separate tunable if we don't just have > something VERY simple like on/off. Yeah, not advocating that we add support for other values than 0/511, really. > > Also the mentioned issue sounds like something that needs to be fixed elsewhere > honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and > happy to stand corrected if this is somehow inherent, but reallly feels that > way). I think the creep is unavoidable for certain values. If you have the first two pages of a PMD area populated, and you allow for at least half of the #PTEs to be none/zero, you'd collapse first an order-2 folio, then an order-3 ... until you reach PMD order. So for now we really should just support 0 / 511 to say "don't collapse if there are holes" vs. "always collapse if there is at least one pte used". > >> >> Because, as raised in the past, I'm afraid nobody on this earth has a clue how >> to set this parameter to values different to 0 (don't waste memory with khugepaged) >> and 511 (page fault behavior). > > Yup > >> >> >> If any other value is set, essentially >> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); >> >> for now and just disable it. > > Hmm but under what circumstances? I would just say unsupported value not mention > mTHP or people who don't use mTHP might find that confusing. Well, we can check whether any mTHP size is enabled while the value is set to something unexpected. We can then even print the problematic sizes if we have to. We could also just say that if the value is set to something other than 511 (which is the default), it will be treated as being "0" when collapsing mTHP, instead of doing any scaling. -- Cheers David / dhildenb
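The progression David describes can be reproduced with a few lines of userspace arithmetic (purely illustrative; the "at least half populated" rule corresponds to max_ptes_none = 255 at PMD scale):

#include <stdio.h>

int main(void)
{
        unsigned int present = 2;       /* two PTEs populated to start with */

        for (unsigned int order = 2; order <= 9; order++) {
                unsigned int nr = 1u << order;  /* pages covered by this order */

                /* eligible when at least half of the range is populated */
                if (present * 2 >= nr) {
                        printf("order-%u collapse: %u present of %u\n",
                               order, present, nr);
                        present = nr;   /* the collapse fills the whole range */
                }
        }
        return 0;
}

Each collapse doubles the number of present pages, so on a later scan the next order is eligible again; that is the creep.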
(Sorry for chiming in late) On 2025/8/22 22:10, David Hildenbrand wrote: >>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>> but not sure >>> if we have to add that for now. >> >> Yeah not so sure about this, this is a 'just have to know' too, and >> yes you >> might add it to the docs, but people are going to be mightily >> confused, esp if >> it's a calculated value. >> >> I don't see any other way around having a separate tunable if we don't >> just have >> something VERY simple like on/off. > > Yeah, not advocating that we add support for other values than 0/511, > really. > >> >> Also the mentioned issue sounds like something that needs to be fixed >> elsewhere >> honestly in the algorithm used to figure out mTHP ranges (I may be >> wrong - and >> happy to stand corrected if this is somehow inherent, but reallly >> feels that >> way). > > I think the creep is unavoidable for certain values. > > If you have the first two pages of a PMD area populated, and you allow > for at least half of the #PTEs to be non/zero, you'd collapse first a > order-2 folio, then and order-3 ... until you reached PMD order. > > So for now we really should just support 0 / 511 to say "don't collapse > if there are holes" vs. "always collapse if there is at least one pte > used". If we only allow setting 0 or 511, as Nico mentioned before, "At 511, no mTHP collapses would ever occur anyway, unless you have 2MB disabled and other mTHP sizes enabled. Technically, at 511, only the highest enabled order would ever be collapsed." In other words, for the scenario you described, although there are only 2 PTEs present in a PMD, it would still get collapsed into a PMD-sized THP. In reality, what we probably need is just an order-2 mTHP collapse. If 'khugepaged_max_ptes_none' is set to 255, I think this would achieve the desired result: when there are only 2 PTEs present in a PMD, an order-2 mTHP collapse would be successed, but it wouldn’t creep up to an order-3 mTHP collapse. That’s because: When attempting an order-3 mTHP collapse, 'threshold_bits' = 1, while 'bits_set' = 1 (means only 1 chunk is present), so 'bits_set > threshold_bits' is false, then an order-3 mTHP collapse wouldn’t be attempted. No? So I have some concerns that if we only allow setting 0 or 511, it may not meet the goal we have for mTHP collapsing. >>> Because, as raised in the past, I'm afraid nobody on this earth has a >>> clue how >>> to set this parameter to values different to 0 (don't waste memory >>> with khugepaged) >>> and 511 (page fault behavior). >> >> Yup >> >>> >>> >>> If any other value is set, essentially >>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); >>> >>> for now and just disable it. >> >> Hmm but under what circumstances? I would just say unsupported value >> not mention >> mTHP or people who don't use mTHP might find that confusing. > > Well, we can check whether any mTHP size is enabled while the value is > set to something unexpected. We can then even print the problematic > sizes if we have to. > > We could also just just say that if the value is set to something else > than 511 (which is the default), it will be treated as being "0" when > collapsing mthp, instead of doing any scaling. >
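For reference, the check Baolin walks through boils down to something like the following ('threshold_bits' and 'bits_set' are the names used in the series; the scaling arithmetic here is an assumption, simplified for illustration):

static bool mthp_collapse_worthwhile(unsigned int bits_set,
                                     unsigned int nr_chunks,
                                     unsigned int max_ptes_none)
{
        /*
         * Bar derived from the PMD-relative tunable, scaled to this region:
         * the collapse is only attempted when strictly more chunks than
         * this are utilized.
         */
        unsigned int threshold_bits =
                nr_chunks * (HPAGE_PMD_NR - max_ptes_none - 1) / HPAGE_PMD_NR;

        return bits_set > threshold_bits;
}

Plugging in the numbers from the example (max_ptes_none = 255, a 2-chunk order-3 region, 1 chunk set) gives threshold_bits = 1 and bits_set = 1, so the order-3 attempt is skipped and the collapse stops at order-2.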
On 28/08/25 3:16 pm, Baolin Wang wrote: > (Sorry for chiming in late) > > On 2025/8/22 22:10, David Hildenbrand wrote: >>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>> but not sure >>>> if we have to add that for now. >>> >>> Yeah not so sure about this, this is a 'just have to know' too, and >>> yes you >>> might add it to the docs, but people are going to be mightily >>> confused, esp if >>> it's a calculated value. >>> >>> I don't see any other way around having a separate tunable if we >>> don't just have >>> something VERY simple like on/off. >> >> Yeah, not advocating that we add support for other values than 0/511, >> really. >> >>> >>> Also the mentioned issue sounds like something that needs to be >>> fixed elsewhere >>> honestly in the algorithm used to figure out mTHP ranges (I may be >>> wrong - and >>> happy to stand corrected if this is somehow inherent, but reallly >>> feels that >>> way). >> >> I think the creep is unavoidable for certain values. >> >> If you have the first two pages of a PMD area populated, and you >> allow for at least half of the #PTEs to be non/zero, you'd collapse >> first a >> order-2 folio, then and order-3 ... until you reached PMD order. >> >> So for now we really should just support 0 / 511 to say "don't >> collapse if there are holes" vs. "always collapse if there is at >> least one pte used". > > If we only allow setting 0 or 511, as Nico mentioned before, "At 511, > no mTHP collapses would ever occur anyway, unless you have 2MB > disabled and other mTHP sizes enabled. Technically, at 511, only the > highest enabled order would ever be collapsed." I didn't understand this statement. At 511, mTHP collapses will occur if khugepaged cannot get a PMD folio. Our goal is to collapse to the highest order folio. > > In other words, for the scenario you described, although there are > only 2 PTEs present in a PMD, it would still get collapsed into a > PMD-sized THP. In reality, what we probably need is just an order-2 > mTHP collapse. > > If 'khugepaged_max_ptes_none' is set to 255, I think this would > achieve the desired result: when there are only 2 PTEs present in a > PMD, an order-2 mTHP collapse would be successed, but it wouldn’t > creep up to an order-3 mTHP collapse. That’s because: > When attempting an order-3 mTHP collapse, 'threshold_bits' = 1, while > 'bits_set' = 1 (means only 1 chunk is present), so 'bits_set > > threshold_bits' is false, then an order-3 mTHP collapse wouldn’t be > attempted. No? > > So I have some concerns that if we only allow setting 0 or 511, it may > not meet the goal we have for mTHP collapsing. > >>>> Because, as raised in the past, I'm afraid nobody on this earth has >>>> a clue how >>>> to set this parameter to values different to 0 (don't waste memory >>>> with khugepaged) >>>> and 511 (page fault behavior). >>> >>> Yup >>> >>>> >>>> >>>> If any other value is set, essentially >>>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); >>>> >>>> for now and just disable it. >>> >>> Hmm but under what circumstances? I would just say unsupported value >>> not mention >>> mTHP or people who don't use mTHP might find that confusing. >> >> Well, we can check whether any mTHP size is enabled while the value >> is set to something unexpected. We can then even print the >> problematic sizes if we have to. 
>> >> We could also just just say that if the value is set to something >> else than 511 (which is the default), it will be treated as being "0" >> when collapsing mthp, instead of doing any scaling. >> >
On 2025/8/28 18:48, Dev Jain wrote: > > On 28/08/25 3:16 pm, Baolin Wang wrote: >> (Sorry for chiming in late) >> >> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>> but not sure >>>>> if we have to add that for now. >>>> >>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>> yes you >>>> might add it to the docs, but people are going to be mightily >>>> confused, esp if >>>> it's a calculated value. >>>> >>>> I don't see any other way around having a separate tunable if we >>>> don't just have >>>> something VERY simple like on/off. >>> >>> Yeah, not advocating that we add support for other values than 0/511, >>> really. >>> >>>> >>>> Also the mentioned issue sounds like something that needs to be >>>> fixed elsewhere >>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>> wrong - and >>>> happy to stand corrected if this is somehow inherent, but reallly >>>> feels that >>>> way). >>> >>> I think the creep is unavoidable for certain values. >>> >>> If you have the first two pages of a PMD area populated, and you >>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>> first a >>> order-2 folio, then and order-3 ... until you reached PMD order. >>> >>> So for now we really should just support 0 / 511 to say "don't >>> collapse if there are holes" vs. "always collapse if there is at >>> least one pte used". >> >> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >> no mTHP collapses would ever occur anyway, unless you have 2MB >> disabled and other mTHP sizes enabled. Technically, at 511, only the >> highest enabled order would ever be collapsed." > I didn't understand this statement. At 511, mTHP collapses will occur if > khugepaged cannot get a PMD folio. Our goal is to collapse to the > highest order folio. Yes, I’m not saying that it’s incorrect behavior when set to 511. What I mean is, as in the example I gave below, users may only want to allow a large order collapse when the number of present PTEs reaches half of the large folio, in order to avoid RSS bloat. So we might also need to consider whether 255 is a reasonable configuration for mTHP collapse. >> In other words, for the scenario you described, although there are >> only 2 PTEs present in a PMD, it would still get collapsed into a PMD- >> sized THP. In reality, what we probably need is just an order-2 mTHP >> collapse. >> >> If 'khugepaged_max_ptes_none' is set to 255, I think this would >> achieve the desired result: when there are only 2 PTEs present in a >> PMD, an order-2 mTHP collapse would be successed, but it wouldn’t >> creep up to an order-3 mTHP collapse. That’s because: >> When attempting an order-3 mTHP collapse, 'threshold_bits' = 1, while >> 'bits_set' = 1 (means only 1 chunk is present), so 'bits_set > >> threshold_bits' is false, then an order-3 mTHP collapse wouldn’t be >> attempted. No? >> >> So I have some concerns that if we only allow setting 0 or 511, it may >> not meet the goal we have for mTHP collapsing. >> >>>>> Because, as raised in the past, I'm afraid nobody on this earth has >>>>> a clue how >>>>> to set this parameter to values different to 0 (don't waste memory >>>>> with khugepaged) >>>>> and 511 (page fault behavior). >>>> >>>> Yup >>>> >>>>> >>>>> >>>>> If any other value is set, essentially >>>>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); >>>>> >>>>> for now and just disable it. 
>>>> >>>> Hmm but under what circumstances? I would just say unsupported value >>>> not mention >>>> mTHP or people who don't use mTHP might find that confusing. >>> >>> Well, we can check whether any mTHP size is enabled while the value >>> is set to something unexpected. We can then even print the >>> problematic sizes if we have to. >>> >>> We could also just just say that if the value is set to something >>> else than 511 (which is the default), it will be treated as being "0" >>> when collapsing mthp, instead of doing any scaling. >>> >>
On 29.08.25 03:55, Baolin Wang wrote: > > > On 2025/8/28 18:48, Dev Jain wrote: >> >> On 28/08/25 3:16 pm, Baolin Wang wrote: >>> (Sorry for chiming in late) >>> >>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>> but not sure >>>>>> if we have to add that for now. >>>>> >>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>> yes you >>>>> might add it to the docs, but people are going to be mightily >>>>> confused, esp if >>>>> it's a calculated value. >>>>> >>>>> I don't see any other way around having a separate tunable if we >>>>> don't just have >>>>> something VERY simple like on/off. >>>> >>>> Yeah, not advocating that we add support for other values than 0/511, >>>> really. >>>> >>>>> >>>>> Also the mentioned issue sounds like something that needs to be >>>>> fixed elsewhere >>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>> wrong - and >>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>> feels that >>>>> way). >>>> >>>> I think the creep is unavoidable for certain values. >>>> >>>> If you have the first two pages of a PMD area populated, and you >>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>> first a >>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>> >>>> So for now we really should just support 0 / 511 to say "don't >>>> collapse if there are holes" vs. "always collapse if there is at >>>> least one pte used". >>> >>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>> no mTHP collapses would ever occur anyway, unless you have 2MB >>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>> highest enabled order would ever be collapsed." >> I didn't understand this statement. At 511, mTHP collapses will occur if >> khugepaged cannot get a PMD folio. Our goal is to collapse to the >> highest order folio. > > Yes, I’m not saying that it’s incorrect behavior when set to 511. What I > mean is, as in the example I gave below, users may only want to allow a > large order collapse when the number of present PTEs reaches half of the > large folio, in order to avoid RSS bloat. How do these users control allocation at fault time where this parameter is completely ignored? -- Cheers David / dhildenb
On 2025/9/2 00:46, David Hildenbrand wrote: > On 29.08.25 03:55, Baolin Wang wrote: >> >> >> On 2025/8/28 18:48, Dev Jain wrote: >>> >>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>> (Sorry for chiming in late) >>>> >>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>> but not sure >>>>>>> if we have to add that for now. >>>>>> >>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>> yes you >>>>>> might add it to the docs, but people are going to be mightily >>>>>> confused, esp if >>>>>> it's a calculated value. >>>>>> >>>>>> I don't see any other way around having a separate tunable if we >>>>>> don't just have >>>>>> something VERY simple like on/off. >>>>> >>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>> really. >>>>> >>>>>> >>>>>> Also the mentioned issue sounds like something that needs to be >>>>>> fixed elsewhere >>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>> wrong - and >>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>> feels that >>>>>> way). >>>>> >>>>> I think the creep is unavoidable for certain values. >>>>> >>>>> If you have the first two pages of a PMD area populated, and you >>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>> first a >>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>> >>>>> So for now we really should just support 0 / 511 to say "don't >>>>> collapse if there are holes" vs. "always collapse if there is at >>>>> least one pte used". >>>> >>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>> highest enabled order would ever be collapsed." >>> I didn't understand this statement. At 511, mTHP collapses will occur if >>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>> highest order folio. >> >> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >> mean is, as in the example I gave below, users may only want to allow a >> large order collapse when the number of present PTEs reaches half of the >> large folio, in order to avoid RSS bloat. > > How do these users control allocation at fault time where this parameter > is completely ignored? Sorry, I did not get your point. Why does the 'max_pte_none' need to control allocation at fault time? Could you be more specific? Thanks.
On 02.09.25 04:28, Baolin Wang wrote: > > > On 2025/9/2 00:46, David Hildenbrand wrote: >> On 29.08.25 03:55, Baolin Wang wrote: >>> >>> >>> On 2025/8/28 18:48, Dev Jain wrote: >>>> >>>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>>> (Sorry for chiming in late) >>>>> >>>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>>> but not sure >>>>>>>> if we have to add that for now. >>>>>>> >>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>>> yes you >>>>>>> might add it to the docs, but people are going to be mightily >>>>>>> confused, esp if >>>>>>> it's a calculated value. >>>>>>> >>>>>>> I don't see any other way around having a separate tunable if we >>>>>>> don't just have >>>>>>> something VERY simple like on/off. >>>>>> >>>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>>> really. >>>>>> >>>>>>> >>>>>>> Also the mentioned issue sounds like something that needs to be >>>>>>> fixed elsewhere >>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>>> wrong - and >>>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>>> feels that >>>>>>> way). >>>>>> >>>>>> I think the creep is unavoidable for certain values. >>>>>> >>>>>> If you have the first two pages of a PMD area populated, and you >>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>>> first a >>>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>>> >>>>>> So for now we really should just support 0 / 511 to say "don't >>>>>> collapse if there are holes" vs. "always collapse if there is at >>>>>> least one pte used". >>>>> >>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>>> highest enabled order would ever be collapsed." >>>> I didn't understand this statement. At 511, mTHP collapses will occur if >>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>>> highest order folio. >>> >>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >>> mean is, as in the example I gave below, users may only want to allow a >>> large order collapse when the number of present PTEs reaches half of the >>> large folio, in order to avoid RSS bloat. >> >> How do these users control allocation at fault time where this parameter >> is completely ignored? > > Sorry, I did not get your point. Why does the 'max_pte_none' need to > control allocation at fault time? Could you be more specific? Thanks. The comment over khugepaged_max_ptes_none gives a hint: /* * default collapse hugepages if there is at least one pte mapped like * it would have happened if the vma was large enough during page * fault. * * Note that these are only respected if collapse was initiated by khugepaged. */ In the common case (for anything that really cares about RSS bloat) you will just a get a THP during page fault and consequently RSS bloat. As raised in my other reply, the only documented reason to set max_ptes_none=0 seems to be when an application later (after once possibly getting a THP already during page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using MADV_COLLAPSE. It's a questionable use case, that already got more problematic with mTHP and page table reclaim. 
Let me explain: Before mTHP, if someone would MADV_DONTNEED (resulting in a page table with at least one pte_none entry), there would have been no way we would get memory over-allocated afterwards with max_ptes_none=0. (1) Page faults would spot "there is a page table" and just fall back to order-0 pages. (2) khugepaged was told to not collapse through max_ptes_none=0. But now: (A) With mTHP during page-faults, we can just end up over-allocating memory in such an area again: page faults will simply spot a bunch of pte_nones around the fault area and install an mTHP. (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the page table. The next page fault will just try installing a PMD THP again, because there is no PTE table anymore. So I question the utility of max_ptes_none. If you can't tame page faults, then there is only limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. -- Cheers David / dhildenb
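From the application's side, the pattern being discussed looks roughly like this (userspace illustration only; whether the later fault installs an mTHP or a PMD THP depends on the enabled sizes and on page table reclaim, as described above):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2 * 1024 * 1024;           /* one PMD-sized region */
        char *p = aligned_alloc(len, len);

        if (!p)
                return 1;
        memset(p, 1, len);                      /* may fault in a PMD THP */
        madvise(p + 4096, len - 4096, MADV_DONTNEED);   /* give most of it back */
        p[2 * 4096] = 1;                        /* a later fault may pull in large folios again */
        free(p);
        return 0;
}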
On 02/09/2025 10:03, David Hildenbrand wrote: > On 02.09.25 04:28, Baolin Wang wrote: >> >> >> On 2025/9/2 00:46, David Hildenbrand wrote: >>> On 29.08.25 03:55, Baolin Wang wrote: >>>> >>>> >>>> On 2025/8/28 18:48, Dev Jain wrote: >>>>> >>>>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>>>> (Sorry for chiming in late) >>>>>> >>>>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>>>> but not sure >>>>>>>>> if we have to add that for now. >>>>>>>> >>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>>>> yes you >>>>>>>> might add it to the docs, but people are going to be mightily >>>>>>>> confused, esp if >>>>>>>> it's a calculated value. >>>>>>>> >>>>>>>> I don't see any other way around having a separate tunable if we >>>>>>>> don't just have >>>>>>>> something VERY simple like on/off. >>>>>>> >>>>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>>>> really. >>>>>>> >>>>>>>> >>>>>>>> Also the mentioned issue sounds like something that needs to be >>>>>>>> fixed elsewhere >>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>>>> wrong - and >>>>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>>>> feels that >>>>>>>> way). >>>>>>> >>>>>>> I think the creep is unavoidable for certain values. >>>>>>> >>>>>>> If you have the first two pages of a PMD area populated, and you >>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>>>> first a >>>>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>>>> >>>>>>> So for now we really should just support 0 / 511 to say "don't >>>>>>> collapse if there are holes" vs. "always collapse if there is at >>>>>>> least one pte used". >>>>>> >>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>>>> highest enabled order would ever be collapsed." >>>>> I didn't understand this statement. At 511, mTHP collapses will occur if >>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>>>> highest order folio. >>>> >>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >>>> mean is, as in the example I gave below, users may only want to allow a >>>> large order collapse when the number of present PTEs reaches half of the >>>> large folio, in order to avoid RSS bloat. >>> >>> How do these users control allocation at fault time where this parameter >>> is completely ignored? >> >> Sorry, I did not get your point. Why does the 'max_pte_none' need to >> control allocation at fault time? Could you be more specific? Thanks. > > The comment over khugepaged_max_ptes_none gives a hint: > > /* > * default collapse hugepages if there is at least one pte mapped like > * it would have happened if the vma was large enough during page > * fault. > * > * Note that these are only respected if collapse was initiated by khugepaged. > */ > > In the common case (for anything that really cares about RSS bloat) you will just a > get a THP during page fault and consequently RSS bloat. 
> > As raised in my other reply, the only documented reason to set max_ptes_none=0 seems > to be when an application later (after once possibly getting a THP already during > page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using > MADV_COLLAPSE. > > It's a questionable use case, that already got more problematic with mTHP and page > table reclaim. > > Let me explain: > > Before mTHP, if someone would MADV_DONTNEED (resulting in > a page table with at least one pte_none entry), there would have been no way we would > get memory over-allocated afterwards with max_ptes_none=0. > > (1) Page faults would spot "there is a page table" and just fallback to order-0 pages. > (2) khugepaged was told to not collapse through max_ptes_none=0. > > But now: > > (A) With mTHP during page-faults, we can just end up over-allocating memory in such > an area again: page faults will simply spot a bunch of pte_nones around the fault area > and install an mTHP. > > (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the > page table. The next page fault will just try installing a PMD THP again, because there is > no PTE table anymore. > > So I question the utility of max_ptes_none. If you can't tame page faults, then there is only > limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some > corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. > > For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and will break down those hugepages and free up zero-filled memory. I have seen in our prod workloads where the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits of THPs like lower TLB misses. I do agree that the value of max_ptes_none is magical and different workloads can react very differently to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean that the memory regression of using THP=always vs THP=madvise is halved.
On 02.09.25 12:34, Usama Arif wrote: > > > On 02/09/2025 10:03, David Hildenbrand wrote: >> On 02.09.25 04:28, Baolin Wang wrote: >>> >>> >>> On 2025/9/2 00:46, David Hildenbrand wrote: >>>> On 29.08.25 03:55, Baolin Wang wrote: >>>>> >>>>> >>>>> On 2025/8/28 18:48, Dev Jain wrote: >>>>>> >>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>>>>> (Sorry for chiming in late) >>>>>>> >>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>>>>> but not sure >>>>>>>>>> if we have to add that for now. >>>>>>>>> >>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>>>>> yes you >>>>>>>>> might add it to the docs, but people are going to be mightily >>>>>>>>> confused, esp if >>>>>>>>> it's a calculated value. >>>>>>>>> >>>>>>>>> I don't see any other way around having a separate tunable if we >>>>>>>>> don't just have >>>>>>>>> something VERY simple like on/off. >>>>>>>> >>>>>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>>>>> really. >>>>>>>> >>>>>>>>> >>>>>>>>> Also the mentioned issue sounds like something that needs to be >>>>>>>>> fixed elsewhere >>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>>>>> wrong - and >>>>>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>>>>> feels that >>>>>>>>> way). >>>>>>>> >>>>>>>> I think the creep is unavoidable for certain values. >>>>>>>> >>>>>>>> If you have the first two pages of a PMD area populated, and you >>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>>>>> first a >>>>>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>>>>> >>>>>>>> So for now we really should just support 0 / 511 to say "don't >>>>>>>> collapse if there are holes" vs. "always collapse if there is at >>>>>>>> least one pte used". >>>>>>> >>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>>>>> highest enabled order would ever be collapsed." >>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if >>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>>>>> highest order folio. >>>>> >>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >>>>> mean is, as in the example I gave below, users may only want to allow a >>>>> large order collapse when the number of present PTEs reaches half of the >>>>> large folio, in order to avoid RSS bloat. >>>> >>>> How do these users control allocation at fault time where this parameter >>>> is completely ignored? >>> >>> Sorry, I did not get your point. Why does the 'max_pte_none' need to >>> control allocation at fault time? Could you be more specific? Thanks. >> >> The comment over khugepaged_max_ptes_none gives a hint: >> >> /* >> * default collapse hugepages if there is at least one pte mapped like >> * it would have happened if the vma was large enough during page >> * fault. >> * >> * Note that these are only respected if collapse was initiated by khugepaged. >> */ >> >> In the common case (for anything that really cares about RSS bloat) you will just a >> get a THP during page fault and consequently RSS bloat. 
>> >> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems >> to be when an application later (after once possibly getting a THP already during >> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using >> MADV_COLLAPSE. >> >> It's a questionable use case, that already got more problematic with mTHP and page >> table reclaim. >> >> Let me explain: >> >> Before mTHP, if someone would MADV_DONTNEED (resulting in >> a page table with at least one pte_none entry), there would have been no way we would >> get memory over-allocated afterwards with max_ptes_none=0. >> >> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages. >> (2) khugepaged was told to not collapse through max_ptes_none=0. >> >> But now: >> >> (A) With mTHP during page-faults, we can just end up over-allocating memory in such >> an area again: page faults will simply spot a bunch of pte_nones around the fault area >> and install an mTHP. >> >> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the >> page table. The next page fault will just try installing a PMD THP again, because there is >> no PTE table anymore. >> >> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. >> >> > > For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter > memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and > will break down those hugepages and free up zero-filled memory. You are not really taming page faults, though, you are undoing what page faults might have messed up :) I have seen in our prod workloads where > the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, > the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits > of THPs like lower TLB misses. Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes zero pages. > > I do agree that the value of max_ptes_none is magical and different workloads can react very differently > to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean > that the memory regression of using THP=always vs THP=madvise is halved. To which value would you set it? Just 510? 0? -- Cheers David / dhildenb
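The anti-thrashing relationship referred to here can be sketched as follows (helper name illustrative; the comparison mirrors how the underused-THP shrinker reuses the same tunable, with the exact form assumed):

static bool thp_worth_splitting(unsigned int nr_zero_filled)
{
        /* at the default (511) the shrinker leaves collapsed THPs alone */
        if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
                return false;

        /*
         * Split only when more subpages are zero-filled than khugepaged
         * would tolerate as "none" for a collapse, so khugepaged will not
         * simply re-collapse what the shrinker just split.
         */
        return nr_zero_filled > khugepaged_max_ptes_none;
}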
On 02/09/2025 12:03, David Hildenbrand wrote: > On 02.09.25 12:34, Usama Arif wrote: >> >> >> On 02/09/2025 10:03, David Hildenbrand wrote: >>> On 02.09.25 04:28, Baolin Wang wrote: >>>> >>>> >>>> On 2025/9/2 00:46, David Hildenbrand wrote: >>>>> On 29.08.25 03:55, Baolin Wang wrote: >>>>>> >>>>>> >>>>>> On 2025/8/28 18:48, Dev Jain wrote: >>>>>>> >>>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>>>>>> (Sorry for chiming in late) >>>>>>>> >>>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>>>>>> but not sure >>>>>>>>>>> if we have to add that for now. >>>>>>>>>> >>>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>>>>>> yes you >>>>>>>>>> might add it to the docs, but people are going to be mightily >>>>>>>>>> confused, esp if >>>>>>>>>> it's a calculated value. >>>>>>>>>> >>>>>>>>>> I don't see any other way around having a separate tunable if we >>>>>>>>>> don't just have >>>>>>>>>> something VERY simple like on/off. >>>>>>>>> >>>>>>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>>>>>> really. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Also the mentioned issue sounds like something that needs to be >>>>>>>>>> fixed elsewhere >>>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>>>>>> wrong - and >>>>>>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>>>>>> feels that >>>>>>>>>> way). >>>>>>>>> >>>>>>>>> I think the creep is unavoidable for certain values. >>>>>>>>> >>>>>>>>> If you have the first two pages of a PMD area populated, and you >>>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>>>>>> first a >>>>>>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>>>>>> >>>>>>>>> So for now we really should just support 0 / 511 to say "don't >>>>>>>>> collapse if there are holes" vs. "always collapse if there is at >>>>>>>>> least one pte used". >>>>>>>> >>>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>>>>>> highest enabled order would ever be collapsed." >>>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if >>>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>>>>>> highest order folio. >>>>>> >>>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >>>>>> mean is, as in the example I gave below, users may only want to allow a >>>>>> large order collapse when the number of present PTEs reaches half of the >>>>>> large folio, in order to avoid RSS bloat. >>>>> >>>>> How do these users control allocation at fault time where this parameter >>>>> is completely ignored? >>>> >>>> Sorry, I did not get your point. Why does the 'max_pte_none' need to >>>> control allocation at fault time? Could you be more specific? Thanks. >>> >>> The comment over khugepaged_max_ptes_none gives a hint: >>> >>> /* >>> * default collapse hugepages if there is at least one pte mapped like >>> * it would have happened if the vma was large enough during page >>> * fault. >>> * >>> * Note that these are only respected if collapse was initiated by khugepaged. >>> */ >>> >>> In the common case (for anything that really cares about RSS bloat) you will just a >>> get a THP during page fault and consequently RSS bloat. 
>>> >>> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems >>> to be when an application later (after once possibly getting a THP already during >>> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using >>> MADV_COLLAPSE. >>> >>> It's a questionable use case, that already got more problematic with mTHP and page >>> table reclaim. >>> >>> Let me explain: >>> >>> Before mTHP, if someone would MADV_DONTNEED (resulting in >>> a page table with at least one pte_none entry), there would have been no way we would >>> get memory over-allocated afterwards with max_ptes_none=0. >>> >>> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages. >>> (2) khugepaged was told to not collapse through max_ptes_none=0. >>> >>> But now: >>> >>> (A) With mTHP during page-faults, we can just end up over-allocating memory in such >>> an area again: page faults will simply spot a bunch of pte_nones around the fault area >>> and install an mTHP. >>> >>> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the >>> page table. The next page fault will just try installing a PMD THP again, because there is >>> no PTE table anymore. >>> >>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. >>> >>> >> >> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter >> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and >> will break down those hugepages and free up zero-filled memory. > > You are not really taming page faults, though, you are undoing what page faults might have messed up :) > > I have seen in our prod workloads where >> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, >> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits >> of THPs like lower TLB misses. > > Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. > Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. > There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes > zero pages. > >> >> I do agree that the value of max_ptes_none is magical and different workloads can react very differently >> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean >> that the memory regression of using THP=always vs THP=madvise is halved. > > To which value would you set it? Just 510? 0? > There are some very large workloads in the meta fleet that I experimented with and found that having a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal comprimise in terms of application metrics improving, having an acceptable amount of memory regression and improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out there for these workloads, but not possible to experiment with every value. 
In terms of wider rollout across the fleet, we are going to target 0 (or a very very small value) when moving from THP=madvise to always. Mainly because it is the least likely to cause a memory regression, as the THP shrinker will deal with page faults faulting in mostly zero-filled pages and khugepaged won't collapse pages that are dominated by 4K zero-filled chunks.
On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 02/09/2025 12:03, David Hildenbrand wrote: > > On 02.09.25 12:34, Usama Arif wrote: > >> > >> > >> On 02/09/2025 10:03, David Hildenbrand wrote: > >>> On 02.09.25 04:28, Baolin Wang wrote: > >>>> > >>>> > >>>> On 2025/9/2 00:46, David Hildenbrand wrote: > >>>>> On 29.08.25 03:55, Baolin Wang wrote: > >>>>>> > >>>>>> > >>>>>> On 2025/8/28 18:48, Dev Jain wrote: > >>>>>>> > >>>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote: > >>>>>>>> (Sorry for chiming in late) > >>>>>>>> > >>>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote: > >>>>>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), > >>>>>>>>>>> but not sure > >>>>>>>>>>> if we have to add that for now. > >>>>>>>>>> > >>>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and > >>>>>>>>>> yes you > >>>>>>>>>> might add it to the docs, but people are going to be mightily > >>>>>>>>>> confused, esp if > >>>>>>>>>> it's a calculated value. > >>>>>>>>>> > >>>>>>>>>> I don't see any other way around having a separate tunable if we > >>>>>>>>>> don't just have > >>>>>>>>>> something VERY simple like on/off. > >>>>>>>>> > >>>>>>>>> Yeah, not advocating that we add support for other values than 0/511, > >>>>>>>>> really. > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Also the mentioned issue sounds like something that needs to be > >>>>>>>>>> fixed elsewhere > >>>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be > >>>>>>>>>> wrong - and > >>>>>>>>>> happy to stand corrected if this is somehow inherent, but reallly > >>>>>>>>>> feels that > >>>>>>>>>> way). > >>>>>>>>> > >>>>>>>>> I think the creep is unavoidable for certain values. > >>>>>>>>> > >>>>>>>>> If you have the first two pages of a PMD area populated, and you > >>>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse > >>>>>>>>> first a > >>>>>>>>> order-2 folio, then and order-3 ... until you reached PMD order. > >>>>>>>>> > >>>>>>>>> So for now we really should just support 0 / 511 to say "don't > >>>>>>>>> collapse if there are holes" vs. "always collapse if there is at > >>>>>>>>> least one pte used". > >>>>>>>> > >>>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, > >>>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB > >>>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the > >>>>>>>> highest enabled order would ever be collapsed." > >>>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if > >>>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the > >>>>>>> highest order folio. > >>>>>> > >>>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I > >>>>>> mean is, as in the example I gave below, users may only want to allow a > >>>>>> large order collapse when the number of present PTEs reaches half of the > >>>>>> large folio, in order to avoid RSS bloat. > >>>>> > >>>>> How do these users control allocation at fault time where this parameter > >>>>> is completely ignored? > >>>> > >>>> Sorry, I did not get your point. Why does the 'max_pte_none' need to > >>>> control allocation at fault time? Could you be more specific? Thanks. > >>> > >>> The comment over khugepaged_max_ptes_none gives a hint: > >>> > >>> /* > >>> * default collapse hugepages if there is at least one pte mapped like > >>> * it would have happened if the vma was large enough during page > >>> * fault. 
> >>> * > >>> * Note that these are only respected if collapse was initiated by khugepaged. > >>> */ > >>> > >>> In the common case (for anything that really cares about RSS bloat) you will just a > >>> get a THP during page fault and consequently RSS bloat. > >>> > >>> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems > >>> to be when an application later (after once possibly getting a THP already during > >>> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using > >>> MADV_COLLAPSE. > >>> > >>> It's a questionable use case, that already got more problematic with mTHP and page > >>> table reclaim. > >>> > >>> Let me explain: > >>> > >>> Before mTHP, if someone would MADV_DONTNEED (resulting in > >>> a page table with at least one pte_none entry), there would have been no way we would > >>> get memory over-allocated afterwards with max_ptes_none=0. > >>> > >>> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages. > >>> (2) khugepaged was told to not collapse through max_ptes_none=0. > >>> > >>> But now: > >>> > >>> (A) With mTHP during page-faults, we can just end up over-allocating memory in such > >>> an area again: page faults will simply spot a bunch of pte_nones around the fault area > >>> and install an mTHP. > >>> > >>> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the > >>> page table. The next page fault will just try installing a PMD THP again, because there is > >>> no PTE table anymore. > >>> > >>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only > >>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some > >>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. > >>> > >>> > >> > >> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter > >> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and > >> will break down those hugepages and free up zero-filled memory. > > > > You are not really taming page faults, though, you are undoing what page faults might have messed up :) > > > > I have seen in our prod workloads where > >> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, > >> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits > >> of THPs like lower TLB misses. > > > > Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. > > > > Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. I believe with mTHP support in khugepaged, the max_ptes_none value in the shrinker must also leverage the 'order' scaling to properly prevent thrashing. I've been testing a patch for this that I might include in the V11. > > > There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes > > zero pages. > > > >> > >> I do agree that the value of max_ptes_none is magical and different workloads can react very differently > >> to it. The relationship is definitely not linear. i.e. 
if I use max_ptes_none = 256, it does not mean > >> that the memory regression of using THP=always vs THP=madvise is halved. > > > > To which value would you set it? Just 510? 0? > > > > There are some very large workloads in the meta fleet that I experimented with and found that having > a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal > comprimise in terms of application metrics improving, having an acceptable amount of memory regression and > improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out > there for these workloads, but not possible to experiment with every value. > > In terms of wider rollout across the fleet, we are going to target 0 (or a very very small value) > when moving from THP=madvise to always. Mainly because it is the least likely to cause a memory regression as > THP shrinker will deal with page faults faulting in mostly zero-filled pages and khugepaged wont collapse > pages that are dominated by 4K zero-filled chunks. >
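What Nico describes testing would presumably look something like the sketch below, reusing the same order scaling on the shrinker side (entirely hypothetical, not taken from any posted patch):

static bool mthp_worth_splitting(struct folio *folio,
                                 unsigned int nr_zero_filled)
{
        unsigned int order = folio_order(folio);
        unsigned int max_none = khugepaged_max_ptes_none;

        if (max_none == HPAGE_PMD_NR - 1)
                return false;

        /* scale the PMD-relative tunable to this folio's order */
        if (order != HPAGE_PMD_ORDER)
                max_none >>= HPAGE_PMD_ORDER - order;

        return nr_zero_filled > max_none;
}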
On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote: > On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@gmail.com> wrote: > > >>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only > > >>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some > > >>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. > > >>> > > >>> > > >> > > >> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter > > >> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and > > >> will break down those hugepages and free up zero-filled memory. > > > > > > You are not really taming page faults, though, you are undoing what page faults might have messed up :) > > > > > > I have seen in our prod workloads where > > >> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, > > >> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits > > >> of THPs like lower TLB misses. > > > > > > Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. > > > > > > > Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. > I believe with mTHP support in khugepaged, the max_ptes_none value in > the shrinker must also leverage the 'order' scaling to properly > prevent thrashing. No please do not extend this 'scalling' stuff somewhere else, it's really horrid. We have to find an alternative to that, it's extremely confusing in what is already extremely confusing THP code. As I said before, if we can't have a boolean we need another interface, which makes most sense to be a ratio or in practice, a percentage sysctl. Speaking with David off-list, maybe the answer - if we must have this - is to add a new percentage interface and have this in lock-step with the existing max_ptes_none interface. One updates the other, but internally we're just using the percentage value. > I've been testing a patch for this that I might include in the V11. > > > > > There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes > > > zero pages. > > > > > >> > > >> I do agree that the value of max_ptes_none is magical and different workloads can react very differently > > >> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean > > >> that the memory regression of using THP=always vs THP=madvise is halved. > > > > > > To which value would you set it? Just 510? 0? > > > > > > > There are some very large workloads in the meta fleet that I experimented with and found that having > > a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal > > comprimise in terms of application metrics improving, having an acceptable amount of memory regression and > > improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out > > there for these workloads, but not possible to experiment with every value. (->Usama) It's a pity that such workloads exist. But then the percentage solution should work. 
> > > > In terms of wider rollout across the fleet, we are going to target 0 (or a very very small value) > > when moving from THP=madvise to always. Mainly because it is the least likely to cause a memory regression as > > THP shrinker will deal with page faults faulting in mostly zero-filled pages and khugepaged won't collapse > > pages that are dominated by 4K zero-filled chunks. > > (->Usama) Interesting though that you've decided against doing this fleetwide... I wonder then again whether we truly need non-boolean values. But the fact that workloads might theoretically exist where it's useful does make me think we have to have this, sadly. Cheers, Lorenzo
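To make the lock-step proposal concrete, here is a minimal standalone sketch of the conversion it implies. The idea of a percentage knob mirrored onto max_ptes_none comes from the mail above; the helper names and rounding choices below are illustrative assumptions, not code from the series.

#include <stdio.h>

#define HPAGE_PMD_NR 512	/* PTEs per PMD on x86-64 with 4K pages */

/*
 * Hypothetical lock-step conversion: writing either knob would update the
 * other, but internally only the percentage is used for decisions.
 */
static unsigned int ptes_to_percent(unsigned int max_ptes_none)
{
	return max_ptes_none * 100 / (HPAGE_PMD_NR - 1);
}

static unsigned int percent_to_ptes(unsigned int percent)
{
	return percent * (HPAGE_PMD_NR - 1) / 100;
}

int main(void)
{
	unsigned int vals[] = { 0, 51, 255, 256, 511 };

	for (unsigned int i = 0; i < 5; i++)
		printf("max_ptes_none=%3u -> %3u%%\n",
		       vals[i], ptes_to_percent(vals[i]));
	printf("50%% -> max_ptes_none=%u\n", percent_to_ptes(50));
	return 0;
}

With integer rounding, 50% maps back to 255 (the HPAGE_PMD_NR / 2 - 1 value discussed elsewhere in the thread) and 511 maps to 100%, so the two views stay consistent at the values people actually argue about.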
On 05.09.25 13:48, Lorenzo Stoakes wrote: > On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote: >> On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >>>>>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >>>>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. >>>>>> >>>>>> >>>>> >>>>> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter >>>>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and >>>>> will break down those hugepages and free up zero-filled memory. >>>> >>>> You are not really taming page faults, though, you are undoing what page faults might have messed up :) >>>> >>>> I have seen in our prod workloads where >>>>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, >>>>> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits >>>>> of THPs like lower TLB misses. >>>> >>>> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. >>>> >>> >>> Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. >> I believe with mTHP support in khugepaged, the max_ptes_none value in >> the shrinker must also leverage the 'order' scaling to properly >> prevent thrashing. > > No please do not extend this 'scalling' stuff somewhere else, it's really horrid. > > We have to find an alternative to that, it's extremely confusing in what is > already extremely confusing THP code. > > As I said before, if we can't have a boolean we need another interface, which > makes most sense to be a ratio or in practice, a percentage sysctl. > > Speaking with David off-list, maybe the answer - if we must have this - is to > add a new percentage interface and have this in lock-step with the existing > max_ptes_none interface. One updates the other, but internally we're just using > the percentage value. Yes, I'll try hacking something up and sending it as an RFC. > >> I've been testing a patch for this that I might include in the V11. >>> >>>> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes >>>> zero pages. >>>> >>>>> >>>>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently >>>>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean >>>>> that the memory regression of using THP=always vs THP=madvise is halved. >>>> >>>> To which value would you set it? Just 510? 0? Sorry, I missed Usama's reply. Thanks Usama! >>>> >>> >>> There are some very large workloads in the meta fleet that I experimented with and found that having >>> a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal >>> comprimise in terms of application metrics improving, having an acceptable amount of memory regression and >>> improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out >>> there for these workloads, but not possible to experiment with every value. 
> > (->Usama) It's a pity that such workloads exist. But then the percentage solution should work. Good. So if there is no strong case for > 255, that's already valuable for mTHP. -- Cheers David / dhildenb
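For readers following along, the 'order scaling' being objected to works roughly as follows. The shift-by-order-delta formula is my reading of the cover letter ("max_ptes_none will be scaled by the attempted collapse order"), so treat this standalone sketch as illustrative rather than a copy of the patches.

#include <stdio.h>

#define HPAGE_PMD_ORDER 9	/* 512 PTEs per PMD */

/*
 * Scale the PMD-level knob down to a smaller collapse order so the
 * permitted "empty" fraction stays roughly constant.
 */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	const unsigned int max_ptes_none = 511;	/* current default */

	for (unsigned int order = 2; order <= HPAGE_PMD_ORDER; order++)
		printf("order %u (%3u pages): up to %3u none/zero PTEs allowed\n",
		       order, 1u << order,
		       scaled_max_ptes_none(max_ptes_none, order));
	return 0;
}

It is exactly this derived per-order threshold that the thread argues is hard to reason about compared with plain 0/511 or a percentage.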
On 05/09/2025 12:55, David Hildenbrand wrote: > On 05.09.25 13:48, Lorenzo Stoakes wrote: >> On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote: >>> On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >>>>>>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >>>>>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. >>>>>>> >>>>>>> >>>>>> >>>>>> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter >>>>>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and >>>>>> will break down those hugepages and free up zero-filled memory. >>>>> >>>>> You are not really taming page faults, though, you are undoing what page faults might have messed up :) >>>>> >>>>> I have seen in our prod workloads where >>>>>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, >>>>>> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits >>>>>> of THPs like lower TLB misses. >>>>> >>>>> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. >>>>> >>>> >>>> Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. >>> I believe with mTHP support in khugepaged, the max_ptes_none value in >>> the shrinker must also leverage the 'order' scaling to properly >>> prevent thrashing. >> >> No please do not extend this 'scalling' stuff somewhere else, it's really horrid. >> >> We have to find an alternative to that, it's extremely confusing in what is >> already extremely confusing THP code. >> >> As I said before, if we can't have a boolean we need another interface, which >> makes most sense to be a ratio or in practice, a percentage sysctl. >> >> Speaking with David off-list, maybe the answer - if we must have this - is to >> add a new percentage interface and have this in lock-step with the existing >> max_ptes_none interface. One updates the other, but internally we're just using >> the percentage value. > > Yes, I'll try hacking something up and sending it as an RFC. > >> >>> I've been testing a patch for this that I might include in the V11. >>>> >>>>> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes >>>>> zero pages. >>>>> >>>>>> >>>>>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently >>>>>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean >>>>>> that the memory regression of using THP=always vs THP=madvise is halved. >>>>> >>>>> To which value would you set it? Just 510? 0? > > Sorry, I missed Usama's reply. Thanks Usama! > >>>>> >>>> >>>> There are some very large workloads in the meta fleet that I experimented with and found that having >>>> a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal >>>> comprimise in terms of application metrics improving, having an acceptable amount of memory regression and >>>> improved system level metrics (lower TLB misses, lower page faults). 
I am sure there was a better value out >>>> there for these workloads, but not possible to experiment with every value. >> >> (->Usama) It's a pity that such workloads exist. But then the percentage solution should work. > > Good. So if there is no strong case for > 255, that's already valuable for mTHP. tbh the default value of 511 is horrible. I have thought about sending a patch to change it to 0 as default in upstream for some time, but it might mean that people who upgrade their kernel might suddenly see their memory not getting hugified and it could be confusing for them?
On 05.09.25 14:31, Usama Arif wrote: > > > On 05/09/2025 12:55, David Hildenbrand wrote: >> On 05.09.25 13:48, Lorenzo Stoakes wrote: >>> On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote: >>>> On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >>>>>>>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >>>>>>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter >>>>>>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and >>>>>>> will break down those hugepages and free up zero-filled memory. >>>>>> >>>>>> You are not really taming page faults, though, you are undoing what page faults might have messed up :) >>>>>> >>>>>> I have seen in our prod workloads where >>>>>>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, >>>>>>> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits >>>>>>> of THPs like lower TLB misses. >>>>>> >>>>>> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. >>>>>> >>>>> >>>>> Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. >>>> I believe with mTHP support in khugepaged, the max_ptes_none value in >>>> the shrinker must also leverage the 'order' scaling to properly >>>> prevent thrashing. >>> >>> No please do not extend this 'scalling' stuff somewhere else, it's really horrid. >>> >>> We have to find an alternative to that, it's extremely confusing in what is >>> already extremely confusing THP code. >>> >>> As I said before, if we can't have a boolean we need another interface, which >>> makes most sense to be a ratio or in practice, a percentage sysctl. >>> >>> Speaking with David off-list, maybe the answer - if we must have this - is to >>> add a new percentage interface and have this in lock-step with the existing >>> max_ptes_none interface. One updates the other, but internally we're just using >>> the percentage value. >> >> Yes, I'll try hacking something up and sending it as an RFC. >> >>> >>>> I've been testing a patch for this that I might include in the V11. >>>>> >>>>>> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes >>>>>> zero pages. >>>>>> >>>>>>> >>>>>>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently >>>>>>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean >>>>>>> that the memory regression of using THP=always vs THP=madvise is halved. >>>>>> >>>>>> To which value would you set it? Just 510? 0? >> >> Sorry, I missed Usama's reply. Thanks Usama! >> >>>>>> >>>>> >>>>> There are some very large workloads in the meta fleet that I experimented with and found that having >>>>> a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 
51 was found to be an optimal >>>>> compromise in terms of application metrics improving, having an acceptable amount of memory regression and >>>>> improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out >>>>> there for these workloads, but not possible to experiment with every value. >>> >>> (->Usama) It's a pity that such workloads exist. But then the percentage solution should work. >> >> Good. So if there is no strong case for > 255, that's already valuable for mTHP. >> > > tbh the default value of 511 is horrible. I have thought about sending a patch to change it to 0 as default > in upstream for some time, but it might mean that people who upgrade their kernel might suddenly see > their memory not getting hugified and it could be confusing for them? 511 is just what a page fault would have done, so I think it makes perfect sense. More than anything else, actually. It's just not optimal in many cases. -- Cheers David / dhildenb
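The "511 is just what a page fault would have done" point can be seen from a simplified model of the khugepaged scan. The real check lives in mm/khugepaged.c and counts none and zero-page PTEs; this standalone version only models present/absent, so take it as a sketch of the semantics, not the implementation.

#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_NR 512

/*
 * Simplified eligibility check: refuse to collapse once the number of
 * empty PTEs exceeds max_ptes_none.  511 means "one present PTE is
 * enough", which is what THP=always would have produced at fault time;
 * 0 means "never fill holes".
 */
static bool collapse_allowed(const bool *pte_present, unsigned int max_ptes_none)
{
	unsigned int none_or_zero = 0;

	for (unsigned int i = 0; i < HPAGE_PMD_NR; i++) {
		if (!pte_present[i] && ++none_or_zero > max_ptes_none)
			return false;
	}
	return true;
}

int main(void)
{
	bool ptes[HPAGE_PMD_NR] = { [0] = true };	/* a single faulted-in page */

	printf("max_ptes_none=511: %s\n", collapse_allowed(ptes, 511) ? "collapse" : "skip");
	printf("max_ptes_none=0:   %s\n", collapse_allowed(ptes, 0) ? "collapse" : "skip");
	return 0;
}

The knob itself is the documented sysfs file /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none, so a fleet rollout like the one described above amounts to writing 0 (or a small value) there.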
On 2025/9/3 04:23, Usama Arif wrote: > > > On 02/09/2025 12:03, David Hildenbrand wrote: >> On 02.09.25 12:34, Usama Arif wrote: >>> >>> >>> On 02/09/2025 10:03, David Hildenbrand wrote: >>>> On 02.09.25 04:28, Baolin Wang wrote: >>>>> >>>>> >>>>> On 2025/9/2 00:46, David Hildenbrand wrote: >>>>>> On 29.08.25 03:55, Baolin Wang wrote: >>>>>>> >>>>>>> >>>>>>> On 2025/8/28 18:48, Dev Jain wrote: >>>>>>>> >>>>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>>>>>>> (Sorry for chiming in late) >>>>>>>>> >>>>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>>>>>>> but not sure >>>>>>>>>>>> if we have to add that for now. >>>>>>>>>>> >>>>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>>>>>>> yes you >>>>>>>>>>> might add it to the docs, but people are going to be mightily >>>>>>>>>>> confused, esp if >>>>>>>>>>> it's a calculated value. >>>>>>>>>>> >>>>>>>>>>> I don't see any other way around having a separate tunable if we >>>>>>>>>>> don't just have >>>>>>>>>>> something VERY simple like on/off. >>>>>>>>>> >>>>>>>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>>>>>>> really. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Also the mentioned issue sounds like something that needs to be >>>>>>>>>>> fixed elsewhere >>>>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>>>>>>> wrong - and >>>>>>>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>>>>>>> feels that >>>>>>>>>>> way). >>>>>>>>>> >>>>>>>>>> I think the creep is unavoidable for certain values. >>>>>>>>>> >>>>>>>>>> If you have the first two pages of a PMD area populated, and you >>>>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>>>>>>> first a >>>>>>>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>>>>>>> >>>>>>>>>> So for now we really should just support 0 / 511 to say "don't >>>>>>>>>> collapse if there are holes" vs. "always collapse if there is at >>>>>>>>>> least one pte used". >>>>>>>>> >>>>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>>>>>>> highest enabled order would ever be collapsed." >>>>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if >>>>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>>>>>>> highest order folio. >>>>>>> >>>>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >>>>>>> mean is, as in the example I gave below, users may only want to allow a >>>>>>> large order collapse when the number of present PTEs reaches half of the >>>>>>> large folio, in order to avoid RSS bloat. >>>>>> >>>>>> How do these users control allocation at fault time where this parameter >>>>>> is completely ignored? >>>>> >>>>> Sorry, I did not get your point. Why does the 'max_pte_none' need to >>>>> control allocation at fault time? Could you be more specific? Thanks. >>>> >>>> The comment over khugepaged_max_ptes_none gives a hint: >>>> >>>> /* >>>> * default collapse hugepages if there is at least one pte mapped like >>>> * it would have happened if the vma was large enough during page >>>> * fault. >>>> * >>>> * Note that these are only respected if collapse was initiated by khugepaged. 
>>>> */ >>>> >>>> In the common case (for anything that really cares about RSS bloat) you will just a >>>> get a THP during page fault and consequently RSS bloat. >>>> >>>> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems >>>> to be when an application later (after once possibly getting a THP already during >>>> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using >>>> MADV_COLLAPSE. >>>> >>>> It's a questionable use case, that already got more problematic with mTHP and page >>>> table reclaim. >>>> >>>> Let me explain: >>>> >>>> Before mTHP, if someone would MADV_DONTNEED (resulting in >>>> a page table with at least one pte_none entry), there would have been no way we would >>>> get memory over-allocated afterwards with max_ptes_none=0. >>>> >>>> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages. >>>> (2) khugepaged was told to not collapse through max_ptes_none=0. >>>> >>>> But now: >>>> >>>> (A) With mTHP during page-faults, we can just end up over-allocating memory in such >>>> an area again: page faults will simply spot a bunch of pte_nones around the fault area >>>> and install an mTHP. >>>> >>>> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the >>>> page table. The next page fault will just try installing a PMD THP again, because there is >>>> no PTE table anymore. >>>> >>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >>>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. Thanks David for your explanation. I see your point now. >>> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter >>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and >>> will break down those hugepages and free up zero-filled memory. >> >> You are not really taming page faults, though, you are undoing what page faults might have messed up :) >> >> I have seen in our prod workloads where >>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, >>> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits >>> of THPs like lower TLB misses. >> >> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. >> > > Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. > >> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes >> zero pages. >> >>> >>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently >>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean >>> that the memory regression of using THP=always vs THP=madvise is halved. >> >> To which value would you set it? Just 510? 0? >> > > There are some very large workloads in the meta fleet that I experimented with and found that having > a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 
51 was found to be an optimal > compromise in terms of application metrics improving, having an acceptable amount of memory regression and > improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out > there for these workloads, but not possible to experiment with every value. > > In terms of wider rollout across the fleet, we are going to target 0 (or a very very small value) > when moving from THP=madvise to always. Mainly because it is the least likely to cause a memory regression as > THP shrinker will deal with page faults faulting in mostly zero-filled pages and khugepaged won't collapse > pages that are dominated by 4K zero-filled chunks. Thanks for sharing this. We're also investigating what max_ptes_none should be set to in order to use the THP shrinker properly, and currently, our customers always set max_ptes_none to its default value: 511, which is not good. If 0 is better, it seems like there isn't much conflict with the values expected by mTHP collapse (0 and 511). Sounds good to me.
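The thrash-avoidance invariant the thread keeps returning to can be stated compactly: khugepaged collapses when the empty-PTE count is at or below max_ptes_none, and the THP shrinker only splits a PMD THP once its zero-filled subpages exceed that same value. The split-side threshold below is my reading of the current "underused" shrinker logic, so treat this as a sketch of the relationship rather than a quote of either implementation.

#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_NR 512

/* Collapse side: allowed if empty PTEs do not exceed the knob. */
static bool khugepaged_would_collapse(unsigned int none_or_zero,
				      unsigned int max_ptes_none)
{
	return none_or_zero <= max_ptes_none;
}

/*
 * Split side: considered underused once zero-filled pages exceed the
 * same knob; at 511 the shrinker effectively never splits.
 */
static bool shrinker_would_split(unsigned int zero_filled,
				 unsigned int max_ptes_none)
{
	if (max_ptes_none == HPAGE_PMD_NR - 1)
		return false;
	return zero_filled > max_ptes_none;
}

int main(void)
{
	const unsigned int max_ptes_none = 51;	/* ~10%, the value cited above */
	const unsigned int empty = 40;		/* empty PTEs before collapse */

	if (khugepaged_would_collapse(empty, max_ptes_none))
		printf("collapsed; immediate re-split: %s\n",
		       shrinker_would_split(empty, max_ptes_none) ? "yes" : "no");
	return 0;
}

Because the two checks are complements of each other, a folio khugepaged just collapsed can never immediately be judged underused, which is why scaling one check per order without the other reopens the thrashing question.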
On Fri, Aug 22, 2025 at 04:10:35PM +0200, David Hildenbrand wrote: > > > Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure > > > if we have to add that for now. > > > > Yeah not so sure about this, this is a 'just have to know' too, and yes you > > might add it to the docs, but people are going to be mightily confused, esp if > > it's a calculated value. > > > > I don't see any other way around having a separate tunable if we don't just have > > something VERY simple like on/off. > > Yeah, not advocating that we add support for other values than 0/511, > really. Yeah I'm fine with 0/511. > > > > > Also the mentioned issue sounds like something that needs to be fixed elsewhere > > honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and > > happy to stand corrected if this is somehow inherent, but reallly feels that > > way). > > I think the creep is unavoidable for certain values. > > If you have the first two pages of a PMD area populated, and you allow for > at least half of the #PTEs to be non/zero, you'd collapse first a > order-2 folio, then and order-3 ... until you reached PMD order. Feels like we should be looking at this in reverse? What's the largest, then next largest, then etc.? Surely this is the sensible way of doing it? > > So for now we really should just support 0 / 511 to say "don't collapse if > there are holes" vs. "always collapse if there is at least one pte used". Yes. > > > > > > > > > Because, as raised in the past, I'm afraid nobody on this earth has a clue how > > > to set this parameter to values different to 0 (don't waste memory with khugepaged) > > > and 511 (page fault behavior). > > > > Yup > > > > > > > > > > > If any other value is set, essentially > > > pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); > > > > > > for now and just disable it. > > > > Hmm but under what circumstances? I would just say unsupported value not mention > > mTHP or people who don't use mTHP might find that confusing. > > Well, we can check whether any mTHP size is enabled while the value is set > to something unexpected. We can then even print the problematic sizes if we > have to. Ack > > We could also just just say that if the value is set to something else than > 511 (which is the default), it will be treated as being "0" when collapsing > mthp, instead of doing any scaling. Or we could make it an error to set anything but 0, 511, but on the other hand that's likely to break userspace so yeah probably not. Maybe have a warning saying 'this is no longer supported and will be ignored' then set the value to 0 for anything but 511 or 0. Then can remove the warning later. By having 0/511 we can really simplify the 'scaling' logic too which would be fantastic! :) Cheers, Lorenzo
On 22/08/25 8:19 pm, Lorenzo Stoakes wrote: > On Fri, Aug 22, 2025 at 04:10:35PM +0200, David Hildenbrand wrote: >>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure >>>> if we have to add that for now. >>> Yeah not so sure about this, this is a 'just have to know' too, and yes you >>> might add it to the docs, but people are going to be mightily confused, esp if >>> it's a calculated value. >>> >>> I don't see any other way around having a separate tunable if we don't just have >>> something VERY simple like on/off. >> Yeah, not advocating that we add support for other values than 0/511, >> really. > Yeah I'm fine with 0/511. > >>> Also the mentioned issue sounds like something that needs to be fixed elsewhere >>> honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and >>> happy to stand corrected if this is somehow inherent, but reallly feels that >>> way). >> I think the creep is unavoidable for certain values. >> >> If you have the first two pages of a PMD area populated, and you allow for >> at least half of the #PTEs to be non/zero, you'd collapse first a >> order-2 folio, then and order-3 ... until you reached PMD order. > Feels like we should be looking at this in reverse? What's the largest, then > next largest, then etc.? > > Surely this is the sensible way of doing it? What David means to say is, for example, suppose all orders are enabled, and we fail to collapse for order-9, then order-8, then order-7, and so on, *only* because the distribution of ptes did not obey the scaled max_ptes_none. Let order-4 collapse succeed. Next time, khugepaged comes and tries for order-9, fails, then order-8, fails and so on. Then it checks for order-5, and it comes under the scaled max_ptes_none constraint only because the previous cycle's order-4 collapse changed the ptes' distribution. > >> So for now we really should just support 0 / 511 to say "don't collapse if >> there are holes" vs. "always collapse if there is at least one pte used". > Yes. > >>>> Because, as raised in the past, I'm afraid nobody on this earth has a clue how >>>> to set this parameter to values different to 0 (don't waste memory with khugepaged) >>>> and 511 (page fault behavior). >>> Yup >>> >>>> >>>> If any other value is set, essentially >>>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); >>>> >>>> for now and just disable it. >>> Hmm but under what circumstances? I would just say unsupported value not mention >>> mTHP or people who don't use mTHP might find that confusing. >> Well, we can check whether any mTHP size is enabled while the value is set >> to something unexpected. We can then even print the problematic sizes if we >> have to. > Ack > >> We could also just just say that if the value is set to something else than >> 511 (which is the default), it will be treated as being "0" when collapsing >> mthp, instead of doing any scaling. > Or we could make it an error to set anything but 0, 511, but on the other hand > that's likely to break userspace so yeah probably not. > > Maybe have a warning saying 'this is no longer supported and will be ignored' > then set the value to 0 for anything but 511 or 0. > > Then can remove the warning later. > > By having 0/511 we can really simplify the 'scaling' logic too which would be > fantastic! :) FWIW here was my implementation of this thing, for ease of everyone: https://lore.kernel.org/all/20250211111326.14295-17-dev.jain@arm.com/ > > Cheers, Lorenzo
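Dev's description of the creep can be reproduced with a few lines of arithmetic. The sketch below assumes the shift-based per-order scaling discussed earlier and walks orders from low to high purely to mimic successive khugepaged passes (each real pass tries the highest enabled order first; the promotion happens across passes), so it is an illustration of the effect, not the collapse algorithm itself.

#include <stdio.h>

#define HPAGE_PMD_ORDER 9

static unsigned int scaled(unsigned int max_ptes_none, unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	const unsigned int caps[] = { 256, 255 };

	for (int c = 0; c < 2; c++) {
		unsigned int max_ptes_none = caps[c];
		unsigned int present = 2;	/* two faulted-in base pages */

		printf("max_ptes_none=%u:\n", max_ptes_none);
		for (unsigned int order = 2; order <= HPAGE_PMD_ORDER; order++) {
			unsigned int pages = 1u << order;
			unsigned int none = pages - present;

			if (none > scaled(max_ptes_none, order)) {
				printf("  order %u: %u empty > %u allowed, stop\n",
				       order, none, scaled(max_ptes_none, order));
				break;
			}
			printf("  order %u: collapse (%u empty <= %u allowed)\n",
			       order, none, scaled(max_ptes_none, order));
			present = pages;	/* the new folio now fills the range */
		}
	}
	return 0;
}

With 256 (half the PTEs allowed to be empty) the two populated pages creep all the way up to a PMD; with 255 the very first promotion is already refused, which is why the >255 boundary keeps coming up in this thread.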
On Fri, Aug 22, 2025 at 09:03:41PM +0530, Dev Jain wrote: > > On 22/08/25 8:19 pm, Lorenzo Stoakes wrote: > > On Fri, Aug 22, 2025 at 04:10:35PM +0200, David Hildenbrand wrote: > > > > > Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure > > > > > if we have to add that for now. > > > > Yeah not so sure about this, this is a 'just have to know' too, and yes you > > > > might add it to the docs, but people are going to be mightily confused, esp if > > > > it's a calculated value. > > > > > > > > I don't see any other way around having a separate tunable if we don't just have > > > > something VERY simple like on/off. > > > Yeah, not advocating that we add support for other values than 0/511, > > > really. > > Yeah I'm fine with 0/511. > > > > > > Also the mentioned issue sounds like something that needs to be fixed elsewhere > > > > honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and > > > > happy to stand corrected if this is somehow inherent, but reallly feels that > > > > way). > > > I think the creep is unavoidable for certain values. > > > > > > If you have the first two pages of a PMD area populated, and you allow for > > > at least half of the #PTEs to be non/zero, you'd collapse first a > > > order-2 folio, then and order-3 ... until you reached PMD order. > > Feels like we should be looking at this in reverse? What's the largest, then > > next largest, then etc.? > > > > Surely this is the sensible way of doing it? > > What David means to say is, for example, suppose all orders are enabled, > and we fail to collapse for order-9, then order-8, then order-7, and so on, > *only* because the distribution of ptes did not obey the scaled max_ptes_none. > Let order-4 collapse succeed. Ah so it is the overhead of this that's the problem? All roads lead to David's suggestion imo. > > By having 0/511 we can really simplify the 'scaling' logic too which would be > > fantastic! :) > > FWIW here was my implementation of this thing, for ease of everyone: > https://lore.kernel.org/all/20250211111326.14295-17-dev.jain@arm.com/ That's fine, but I really think we should just replace all this stuff with a boolean, and change the interface to max_ptes to set boolean if 511, or clear if 0. Cheers, Lorenzo
On 21.08.25 18:54, Lorenzo Stoakes wrote: > On Thu, Aug 21, 2025 at 10:46:18AM -0600, Nico Pache wrote: >>>>>> Thanks and I"ll have a look, but this series is unmergeable with a broken >>>>>> default in >>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio >>>>>> sorry. >>>>>> >>>>>> We need to have a new tunable as far as I can tell. I also find the use of >>>>>> this PMD-specific value as an arbitrary way of expressing a ratio pretty >>>>>> gross. >>>>> The first thing that comes to mind is that we can pin max_ptes_none to >>>>> 255 if it exceeds 255. It's worth noting that the issue occurs only >>>>> for adjacently enabled mTHP sizes. >>> >>> No! Presumably the default of 511 (for PMDs with 512 entries) is set for a >>> reason, arbitrarily changing this to suit a specific case seems crazy no? >> We wouldn't be changing it for PMD collapse, just for the new >> behavior. At 511, no mTHP collapses would ever occur anyways, unless >> you have 2MB disabled and other mTHP sizes enabled. Technically at 511 >> only the highest enabled order always gets collapsed. >> >> Ive also argued in the past that 511 is a terrible default for >> anything other than thp.enabled=always, but that's a whole other can >> of worms we dont need to discuss now. >> >> with this cap of 255, the PMD scan/collapse would work as intended, >> then in mTHP collapses we would never introduce this undesired >> behavior. We've discussed before that this would be a hard problem to >> solve without introducing some expensive way of tracking what has >> already been through a collapse, and that doesnt even consider what >> happens if things change or are unmapped, and rescanning that section >> would be helpful. So having a strictly enforced limit of 255 actually >> seems like a good idea to me, as it completely avoids the undesired >> behavior and does not require the admins to be aware of such an issue. >> >> Another thought similar to what (IIRC) Dev has mentioned before, if we >> have max_ptes_none > 255 then we only consider collapses to the >> largest enabled order, that way no creep to the largest enabled order >> would occur in the first place, and we would get there straight away. >> >> To me one of these two solutions seem sane in the context of what we >> are dealing with. >>> >>>>> >>>>> ie) >>>>> if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 >>>>> temp_max_ptes_none = 255; >>>> Oh and my second point, introducing a new tunable to control mTHP >>>> collapse may become exceedingly complex from a tuning and code >>>> management standpoint. >>> >>> Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) / >>> PMD) 'except please don't set to the usual default when using mTHP' and it's >>> currently default-broken. >>> >>> I'm really not sure how that is simpler than a seprate tunable that can be >>> expressed as a ratio (e.g. percentage) that actually makes some kind of sense? >> I agree that the current tunable wasn't designed for this, but we >> tried to come up with something that leverages the tunable we have to >> avoid new tunables and added complexity. >>> >>> And we can make anything workable from a code management point of view by >>> refactoring/developing appropriately. >> What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte >> (ideally the max number)? seems like we would be saying we want no new >> none pages, but also to allow new none pages. 
To me that seems equally >> broken and more confusing than just taking a scale of the current >> number (now with a cap). >> >> > > The one thing we absolutely cannot have is a default that causes this > 'creeping' behaviour. This feels like shipping something that is broken and > alluding to it in the documentation. > > I spoke to David off-list and he gave some insight into this and perhaps > some reasonable means of avoiding an additional tunable. > > I don't want to rehash what he said as I think it's more productive for him > to reply when he has time but broadly I think how we handle this needs > careful consideration. > > To me it's clear that some sense of ratio is just immediately very very > confusing, but then again this interface is already confusing, as with much > of THP. > > Anyway I'll let David respond here so we don't loop around before he has a > chance to add his input. I've been summoned. As raised in the past, I would initially only support specific values here, like 0 : Never collapse with any pte_none/zeropage; 511 (HPAGE_PMD_NR - 1) / default : Always collapse, ignoring pte_none/zeropage. One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1), but not sure if we have to add that for now. Because, as raised in the past, I'm afraid nobody on this earth has a clue how to set this parameter to values different from 0 (don't waste memory with khugepaged) and 511 (page fault behavior). If any other value is set, essentially pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); for now and just disable it. -- Cheers David / dhildenb
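A standalone sketch of the policy proposed here; the helper name and the placement of the warning are assumptions, only the 0/511 semantics and the pr_warn idea come from the mail above.

#include <stdio.h>

#define HPAGE_PMD_NR 512

/*
 * Hypothetical policy: when collapsing to mTHP orders, honour only 0 and
 * HPAGE_PMD_NR - 1; warn about anything else and fall back to 0
 * ("don't fill holes"), leaving PMD-order collapse behaviour untouched.
 */
static unsigned int mthp_effective_max_ptes_none(unsigned int max_ptes_none,
						 int mthp_enabled)
{
	if (!mthp_enabled)
		return max_ptes_none;	/* legacy PMD-only behaviour */

	if (max_ptes_none == 0 || max_ptes_none == HPAGE_PMD_NR - 1)
		return max_ptes_none;

	fprintf(stderr, "Unsupported 'max_ptes_none' value %u for mTHP collapse, treating as 0\n",
		max_ptes_none);
	return 0;
}

int main(void)
{
	printf("511 -> %u\n", mthp_effective_max_ptes_none(511, 1));
	printf("256 -> %u\n", mthp_effective_max_ptes_none(256, 1));
	printf("0   -> %u\n", mthp_effective_max_ptes_none(0, 1));
	return 0;
}

Whether such a warning should fire at write time (when an mTHP size is enabled) or at collapse time is exactly the open question in the exchange above.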