The following series provides khugepaged with the capability to collapse
anonymous memory regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then, during the PMD scan, we use a bitmap to track chunks of
pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the
PMD scan is done, we do binary recursion on the bitmap to find the optimal
mTHP sizes for the PMD range. The restriction on max_ptes_none is removed
during the scan, to make sure we account for the whole PMD range. When no
mTHP size is enabled, the legacy behavior of khugepaged is maintained.
max_ptes_none will be scaled by the attempted collapse order to determine
how full a mTHP must be to be eligible for the collapse to occur (an
illustrative sketch of this scaling follows the cover letter). If a mTHP
collapse is attempted but the range contains swapped-out or shared pages,
we don't perform the collapse. It is now also possible to collapse to
mTHPs without requiring the PMD THP size to be enabled.

With the default max_ptes_none=511, the code should keep most of its
original behavior. When enabling multiple adjacent (m)THP sizes we need to
set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
experience collapse "creep" and constantly promote mTHPs to the next
available size. This is due to the fact that a collapse will introduce at
least 2x the number of pages, and on a future scan will satisfy the
promotion condition once again.

Patch 1: Refactor/rename hpage_collapse
Patch 2: Some refactoring to combine madvise_collapse and khugepaged
Patch 3-5: Generalize khugepaged functions for arbitrary orders
Patch 6-8: The mTHP patches
Patch 9-10: Allow khugepaged to operate without PMD enabled
Patch 11-12: Tracing/stats
Patch 13: Documentation

--------- Testing ---------
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- I created a test script that I used to push khugepaged to its limits
  while monitoring a number of stats and tracepoints. The code is available
  here[1] (run in legacy mode for these changes and set mTHP sizes to
  inherit). The summary from my testing was that there was no significant
  regression noticed through this test. In some cases my changes had better
  collapse latencies, and were able to scan more pages in the same amount
  of time/work, but for the most part the results were consistent.
- redis testing. I tested these changes along with my defer changes (see
  the follow-up post [4] for more details). We've decided to get the mTHP
  changes merged first before attempting the defer series.
- Some basic testing on 64k page size.
- Lots of general use.

V10 Changes:
- Fixed bug where bitmap tracking was off by one, leading to weird behavior
  in some test cases.
- Track mTHP stats for PMD order too (Baolin)
- Indentation cleanup (David)
- Add review/ack tags
- Improve the control flow, readability, and result handling in
  collapse_scan_bitmap (Baolin)
- Indentation nits/cleanup (David)
- Converted u8 orders to unsigned int to be consistent with other folio
  callers (David)
- Handled conflicts with Dev's work on PTE batching
- Changed SWAP/SHARED restriction comments to a TODO comment (David)
- Squashed the main mTHP patch and the introduce-bitmap patch (David)
- Other small nits

V9 Changes: [3]
- Drop madvise_collapse support [2]. Further discussion needed.
- Add documentation entries for new stats (Baolin)
- Fix missing stat update (MTHP_STAT_COLLAPSE_EXCEED_SWAP) that was
  accidentally dropped in v7 (Baolin)
- Fix mishandled conflict noted in v8 (merged into wrong commit)
- Change rename from khugepaged to collapse (Dev)

V8 Changes:
- Fix mishandled conflict with shmem config changes (Baolin)
- Add Baolin's patches for allowing collapse without PMD enabled
- Add additional patch for allowing madvise_collapse without PMD enabled
- Documentation nits (Randy)
- Simplify SCAN_ANY_PROCESS lock jumbling (Liam)
- Add a BUG_ON to the mTHP collapse similar to PMD (Dev)
- Remove doc comment about khugepaged PMD-only limitation (Dev)
- Change revalidation function to accept multiple orders
- Handled conflicts introduced by Lorenzo's madvise changes

V7 (RESEND)

V6 Changes:
- Don't release the anon_vma_lock early (like in the PMD case), as not all
  pages are isolated.
- Define the PTE as null to avoid an uninitialized condition
- Minor nits and newline cleanup
- Make sure to unmap and unlock the pte for the swapin case
- Change the revalidation to always check the PMD order (as this will make
  sure that no other VMA spans it)

V5 Changes:
- Switched the order of patches 1 and 2
- Fixed some edge cases in the unified madvise_collapse and khugepaged
- Explained the "creep" some more in the docs
- Fix EXCEED_SHARED vs EXCEED_SWAP accounting issue
- Fix potential highmem issue caused by an early unmap of the PTE

V4 Changes:
- Rebased onto mm-unstable
- Small changes to Documentation

V3 Changes:
- Corrected legacy behavior for khugepaged and madvise_collapse
- Added proper mTHP stat tracking
- Minor changes to prevent a nested lock on non-split-lock arches
- Took Dev's version of alloc_charge_folio as it has the proper stats
- Skip cases where trying to collapse to a lower order would still fail
- Fixed cases where the bitmap was not being updated properly
- Moved Documentation update to this series instead of the defer set
- Minor bugs discovered during testing and review
- Minor "nit" cleanup

V2 Changes:
- Minor bug fixes discovered during review and testing
- Removed dynamic allocations for bitmaps, and made them stack based
- Adjusted bitmap offset from u8 to u16 to support 64k page size
- Updated trace events to include collapsing order info
- Scaled max_ptes_none by order rather than scaling to a 0-100 scale
- No longer require a chunk to be fully utilized before setting the bit.
  Use the same max_ptes_none scaling principle to achieve this.
- Skip mTHP collapse that requires swapin or shared handling. This helps
  prevent some of the "creep" that was discovered in v1.

A big thanks to everyone that has reviewed, tested, and participated in the
development process. It's been a great experience working with all of you
on this long endeavour.
[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/all/23b8ad10-cd1f-45df-a25c-78d01c8af44f@redhat.com/
[3] - https://lore.kernel.org/lkml/20250714003207.113275-1-npache@redhat.com/
[4] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

Baolin Wang (2):
  khugepaged: enable collapsing mTHPs even when PMD THPs are disabled
  khugepaged: kick khugepaged for enabling none-PMD-sized mTHPs

Dev Jain (1):
  khugepaged: generalize alloc_charge_folio()

Nico Pache (10):
  khugepaged: rename hpage_collapse_* to collapse_*
  introduce collapse_single_pmd to unify khugepaged and madvise_collapse
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: add mTHP support
  khugepaged: skip collapsing mTHP to smaller orders
  khugepaged: avoid unnecessary mTHP collapse attempts
  khugepaged: improve tracepoints for mTHP orders
  khugepaged: add per-order mTHP khugepaged stats
  Documentation: mm: update the admin guide for mTHP collapse

 Documentation/admin-guide/mm/transhuge.rst |  44 +-
 include/linux/huge_mm.h                    |   5 +
 include/linux/khugepaged.h                 |   4 +
 include/trace/events/huge_memory.h         |  34 +-
 mm/huge_memory.c                           |  11 +
 mm/khugepaged.c                            | 552 +++++++++++++++------
 6 files changed, 468 insertions(+), 182 deletions(-)

-- 
2.50.1
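A minimal userspace sketch of the order-based scaling described in the cover
letter, assuming a shift-based formula; the helper name scaled_max_ptes_none(),
the HPAGE_PMD_ORDER define, and the printed table are illustrative only and
may differ from whatever helper mm/khugepaged.c actually grows.

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* 512 PTEs per PMD with 4K pages */

/* Assumed scaling: shift the PMD-wide limit down by the order gap. */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	unsigned int settings[] = { 511, 255, 0 };

	for (unsigned int i = 0; i < 3; i++) {
		printf("max_ptes_none=%u\n", settings[i]);
		for (unsigned int order = 2; order <= HPAGE_PMD_ORDER; order++) {
			unsigned int nr_ptes = 1u << order;
			unsigned int none_ok = scaled_max_ptes_none(settings[i], order);

			printf("  order %u (%3u PTEs): up to %3u none -> need %3u utilized\n",
			       order, nr_ptes, none_ok, nr_ptes - none_ok);
		}
	}
	return 0;
}

With 511, every order tolerates all-but-one none PTE, so every enabled size is
trivially eligible; with 255, each sub-PMD order must be more than half
utilized, which is why the cover letter recommends max_ptes_none <= 255 when
multiple adjacent sizes are enabled.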
On Tue, 19 Aug 2025 07:41:52 -0600 Nico Pache <npache@redhat.com> wrote:

> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.
>
> ...
>
> - I created a test script that I used to push khugepaged to its limits
>   while monitoring a number of stats and tracepoints. The code is
>   available here[1] (Run in legacy mode for these changes and set mthp
>   sizes to inherit)

Could this be turned into something in tools/testing/selftests/mm/?

> V10 Changes:

I'll add this to mm-new, thanks. I'll suppress the usual emails.
On Tue, Aug 19, 2025 at 3:55 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Tue, 19 Aug 2025 07:41:52 -0600 Nico Pache <npache@redhat.com> wrote: > > > The following series provides khugepaged with the capability to collapse > > anonymous memory regions to mTHPs. > > > > ... > > > > - I created a test script that I used to push khugepaged to its limits > > while monitoring a number of stats and tracepoints. The code is > > available here[1] (Run in legacy mode for these changes and set mthp > > sizes to inherit) > > Could this be turned into something in tools/testing/selftests/mm/? Yep! I was actually working on some selftests for this before I hit a weird bug during some of my testing, which took precedence over that. I was planning on sending a separate series for the testing. One of the pain points was that selftests helpers were set up for PMDs, but I think Baolin just cleaned that up in his khugepaged mTHP shmem series (still need to review). So it should be a lot easier to implement now. > > > V10 Changes: > > I'll add this to mm-new, thanks. I'll suppress the usual emails. Thank you :) -- Nico >
On 19/08/25 7:11 pm, Nico Pache wrote: > The following series provides khugepaged with the capability to collapse > anonymous memory regions to mTHPs. > > To achieve this we generalize the khugepaged functions to no longer depend > on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of > pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the > PMD scan is done, we do binary recursion on the bitmap to find the optimal > mTHP sizes for the PMD range. The restriction on max_ptes_none is removed > during the scan, to make sure we account for the whole PMD range. When no > mTHP size is enabled, the legacy behavior of khugepaged is maintained. > max_ptes_none will be scaled by the attempted collapse order to determine > how full a mTHP must be to be eligible for the collapse to occur. If a > mTHP collapse is attempted, but contains swapped out, or shared pages, we > don't perform the collapse. It is now also possible to collapse to mTHPs > without requiring the PMD THP size to be enabled. > > With the default max_ptes_none=511, the code should keep its most of its > original behavior. When enabling multiple adjacent (m)THP sizes we need to > set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will > experience collapse "creep" and constantly promote mTHPs to the next > available size. This is due the fact that a collapse will introduce at > least 2x the number of pages, and on a future scan will satisfy the > promotion condition once again. > > Patch 1: Refactor/rename hpage_collapse > Patch 2: Some refactoring to combine madvise_collapse and khugepaged > Patch 3-5: Generalize khugepaged functions for arbitrary orders > Patch 6-8: The mTHP patches > Patch 9-10: Allow khugepaged to operate without PMD enabled > Patch 11-12: Tracing/stats > Patch 13: Documentation > > For the next version, it will be really great if you can mention the lore links referencing important ideas guiding the evolution of the algorithm - say, a policy decision we make. (I frequently do that albeit I think I over-do it :)) I am asking because I am completely lost on the current discussion going on around the max_ptes_* scaling (been busy with other stuff).
On 19.08.25 15:41, Nico Pache wrote: > The following series provides khugepaged with the capability to collapse > anonymous memory regions to mTHPs. > > To achieve this we generalize the khugepaged functions to no longer depend > on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of > pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the > PMD scan is done, we do binary recursion on the bitmap to find the optimal > mTHP sizes for the PMD range. The restriction on max_ptes_none is removed > during the scan, to make sure we account for the whole PMD range. When no > mTHP size is enabled, the legacy behavior of khugepaged is maintained. > max_ptes_none will be scaled by the attempted collapse order to determine > how full a mTHP must be to be eligible for the collapse to occur. If a > mTHP collapse is attempted, but contains swapped out, or shared pages, we > don't perform the collapse. It is now also possible to collapse to mTHPs > without requiring the PMD THP size to be enabled. > > With the default max_ptes_none=511, the code should keep its most of its > original behavior. When enabling multiple adjacent (m)THP sizes we need to > set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will > experience collapse "creep" and constantly promote mTHPs to the next > available size. This is due the fact that a collapse will introduce at > least 2x the number of pages, and on a future scan will satisfy the > promotion condition once again. > > Patch 1: Refactor/rename hpage_collapse > Patch 2: Some refactoring to combine madvise_collapse and khugepaged > Patch 3-5: Generalize khugepaged functions for arbitrary orders > Patch 6-8: The mTHP patches > Patch 9-10: Allow khugepaged to operate without PMD enabled > Patch 11-12: Tracing/stats > Patch 13: Documentation Would it be feasible to start with simply not supporting the max_pte_none parameter in the first version, just like we won't support max_pte_swapped/max_pte_shared in the first version? That gives us more time to think about how to use/modify the old interface. For example, I could envision a ratio-based interface, or as discussed with Lorenzo a simple boolean. We could make the existing max_ptes* interface backwards compatible then. That also gives us the opportunity to think about the creep problem separately. I'm sure initial mTHP collapse will be valuable even without support for that weird set of parameters. Would there be implementation-wise a problem? But let me think further about the creep problem ... :/ -- Cheers David / dhildenb
On 01.09.25 18:21, David Hildenbrand wrote: > On 19.08.25 15:41, Nico Pache wrote: >> The following series provides khugepaged with the capability to collapse >> anonymous memory regions to mTHPs. >> >> To achieve this we generalize the khugepaged functions to no longer depend >> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of >> pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the >> PMD scan is done, we do binary recursion on the bitmap to find the optimal >> mTHP sizes for the PMD range. The restriction on max_ptes_none is removed >> during the scan, to make sure we account for the whole PMD range. When no >> mTHP size is enabled, the legacy behavior of khugepaged is maintained. >> max_ptes_none will be scaled by the attempted collapse order to determine >> how full a mTHP must be to be eligible for the collapse to occur. If a >> mTHP collapse is attempted, but contains swapped out, or shared pages, we >> don't perform the collapse. It is now also possible to collapse to mTHPs >> without requiring the PMD THP size to be enabled. >> >> With the default max_ptes_none=511, the code should keep its most of its >> original behavior. When enabling multiple adjacent (m)THP sizes we need to >> set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will >> experience collapse "creep" and constantly promote mTHPs to the next >> available size. This is due the fact that a collapse will introduce at >> least 2x the number of pages, and on a future scan will satisfy the >> promotion condition once again. >> >> Patch 1: Refactor/rename hpage_collapse >> Patch 2: Some refactoring to combine madvise_collapse and khugepaged >> Patch 3-5: Generalize khugepaged functions for arbitrary orders >> Patch 6-8: The mTHP patches >> Patch 9-10: Allow khugepaged to operate without PMD enabled >> Patch 11-12: Tracing/stats >> Patch 13: Documentation > > Would it be feasible to start with simply not supporting the > max_pte_none parameter in the first version, just like we won't support > max_pte_swapped/max_pte_shared in the first version? > > That gives us more time to think about how to use/modify the old interface. > > For example, I could envision a ratio-based interface, or as discussed > with Lorenzo a simple boolean. We could make the existing max_ptes* > interface backwards compatible then. > > That also gives us the opportunity to think about the creep problem > separately. > > I'm sure initial mTHP collapse will be valuable even without support for > that weird set of parameters. > > Would there be implementation-wise a problem? > > But let me think further about the creep problem ... :/ FWIW, I just looked around and there is documented usage of setting max_ptes_none to 0 [1, 2, 3]. In essence, I think it can make sense to set it to 0 when an application wants to manage THP on its own (MADV_COLLAPSE), and avoid khugepaged interfering. Now, using a system-wide toggle for such a use case is rather questionable, but it's all we have. I did not find anything only recommending to set values different to 0 or 511 -- so far. So *likely* focusing on 0 vs. 511 initially would cover most use cases out there. Ignoring the parameter initially (require all to be !none) could of course also work. 
[1] https://www.mongodb.com/docs/manual/administration/tcmalloc-performance/
[2] https://google.github.io/tcmalloc/tuning.html
[3] https://support.yugabyte.com/hc/en-us/articles/36558155921165-Mitigating-Excessive-RSS-Memory-Usage-Due-to-THP-Transparent-Huge-Pages

-- 
Cheers

David / dhildenb
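For context on how such an application-managed setup could look, here is a
small userspace sketch under the assumption that the admin has set
khugepaged/max_ptes_none to 0 and the program requests huge pages itself via
MADV_COLLAPSE (available since Linux 6.1); the allocation size, the memset
population, and the error handling are illustrative only.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* defined since Linux 6.1 */
#endif

int main(void)
{
	size_t len = 2ul << 20;			/* one 2M PMD-sized region */
	char *buf = aligned_alloc(len, len);	/* PMD-aligned allocation */

	if (!buf)
		return 1;

	memset(buf, 1, len);			/* populate every page first */

	/* Ask for a huge page for this hot region only; khugepaged stays
	 * out of sparsely used memory because max_ptes_none is 0. */
	if (madvise(buf, len, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");

	free(buf);
	return 0;
}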
OK so I noticed in patch 13/13 (!) where you change the documentation that you
essentially state that the whole method used to determine the ratio of PTEs to
collapse to mTHP is broken:

	khugepaged uses max_ptes_none scaled to the order of the enabled
	mTHP size to determine collapses. When using mTHPs it's recommended
	to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
	on 4k page size). This will prevent undesired "creep" behavior that
	leads to continuously collapsing to the largest mTHP size; when we
	collapse, we are bringing in new non-zero pages that will, on a
	subsequent scan, cause the max_ptes_none check of the +1 order to
	always be satisfied. By limiting this to less than half the current
	order, we make sure we don't cause this feedback loop.
	max_ptes_shared and max_ptes_swap have no effect when collapsing to
	a mTHP, and mTHP collapse will fail on shared or swapped out pages.

This seems to me to suggest that using
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
of establishing a 'ratio' to do this calculation is fundamentally flawed.

So surely we ought to introduce a new sysfs tunable for this? Perhaps

/sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio

Or something like this?

It's already questionable that we are taking a value that is expressed
essentially in terms of PTE entries per PMD and then use it implicitly to
determine the ratio for mTHP, but to then say 'oh but the default value is
known-broken' is just a blocker for the series in my opinion.

This really has to be done a different way I think.

Cheers, Lorenzo
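A toy simulation of the feedback loop quoted above, assuming the shift-based
scaling of max_ptes_none; the chosen numbers (4 initially utilized PTEs,
max_ptes_none=300) and helper names are illustrative only.

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* 512 PTEs per PMD with 4K pages */

static unsigned int scaled_limit(unsigned int max_ptes_none, unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	unsigned int max_ptes_none = 300;	/* anything above 255 shows the creep */
	unsigned int utilized = 4;		/* PTEs the application actually touched */
	unsigned int order = 2;			/* region already collapsed to order 2 */

	while (order < HPAGE_PMD_ORDER) {
		unsigned int next = order + 1;
		unsigned int none = (1u << next) - utilized;

		if (none > scaled_limit(max_ptes_none, next))
			break;			/* threshold not met: promotion stops */

		printf("scan: %3u of %3u PTEs populated -> collapse to order %u\n",
		       utilized, 1u << next, next);
		utilized = 1u << next;		/* the collapse fills the region */
		order = next;
	}
	printf("settled at order %u (%u PTEs) although only 4 PTEs were ever touched\n",
	       order, 1u << order);
	return 0;
}

Re-running this with max_ptes_none at 255 or below makes the very first
promotion fail, which is the behaviour the documentation text is trying to
steer admins toward.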
On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > OK so I noticed in patch 13/13 (!) where you change the documentation that you > essentially state that the whole method used to determine the ratio of PTEs to > collapse to mTHP is broken: > > khugepaged uses max_ptes_none scaled to the order of the enabled > mTHP size to determine collapses. When using mTHPs it's recommended > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > on 4k page size). This will prevent undesired "creep" behavior that > leads to continuously collapsing to the largest mTHP size; when we > collapse, we are bringing in new non-zero pages that will, on a > subsequent scan, cause the max_ptes_none check of the +1 order to > always be satisfied. By limiting this to less than half the current > order, we make sure we don't cause this feedback > loop. max_ptes_shared and max_ptes_swap have no effect when > collapsing to a mTHP, and mTHP collapse will fail on shared or > swapped out pages. > > This seems to me to suggest that using > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > Or something like this? > > It's already questionable that we are taking a value that is expressed > essentially in terms of PTE entries per PMD and then use it implicitly to > determine the ratio for mTHP, but to then say 'oh but the default value is > known-broken' is just a blocker for the series in my opinion. > > This really has to be done a different way I think. > > Cheers, Lorenzo FWIW this was my version of the documentation patch: https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ The discussion about the creep problem started here: https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ and the discussion continuing here: https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ ending with a summary I gave here: https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ This should help you with the context.
* Dev Jain <dev.jain@arm.com> [250821 11:14]: > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > essentially state that the whole method used to determine the ratio of PTEs to > > collapse to mTHP is broken: > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > mTHP size to determine collapses. When using mTHPs it's recommended > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > on 4k page size). This will prevent undesired "creep" behavior that > > leads to continuously collapsing to the largest mTHP size; when we > > collapse, we are bringing in new non-zero pages that will, on a > > subsequent scan, cause the max_ptes_none check of the +1 order to > > always be satisfied. By limiting this to less than half the current > > order, we make sure we don't cause this feedback > > loop. max_ptes_shared and max_ptes_swap have no effect when > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > swapped out pages. > > > > This seems to me to suggest that using > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > Or something like this? > > > > It's already questionable that we are taking a value that is expressed > > essentially in terms of PTE entries per PMD and then use it implicitly to > > determine the ratio for mTHP, but to then say 'oh but the default value is > > known-broken' is just a blocker for the series in my opinion. > > > > This really has to be done a different way I think. > > > > Cheers, Lorenzo > > FWIW this was my version of the documentation patch: > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > The discussion about the creep problem started here: > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > and the discussion continuing here: > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > ending with a summary I gave here: > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > This should help you with the context. Thanks for hunting this down, the context should be referenced in the change log so we can find it easier in the future (and now). Or at least in the cover letter. The way the change log in the cover letter is written makes it exceedingly long. Could you switch to listing the changes from v9 and links to v1-8 (+RFCs if there are any)? Well, I guess include v10 changes and v1-9 urls.. At the length it is now, it's most likely a tl;dr for most. If you're starting to review this at v10, then you'd probably appreciate not rehashing discussions and if you're going from v9 then you already have an idea of what v10 should have changed. Said another way, the changelog is more useful with context and context is difficult to find without a lore link. I am having issues tracking down the contexts of many items of what has been generated here and it'll only get worse as time moves on. We do our best to keep change logs with the necessary details, but having bread crumbs to follow is extremely helpful for review and in the long run. Thanks, Liam
On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote: > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > essentially state that the whole method used to determine the ratio of PTEs to > > collapse to mTHP is broken: > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > mTHP size to determine collapses. When using mTHPs it's recommended > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > on 4k page size). This will prevent undesired "creep" behavior that > > leads to continuously collapsing to the largest mTHP size; when we > > collapse, we are bringing in new non-zero pages that will, on a > > subsequent scan, cause the max_ptes_none check of the +1 order to > > always be satisfied. By limiting this to less than half the current > > order, we make sure we don't cause this feedback > > loop. max_ptes_shared and max_ptes_swap have no effect when > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > swapped out pages. > > > > This seems to me to suggest that using > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > Or something like this? > > > > It's already questionable that we are taking a value that is expressed > > essentially in terms of PTE entries per PMD and then use it implicitly to > > determine the ratio for mTHP, but to then say 'oh but the default value is > > known-broken' is just a blocker for the series in my opinion. > > > > This really has to be done a different way I think. > > > > Cheers, Lorenzo > > FWIW this was my version of the documentation patch: > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > The discussion about the creep problem started here: > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > and the discussion continuing here: > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > ending with a summary I gave here: > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > This should help you with the context. > > Thanks and I"ll have a look, but this series is unmergeable with a broken default in /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio sorry. We need to have a new tunable as far as I can tell. I also find the use of this PMD-specific value as an arbitrary way of expressing a ratio pretty gross. Thanks, Lorenzo
On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote: > > > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > > essentially state that the whole method used to determine the ratio of PTEs to > > > collapse to mTHP is broken: > > > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > > mTHP size to determine collapses. When using mTHPs it's recommended > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > > on 4k page size). This will prevent undesired "creep" behavior that > > > leads to continuously collapsing to the largest mTHP size; when we > > > collapse, we are bringing in new non-zero pages that will, on a > > > subsequent scan, cause the max_ptes_none check of the +1 order to > > > always be satisfied. By limiting this to less than half the current > > > order, we make sure we don't cause this feedback > > > loop. max_ptes_shared and max_ptes_swap have no effect when > > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > > swapped out pages. > > > > > > This seems to me to suggest that using > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > > Or something like this? > > > > > > It's already questionable that we are taking a value that is expressed > > > essentially in terms of PTE entries per PMD and then use it implicitly to > > > determine the ratio for mTHP, but to then say 'oh but the default value is > > > known-broken' is just a blocker for the series in my opinion. > > > > > > This really has to be done a different way I think. > > > > > > Cheers, Lorenzo > > > > FWIW this was my version of the documentation patch: > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > > > The discussion about the creep problem started here: > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > > > and the discussion continuing here: > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > > > ending with a summary I gave here: > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > > > This should help you with the context. > > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > default in > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > sorry. > > We need to have a new tunable as far as I can tell. I also find the use of > this PMD-specific value as an arbitrary way of expressing a ratio pretty > gross. The first thing that comes to mind is that we can pin max_ptes_none to 255 if it exceeds 255. It's worth noting that the issue occurs only for adjacently enabled mTHP sizes. ie) if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 temp_max_ptes_none = 255; > > Thanks, Lorenzo >
On Thu, Aug 21, 2025 at 9:25 AM Nico Pache <npache@redhat.com> wrote: > > On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > > > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote: > > > > > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > > > essentially state that the whole method used to determine the ratio of PTEs to > > > > collapse to mTHP is broken: > > > > > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > > > mTHP size to determine collapses. When using mTHPs it's recommended > > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > > > on 4k page size). This will prevent undesired "creep" behavior that > > > > leads to continuously collapsing to the largest mTHP size; when we > > > > collapse, we are bringing in new non-zero pages that will, on a > > > > subsequent scan, cause the max_ptes_none check of the +1 order to > > > > always be satisfied. By limiting this to less than half the current > > > > order, we make sure we don't cause this feedback > > > > loop. max_ptes_shared and max_ptes_swap have no effect when > > > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > > > swapped out pages. > > > > > > > > This seems to me to suggest that using > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > > > > Or something like this? > > > > > > > > It's already questionable that we are taking a value that is expressed > > > > essentially in terms of PTE entries per PMD and then use it implicitly to > > > > determine the ratio for mTHP, but to then say 'oh but the default value is > > > > known-broken' is just a blocker for the series in my opinion. > > > > > > > > This really has to be done a different way I think. > > > > > > > > Cheers, Lorenzo > > > > > > FWIW this was my version of the documentation patch: > > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > > > > > The discussion about the creep problem started here: > > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > > > > > and the discussion continuing here: > > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > > > > > ending with a summary I gave here: > > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > > > > > This should help you with the context. > > > > > > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > > default in > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > sorry. > > > > We need to have a new tunable as far as I can tell. I also find the use of > > this PMD-specific value as an arbitrary way of expressing a ratio pretty > > gross. > The first thing that comes to mind is that we can pin max_ptes_none to > 255 if it exceeds 255. It's worth noting that the issue occurs only > for adjacently enabled mTHP sizes. 
>
> ie)
> if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255
>     temp_max_ptes_none = 255;

Oh and my second point, introducing a new tunable to control mTHP
collapse may become exceedingly complex from a tuning and code
management standpoint.

> >
> > Thanks, Lorenzo
> >
On Thu, Aug 21, 2025 at 09:27:19AM -0600, Nico Pache wrote: > On Thu, Aug 21, 2025 at 9:25 AM Nico Pache <npache@redhat.com> wrote: > > > > On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote: > > > > > > > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > > > > essentially state that the whole method used to determine the ratio of PTEs to > > > > > collapse to mTHP is broken: > > > > > > > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > > > > mTHP size to determine collapses. When using mTHPs it's recommended > > > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > > > > on 4k page size). This will prevent undesired "creep" behavior that > > > > > leads to continuously collapsing to the largest mTHP size; when we > > > > > collapse, we are bringing in new non-zero pages that will, on a > > > > > subsequent scan, cause the max_ptes_none check of the +1 order to > > > > > always be satisfied. By limiting this to less than half the current > > > > > order, we make sure we don't cause this feedback > > > > > loop. max_ptes_shared and max_ptes_swap have no effect when > > > > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > > > > swapped out pages. > > > > > > > > > > This seems to me to suggest that using > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > > > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > > > > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > > > > > > Or something like this? > > > > > > > > > > It's already questionable that we are taking a value that is expressed > > > > > essentially in terms of PTE entries per PMD and then use it implicitly to > > > > > determine the ratio for mTHP, but to then say 'oh but the default value is > > > > > known-broken' is just a blocker for the series in my opinion. > > > > > > > > > > This really has to be done a different way I think. > > > > > > > > > > Cheers, Lorenzo > > > > > > > > FWIW this was my version of the documentation patch: > > > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > > > > > > > The discussion about the creep problem started here: > > > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > > > > > > > and the discussion continuing here: > > > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > > > > > > > ending with a summary I gave here: > > > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > > > > > > > This should help you with the context. > > > > > > > > > > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > > > default in > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > sorry. > > > > > > We need to have a new tunable as far as I can tell. I also find the use of > > > this PMD-specific value as an arbitrary way of expressing a ratio pretty > > > gross. > > The first thing that comes to mind is that we can pin max_ptes_none to > > 255 if it exceeds 255. It's worth noting that the issue occurs only > > for adjacently enabled mTHP sizes. No! 
Presumably the default of 511 (for PMDs with 512 entries) is set for a
reason; arbitrarily changing this to suit a specific case seems crazy, no?

> >
> > ie)
> > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255
> >     temp_max_ptes_none = 255;
> Oh and my second point, introducing a new tunable to control mTHP
> collapse may become exceedingly complex from a tuning and code
> management standpoint.

Umm right now you have a ratio expressed in PTEs per mTHP * ((PTEs per PMD) /
PMD), 'except please don't set it to the usual default when using mTHP', and
it's currently default-broken.

I'm really not sure how that is simpler than a separate tunable that can be
expressed as a ratio (e.g. percentage) that actually makes some kind of sense?

And we can make anything workable from a code management point of view by
refactoring/developing appropriately.

And given you're now proposing changing the default for even THP pages with a
cap, or perhaps having mTHP being used silently change the cap - that is
clearly _far_ worse from a tuning standpoint.

With a new tunable you can just set a sensible default and people don't even
necessarily have to think about it.
On Thu, Aug 21, 2025 at 9:40 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Thu, Aug 21, 2025 at 09:27:19AM -0600, Nico Pache wrote: > > On Thu, Aug 21, 2025 at 9:25 AM Nico Pache <npache@redhat.com> wrote: > > > > > > On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes > > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > > > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote: > > > > > > > > > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote: > > > > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you > > > > > > essentially state that the whole method used to determine the ratio of PTEs to > > > > > > collapse to mTHP is broken: > > > > > > > > > > > > khugepaged uses max_ptes_none scaled to the order of the enabled > > > > > > mTHP size to determine collapses. When using mTHPs it's recommended > > > > > > to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 > > > > > > on 4k page size). This will prevent undesired "creep" behavior that > > > > > > leads to continuously collapsing to the largest mTHP size; when we > > > > > > collapse, we are bringing in new non-zero pages that will, on a > > > > > > subsequent scan, cause the max_ptes_none check of the +1 order to > > > > > > always be satisfied. By limiting this to less than half the current > > > > > > order, we make sure we don't cause this feedback > > > > > > loop. max_ptes_shared and max_ptes_swap have no effect when > > > > > > collapsing to a mTHP, and mTHP collapse will fail on shared or > > > > > > swapped out pages. > > > > > > > > > > > > This seems to me to suggest that using > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means > > > > > > of establishing a 'ratio' to do this calculation is fundamentally flawed. > > > > > > > > > > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps > > > > > > > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > > > > > > > > Or something like this? > > > > > > > > > > > > It's already questionable that we are taking a value that is expressed > > > > > > essentially in terms of PTE entries per PMD and then use it implicitly to > > > > > > determine the ratio for mTHP, but to then say 'oh but the default value is > > > > > > known-broken' is just a blocker for the series in my opinion. > > > > > > > > > > > > This really has to be done a different way I think. > > > > > > > > > > > > Cheers, Lorenzo > > > > > > > > > > FWIW this was my version of the documentation patch: > > > > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ > > > > > > > > > > The discussion about the creep problem started here: > > > > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ > > > > > > > > > > and the discussion continuing here: > > > > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ > > > > > > > > > > ending with a summary I gave here: > > > > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ > > > > > > > > > > This should help you with the context. > > > > > > > > > > > > > > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > > > > default in > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > sorry. > > > > > > > > We need to have a new tunable as far as I can tell. 
I also find the use of > > > > this PMD-specific value as an arbitrary way of expressing a ratio pretty > > > > gross. > > > The first thing that comes to mind is that we can pin max_ptes_none to > > > 255 if it exceeds 255. It's worth noting that the issue occurs only > > > for adjacently enabled mTHP sizes. > > No! Presumably the default of 511 (for PMDs with 512 entries) is set for a > reason, arbitrarily changing this to suit a specific case seems crazy no? We wouldn't be changing it for PMD collapse, just for the new behavior. At 511, no mTHP collapses would ever occur anyways, unless you have 2MB disabled and other mTHP sizes enabled. Technically at 511 only the highest enabled order always gets collapsed. Ive also argued in the past that 511 is a terrible default for anything other than thp.enabled=always, but that's a whole other can of worms we dont need to discuss now. with this cap of 255, the PMD scan/collapse would work as intended, then in mTHP collapses we would never introduce this undesired behavior. We've discussed before that this would be a hard problem to solve without introducing some expensive way of tracking what has already been through a collapse, and that doesnt even consider what happens if things change or are unmapped, and rescanning that section would be helpful. So having a strictly enforced limit of 255 actually seems like a good idea to me, as it completely avoids the undesired behavior and does not require the admins to be aware of such an issue. Another thought similar to what (IIRC) Dev has mentioned before, if we have max_ptes_none > 255 then we only consider collapses to the largest enabled order, that way no creep to the largest enabled order would occur in the first place, and we would get there straight away. To me one of these two solutions seem sane in the context of what we are dealing with. > > > > > > > ie) > > > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 > > > temp_max_ptes_none = 255; > > Oh and my second point, introducing a new tunable to control mTHP > > collapse may become exceedingly complex from a tuning and code > > management standpoint. > > Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) / > PMD) 'except please don't set to the usual default when using mTHP' and it's > currently default-broken. > > I'm really not sure how that is simpler than a seprate tunable that can be > expressed as a ratio (e.g. percentage) that actually makes some kind of sense? I agree that the current tunable wasn't designed for this, but we tried to come up with something that leverages the tunable we have to avoid new tunables and added complexity. > > And we can make anything workable from a code management point of view by > refactoring/developing appropriately. What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte (ideally the max number)? seems like we would be saying we want no new none pages, but also to allow new none pages. To me that seems equally broken and more confusing than just taking a scale of the current number (now with a cap). -- Nico > > And given you're now proposing changing the default for even THP pages with a > cap or perhaps having mTHP being used silently change the cap - that is clearly > _far_ worse from a tuning standpoint. > > With a new tunable you can just set a sensible default and people don't even > necessarily have to think about it. >
On Thu, Aug 21, 2025 at 10:46:18AM -0600, Nico Pache wrote: > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > > > > > default in > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > sorry. > > > > > > > > > > We need to have a new tunable as far as I can tell. I also find the use of > > > > > this PMD-specific value as an arbitrary way of expressing a ratio pretty > > > > > gross. > > > > The first thing that comes to mind is that we can pin max_ptes_none to > > > > 255 if it exceeds 255. It's worth noting that the issue occurs only > > > > for adjacently enabled mTHP sizes. > > > > No! Presumably the default of 511 (for PMDs with 512 entries) is set for a > > reason, arbitrarily changing this to suit a specific case seems crazy no? > We wouldn't be changing it for PMD collapse, just for the new > behavior. At 511, no mTHP collapses would ever occur anyways, unless > you have 2MB disabled and other mTHP sizes enabled. Technically at 511 > only the highest enabled order always gets collapsed. > > Ive also argued in the past that 511 is a terrible default for > anything other than thp.enabled=always, but that's a whole other can > of worms we dont need to discuss now. > > with this cap of 255, the PMD scan/collapse would work as intended, > then in mTHP collapses we would never introduce this undesired > behavior. We've discussed before that this would be a hard problem to > solve without introducing some expensive way of tracking what has > already been through a collapse, and that doesnt even consider what > happens if things change or are unmapped, and rescanning that section > would be helpful. So having a strictly enforced limit of 255 actually > seems like a good idea to me, as it completely avoids the undesired > behavior and does not require the admins to be aware of such an issue. > > Another thought similar to what (IIRC) Dev has mentioned before, if we > have max_ptes_none > 255 then we only consider collapses to the > largest enabled order, that way no creep to the largest enabled order > would occur in the first place, and we would get there straight away. > > To me one of these two solutions seem sane in the context of what we > are dealing with. > > > > > > > > > > ie) > > > > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 > > > > temp_max_ptes_none = 255; > > > Oh and my second point, introducing a new tunable to control mTHP > > > collapse may become exceedingly complex from a tuning and code > > > management standpoint. > > > > Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) / > > PMD) 'except please don't set to the usual default when using mTHP' and it's > > currently default-broken. > > > > I'm really not sure how that is simpler than a seprate tunable that can be > > expressed as a ratio (e.g. percentage) that actually makes some kind of sense? > I agree that the current tunable wasn't designed for this, but we > tried to come up with something that leverages the tunable we have to > avoid new tunables and added complexity. > > > > And we can make anything workable from a code management point of view by > > refactoring/developing appropriately. > What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte > (ideally the max number)? seems like we would be saying we want no new > none pages, but also to allow new none pages. To me that seems equally > broken and more confusing than just taking a scale of the current > number (now with a cap). 
>
>

The one thing we absolutely cannot have is a default that causes this
'creeping' behaviour. This feels like shipping something that is broken and
alluding to it in the documentation.

I spoke to David off-list and he gave some insight into this and perhaps
some reasonable means of avoiding an additional tunable.

I don't want to rehash what he said as I think it's more productive for him
to reply when he has time but broadly I think how we handle this needs
careful consideration.

To me it's clear that some sense of ratio is just immediately very very
confusing, but then again this interface is already confusing, as with much
of THP.

Anyway I'll let David respond here so we don't loop around before he has a
chance to add his input.

Cheers, Lorenzo
On Thu, Aug 21, 2025 at 10:55 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Thu, Aug 21, 2025 at 10:46:18AM -0600, Nico Pache wrote: > > > > > > Thanks and I"ll have a look, but this series is unmergeable with a broken > > > > > > default in > > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio > > > > > > sorry. > > > > > > > > > > > > We need to have a new tunable as far as I can tell. I also find the use of > > > > > > this PMD-specific value as an arbitrary way of expressing a ratio pretty > > > > > > gross. > > > > > The first thing that comes to mind is that we can pin max_ptes_none to > > > > > 255 if it exceeds 255. It's worth noting that the issue occurs only > > > > > for adjacently enabled mTHP sizes. > > > > > > No! Presumably the default of 511 (for PMDs with 512 entries) is set for a > > > reason, arbitrarily changing this to suit a specific case seems crazy no? > > We wouldn't be changing it for PMD collapse, just for the new > > behavior. At 511, no mTHP collapses would ever occur anyways, unless > > you have 2MB disabled and other mTHP sizes enabled. Technically at 511 > > only the highest enabled order always gets collapsed. > > > > Ive also argued in the past that 511 is a terrible default for > > anything other than thp.enabled=always, but that's a whole other can > > of worms we dont need to discuss now. > > > > with this cap of 255, the PMD scan/collapse would work as intended, > > then in mTHP collapses we would never introduce this undesired > > behavior. We've discussed before that this would be a hard problem to > > solve without introducing some expensive way of tracking what has > > already been through a collapse, and that doesnt even consider what > > happens if things change or are unmapped, and rescanning that section > > would be helpful. So having a strictly enforced limit of 255 actually > > seems like a good idea to me, as it completely avoids the undesired > > behavior and does not require the admins to be aware of such an issue. > > > > Another thought similar to what (IIRC) Dev has mentioned before, if we > > have max_ptes_none > 255 then we only consider collapses to the > > largest enabled order, that way no creep to the largest enabled order > > would occur in the first place, and we would get there straight away. > > > > To me one of these two solutions seem sane in the context of what we > > are dealing with. > > > > > > > > > > > > > ie) > > > > > if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 > > > > > temp_max_ptes_none = 255; > > > > Oh and my second point, introducing a new tunable to control mTHP > > > > collapse may become exceedingly complex from a tuning and code > > > > management standpoint. > > > > > > Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) / > > > PMD) 'except please don't set to the usual default when using mTHP' and it's > > > currently default-broken. > > > > > > I'm really not sure how that is simpler than a seprate tunable that can be > > > expressed as a ratio (e.g. percentage) that actually makes some kind of sense? > > I agree that the current tunable wasn't designed for this, but we > > tried to come up with something that leverages the tunable we have to > > avoid new tunables and added complexity. > > > > > > And we can make anything workable from a code management point of view by > > > refactoring/developing appropriately. > > What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte > > (ideally the max number)? 
seems like we would be saying we want no new
> > none pages, but also to allow new none pages. To me that seems equally
> > broken and more confusing than just taking a scale of the current
> > number (now with a cap).
> >
>
> The one thing we absolutely cannot have is a default that causes this
> 'creeping' behaviour. This feels like shipping something that is broken and
> alluding to it in the documentation.

Ok I've put a lot of thought and time into this and came up with a solution.

Here is what I currently have tested and would like to propose:

- Expand the bitmap to HPAGE_PMD_NR (512)*. This increases the accuracy of
  the max_ptes_none handling, and removes a lot of inaccuracies caused by
  the compression into 128 bits that was being done. This also makes the
  code a lot easier to understand.

- When attempting mTHP-level collapses, cap max_ptes_none to 255 to prevent
  the creep issue (a small sketch of this cap follows this mail).

I've tested this and found it performs better than my previous version,
allows for more granular control via max_ptes_none, and prevents the creep
issue without any admin knowledge needed. I think this is a good middle
ground between completely disabling the fine-tune control and doing a better
job at mitigating misconfiguration.

* Baolin actually also expands the bitmap to 512 in his khugepaged file mTHP
  collapse support patchset.

Does this sound reasonable to you?
-- Nico

>
> I spoke to David off-list and he gave some insight into this and perhaps
> some reasonable means of avoiding an additional tunable.
>
> I don't want to rehash what he said as I think it's more productive for him
> to reply when he has time but broadly I think how we handle this needs
> careful consideration.
>
> To me it's clear that some sense of ratio is just immediately very very
> confusing, but then again this interface is already confusing, as with much
> of THP.
>
> Anyway I'll let David respond here so we don't loop around before he has a
> chance to add his input.
>
> Cheers, Lorenzo
>
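A compact sketch of the proposed cap, assuming the same shift-based scaling
as in the earlier sketch; collapse_max_ptes_none() and the constants are
illustrative stand-ins for whatever helper the next revision actually adds.

#include <stdio.h>

#define HPAGE_PMD_ORDER	9
#define HPAGE_PMD_NR	(1u << HPAGE_PMD_ORDER)

/* Assumed behavior: PMD-sized collapse keeps the admin's value untouched,
 * while sub-PMD (mTHP) attempts are silently capped at 255 before scaling. */
static unsigned int collapse_max_ptes_none(unsigned int max_ptes_none,
					   unsigned int order)
{
	if (order != HPAGE_PMD_ORDER && max_ptes_none > HPAGE_PMD_NR / 2 - 1)
		max_ptes_none = HPAGE_PMD_NR / 2 - 1;	/* 255 with 4K pages */

	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	/* default 511: PMD keeps 511, but an order-4 (64K) attempt sees 7 */
	printf("order 9: %u\n", collapse_max_ptes_none(511, 9));
	printf("order 4: %u\n", collapse_max_ptes_none(511, 4));
	return 0;
}

With such a cap, every sub-PMD promotion step requires the region to already
be more than half populated, so the creep loop from the earlier simulation
can no longer trigger.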
On 04.09.25 04:44, Nico Pache wrote: > On Thu, Aug 21, 2025 at 10:55 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: >> >> On Thu, Aug 21, 2025 at 10:46:18AM -0600, Nico Pache wrote: >>>>>>> Thanks and I"ll have a look, but this series is unmergeable with a broken >>>>>>> default in >>>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio >>>>>>> sorry. >>>>>>> >>>>>>> We need to have a new tunable as far as I can tell. I also find the use of >>>>>>> this PMD-specific value as an arbitrary way of expressing a ratio pretty >>>>>>> gross. >>>>>> The first thing that comes to mind is that we can pin max_ptes_none to >>>>>> 255 if it exceeds 255. It's worth noting that the issue occurs only >>>>>> for adjacently enabled mTHP sizes. >>>> >>>> No! Presumably the default of 511 (for PMDs with 512 entries) is set for a >>>> reason, arbitrarily changing this to suit a specific case seems crazy no? >>> We wouldn't be changing it for PMD collapse, just for the new >>> behavior. At 511, no mTHP collapses would ever occur anyways, unless >>> you have 2MB disabled and other mTHP sizes enabled. Technically at 511 >>> only the highest enabled order always gets collapsed. >>> >>> Ive also argued in the past that 511 is a terrible default for >>> anything other than thp.enabled=always, but that's a whole other can >>> of worms we dont need to discuss now. >>> >>> with this cap of 255, the PMD scan/collapse would work as intended, >>> then in mTHP collapses we would never introduce this undesired >>> behavior. We've discussed before that this would be a hard problem to >>> solve without introducing some expensive way of tracking what has >>> already been through a collapse, and that doesnt even consider what >>> happens if things change or are unmapped, and rescanning that section >>> would be helpful. So having a strictly enforced limit of 255 actually >>> seems like a good idea to me, as it completely avoids the undesired >>> behavior and does not require the admins to be aware of such an issue. >>> >>> Another thought similar to what (IIRC) Dev has mentioned before, if we >>> have max_ptes_none > 255 then we only consider collapses to the >>> largest enabled order, that way no creep to the largest enabled order >>> would occur in the first place, and we would get there straight away. >>> >>> To me one of these two solutions seem sane in the context of what we >>> are dealing with. >>>> >>>>>> >>>>>> ie) >>>>>> if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 >>>>>> temp_max_ptes_none = 255; >>>>> Oh and my second point, introducing a new tunable to control mTHP >>>>> collapse may become exceedingly complex from a tuning and code >>>>> management standpoint. >>>> >>>> Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) / >>>> PMD) 'except please don't set to the usual default when using mTHP' and it's >>>> currently default-broken. >>>> >>>> I'm really not sure how that is simpler than a seprate tunable that can be >>>> expressed as a ratio (e.g. percentage) that actually makes some kind of sense? >>> I agree that the current tunable wasn't designed for this, but we >>> tried to come up with something that leverages the tunable we have to >>> avoid new tunables and added complexity. >>>> >>>> And we can make anything workable from a code management point of view by >>>> refactoring/developing appropriately. >>> What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte >>> (ideally the max number)? 
seems like we would be saying we want no new >>> none pages, but also to allow new none pages. To me that seems equally >>> broken and more confusing than just taking a scale of the current >>> number (now with a cap). >>> >>> >> >> The one thing we absolutely cannot have is a default that causes this >> 'creeping' behaviour. This feels like shipping something that is broken and >> alluding to it in the documentation. > Ok I've put a lot of thought and time into this and came up with a solution. > > Here is what I currently have tested and would like to proposing: > > - Expand bitmap to HPAGE_PMD_NR (512)*, this increases the accuracy of > the max_pte_none handling, and removes a lot of inaccuracies caused by > the compression into 128 bits that was being done. This also makes the > code a lot easier to understand. That sounds good to me. Should make the code easier as well. > > - When attempting mTHP level collapses cap max_ptes_none to 255 to > prevent the creep issue I guess the documentation would then state something like * When collapsing smaller THPs, "max_ptes_none" is scaled proportionally to the THP size. * When collapsing smaller THPs, "max_ptes_none" may be internally capped at 255 if it exceeds 255 but is not set to the default (511). Not 100% a fan of all of that, but maybe the only option when wanting to avoid other toggles. The only alternative would really be respecting only 0/511 for mTHP, and not doing any scaling. That would obviously make the documentation easier and would allow us to revisit that later. The documentation would be: * When collapsing smaller THPs, "max_ptes_none" may be interpreted as "0" when set to a value different to the default (511). This behavior might change in the future. > > Ive tested this and found this performs better than my previous > version, allows for more granular control via max_ptes_none, and > prevents the creep issue without any admin knowledge needed. How would this interact with the shrinker once extended to mTHP? Would your RFC patch be sufficient for that or would we actually also want to cap? I haven't fully thought this through yet. I'd assume we would not want to cap here. Which makes the doc weird as well, lol. -- Cheers David / dhildenb
> > The one thing we absolutely cannot have is a default that causes this > 'creeping' behaviour. This feels like shipping something that is broken and > alluding to it in the documentation. > > I spoke to David off-list and he gave some insight into this and perhaps > some reasonable means of avoiding an additional tunable. > > I don't want to rehash what he said as I think it's more productive for him > to reply when he has time but broadly I think how we handle this needs > careful consideration. > > To me it's clear that some sense of ratio is just immediately very very > confusing, but then again this interface is already confusing, as with much > of THP. > > Anyway I'll let David respond here so we don't loop around before he has a > chance to add his input. > > Cheers, Lorenzo > [Resending because Thunderbird decided to use the wrong smtp server] I've been summoned. As raised in the past, I would initially only support specific values here like 0 : Never collapse with any pte_none/zeropage 511 (HPAGE_PMD_NR - 1) / default : Always collapse, ignoring pte_none/zeropage One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1), but not sure if we have to add that for now. Because, as raised in the past, I'm afraid nobody on this earth has a clue how to set this parameter to values different to 0 (don't waste memory with khugepaged) and 511 (page fault behavior). If any other value is set, essentially pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); for now and just disable it. -- Cheers David / dhildenb
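A rough sketch of what that suggestion could look like (the function name and warning string are illustrative; only the 0/511 values and the idea of warning are taken from the mail above):

static bool max_ptes_none_supported(unsigned int order)
{
        /* PMD collapse keeps the existing semantics untouched */
        if (order == HPAGE_PMD_ORDER)
                return true;

        /* for mTHP, only honour the two well-defined values */
        if (khugepaged_max_ptes_none == 0 ||
            khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
                return true;

        pr_warn_once("khugepaged: unsupported 'max_ptes_none' value for mTHP collapse\n");
        return false;
}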
On Thu, Aug 21, 2025 at 10:43:35PM +0200, David Hildenbrand wrote: > > > > The one thing we absolutely cannot have is a default that causes this > > 'creeping' behaviour. This feels like shipping something that is broken and > > alluding to it in the documentation. > > > > I spoke to David off-list and he gave some insight into this and perhaps > > some reasonable means of avoiding an additional tunable. > > > > I don't want to rehash what he said as I think it's more productive for him > > to reply when he has time but broadly I think how we handle this needs > > careful consideration. > > > > To me it's clear that some sense of ratio is just immediately very very > > confusing, but then again this interface is already confusing, as with much > > of THP. > > > > Anyway I'll let David respond here so we don't loop around before he has a > > chance to add his input. > > > > Cheers, Lorenzo > > > > [Resending because Thunderbird decided to use the wrong smtp server] > > I've been summoned. Welcome :) > > As raised in the past, I would initially only support specific values here like > > 0 : Never collapse with any pte_none/zeropage > 511 (HPAGE_PMD_NR - 1) / default : Always collapse, ignoring pte_none/zeropage > OK, so if we had effectively an off/on (I guess we have to keep this as it is for legacy purposes) and it is forced to one or the other of these values, then fine (as long as we don't have uAPI worries). > Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure > if we have to add that for now. Yeah, not so sure about this; this is a 'just have to know' too, and yes you might add it to the docs, but people are going to be mightily confused, especially if it's a calculated value. I don't see any other way around having a separate tunable if we don't just have something VERY simple like on/off. Also, the mentioned issue honestly sounds like something that needs to be fixed elsewhere, in the algorithm used to figure out mTHP ranges (I may be wrong - and happy to stand corrected if this is somehow inherent, but it really feels that way). > > Because, as raised in the past, I'm afraid nobody on this earth has a clue how > to set this parameter to values different to 0 (don't waste memory with khugepaged) > and 511 (page fault behavior). Yup > > > If any other value is set, essentially > pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); > > for now and just disable it. Hmm, but under what circumstances? I would just say "unsupported value" and not mention mTHP, or people who don't use mTHP might find that confusing. > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo
>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure >> if we have to add that for now. > > Yeah not so sure about this, this is a 'just have to know' too, and yes you > might add it to the docs, but people are going to be mightily confused, esp if > it's a calculated value. > > I don't see any other way around having a separate tunable if we don't just have > something VERY simple like on/off. Yeah, not advocating that we add support for other values than 0/511, really. > > Also the mentioned issue sounds like something that needs to be fixed elsewhere > honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and > happy to stand corrected if this is somehow inherent, but reallly feels that > way). I think the creep is unavoidable for certain values. If you have the first two pages of a PMD area populated, and you allow for at least half of the #PTEs to be none/zero, you'd collapse first an order-2 folio, then an order-3 ... until you reach PMD order. So for now we really should just support 0 / 511 to say "don't collapse if there are holes" vs. "always collapse if there is at least one pte used". > >> >> Because, as raised in the past, I'm afraid nobody on this earth has a clue how >> to set this parameter to values different to 0 (don't waste memory with khugepaged) >> and 511 (page fault behavior). > > Yup > >> >> >> If any other value is set, essentially >> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); >> >> for now and just disable it. > > Hmm but under what circumstances? I would just say unsupported value not mention > mTHP or people who don't use mTHP might find that confusing. Well, we can check whether any mTHP size is enabled while the value is set to something unexpected. We can then even print the problematic sizes if we have to. We could also just say that if the value is set to something other than 511 (which is the default), it will be treated as being "0" when collapsing mTHP, instead of doing any scaling. -- Cheers David / dhildenb
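The progression David describes can be reproduced with a few lines of userspace arithmetic (purely illustrative; the "at least half populated" rule corresponds to max_ptes_none = 255 at PMD scale):

#include <stdio.h>

int main(void)
{
        unsigned int present = 2;       /* two PTEs populated to start with */

        for (unsigned int order = 2; order <= 9; order++) {
                unsigned int nr = 1u << order;  /* pages covered by this order */

                /* eligible when at least half of the range is populated */
                if (present * 2 >= nr) {
                        printf("order-%u collapse: %u present of %u\n",
                               order, present, nr);
                        present = nr;   /* the collapse fills the whole range */
                }
        }
        return 0;
}

Each collapse doubles the number of present pages, so on a later scan the next order is eligible again; that is the creep.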
(Sorry for chiming in late) On 2025/8/22 22:10, David Hildenbrand wrote: >>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>> but not sure >>> if we have to add that for now. >> >> Yeah not so sure about this, this is a 'just have to know' too, and >> yes you >> might add it to the docs, but people are going to be mightily >> confused, esp if >> it's a calculated value. >> >> I don't see any other way around having a separate tunable if we don't >> just have >> something VERY simple like on/off. > > Yeah, not advocating that we add support for other values than 0/511, > really. > >> >> Also the mentioned issue sounds like something that needs to be fixed >> elsewhere >> honestly in the algorithm used to figure out mTHP ranges (I may be >> wrong - and >> happy to stand corrected if this is somehow inherent, but reallly >> feels that >> way). > > I think the creep is unavoidable for certain values. > > If you have the first two pages of a PMD area populated, and you allow > for at least half of the #PTEs to be non/zero, you'd collapse first a > order-2 folio, then and order-3 ... until you reached PMD order. > > So for now we really should just support 0 / 511 to say "don't collapse > if there are holes" vs. "always collapse if there is at least one pte > used". If we only allow setting 0 or 511, as Nico mentioned before, "At 511, no mTHP collapses would ever occur anyway, unless you have 2MB disabled and other mTHP sizes enabled. Technically, at 511, only the highest enabled order would ever be collapsed." In other words, for the scenario you described, although there are only 2 PTEs present in a PMD, it would still get collapsed into a PMD-sized THP. In reality, what we probably need is just an order-2 mTHP collapse. If 'khugepaged_max_ptes_none' is set to 255, I think this would achieve the desired result: when there are only 2 PTEs present in a PMD, an order-2 mTHP collapse would be successed, but it wouldn’t creep up to an order-3 mTHP collapse. That’s because: When attempting an order-3 mTHP collapse, 'threshold_bits' = 1, while 'bits_set' = 1 (means only 1 chunk is present), so 'bits_set > threshold_bits' is false, then an order-3 mTHP collapse wouldn’t be attempted. No? So I have some concerns that if we only allow setting 0 or 511, it may not meet the goal we have for mTHP collapsing. >>> Because, as raised in the past, I'm afraid nobody on this earth has a >>> clue how >>> to set this parameter to values different to 0 (don't waste memory >>> with khugepaged) >>> and 511 (page fault behavior). >> >> Yup >> >>> >>> >>> If any other value is set, essentially >>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); >>> >>> for now and just disable it. >> >> Hmm but under what circumstances? I would just say unsupported value >> not mention >> mTHP or people who don't use mTHP might find that confusing. > > Well, we can check whether any mTHP size is enabled while the value is > set to something unexpected. We can then even print the problematic > sizes if we have to. > > We could also just just say that if the value is set to something else > than 511 (which is the default), it will be treated as being "0" when > collapsing mthp, instead of doing any scaling. >
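For reference, the check Baolin walks through boils down to something like the following ('threshold_bits' and 'bits_set' are the names used in the series; the scaling arithmetic here is an assumption, simplified for illustration):

static bool mthp_collapse_worthwhile(unsigned int bits_set,
                                     unsigned int nr_chunks,
                                     unsigned int max_ptes_none)
{
        /*
         * Bar derived from the PMD-relative tunable, scaled to this region:
         * the collapse is only attempted when strictly more chunks than
         * this are utilized.
         */
        unsigned int threshold_bits =
                nr_chunks * (HPAGE_PMD_NR - max_ptes_none - 1) / HPAGE_PMD_NR;

        return bits_set > threshold_bits;
}

Plugging in the numbers from the example (max_ptes_none = 255, a 2-chunk order-3 region, 1 chunk set) gives threshold_bits = 1 and bits_set = 1, so the order-3 attempt is skipped and the collapse stops at order-2.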
On 28/08/25 3:16 pm, Baolin Wang wrote: > (Sorry for chiming in late) > > On 2025/8/22 22:10, David Hildenbrand wrote: >>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>> but not sure >>>> if we have to add that for now. >>> >>> Yeah not so sure about this, this is a 'just have to know' too, and >>> yes you >>> might add it to the docs, but people are going to be mightily >>> confused, esp if >>> it's a calculated value. >>> >>> I don't see any other way around having a separate tunable if we >>> don't just have >>> something VERY simple like on/off. >> >> Yeah, not advocating that we add support for other values than 0/511, >> really. >> >>> >>> Also the mentioned issue sounds like something that needs to be >>> fixed elsewhere >>> honestly in the algorithm used to figure out mTHP ranges (I may be >>> wrong - and >>> happy to stand corrected if this is somehow inherent, but reallly >>> feels that >>> way). >> >> I think the creep is unavoidable for certain values. >> >> If you have the first two pages of a PMD area populated, and you >> allow for at least half of the #PTEs to be non/zero, you'd collapse >> first a >> order-2 folio, then and order-3 ... until you reached PMD order. >> >> So for now we really should just support 0 / 511 to say "don't >> collapse if there are holes" vs. "always collapse if there is at >> least one pte used". > > If we only allow setting 0 or 511, as Nico mentioned before, "At 511, > no mTHP collapses would ever occur anyway, unless you have 2MB > disabled and other mTHP sizes enabled. Technically, at 511, only the > highest enabled order would ever be collapsed." I didn't understand this statement. At 511, mTHP collapses will occur if khugepaged cannot get a PMD folio. Our goal is to collapse to the highest order folio. > > In other words, for the scenario you described, although there are > only 2 PTEs present in a PMD, it would still get collapsed into a > PMD-sized THP. In reality, what we probably need is just an order-2 > mTHP collapse. > > If 'khugepaged_max_ptes_none' is set to 255, I think this would > achieve the desired result: when there are only 2 PTEs present in a > PMD, an order-2 mTHP collapse would be successed, but it wouldn’t > creep up to an order-3 mTHP collapse. That’s because: > When attempting an order-3 mTHP collapse, 'threshold_bits' = 1, while > 'bits_set' = 1 (means only 1 chunk is present), so 'bits_set > > threshold_bits' is false, then an order-3 mTHP collapse wouldn’t be > attempted. No? > > So I have some concerns that if we only allow setting 0 or 511, it may > not meet the goal we have for mTHP collapsing. > >>>> Because, as raised in the past, I'm afraid nobody on this earth has >>>> a clue how >>>> to set this parameter to values different to 0 (don't waste memory >>>> with khugepaged) >>>> and 511 (page fault behavior). >>> >>> Yup >>> >>>> >>>> >>>> If any other value is set, essentially >>>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); >>>> >>>> for now and just disable it. >>> >>> Hmm but under what circumstances? I would just say unsupported value >>> not mention >>> mTHP or people who don't use mTHP might find that confusing. >> >> Well, we can check whether any mTHP size is enabled while the value >> is set to something unexpected. We can then even print the >> problematic sizes if we have to. 
>> >> We could also just just say that if the value is set to something >> else than 511 (which is the default), it will be treated as being "0" >> when collapsing mthp, instead of doing any scaling. >> >
On 2025/8/28 18:48, Dev Jain wrote: > > On 28/08/25 3:16 pm, Baolin Wang wrote: >> (Sorry for chiming in late) >> >> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>> but not sure >>>>> if we have to add that for now. >>>> >>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>> yes you >>>> might add it to the docs, but people are going to be mightily >>>> confused, esp if >>>> it's a calculated value. >>>> >>>> I don't see any other way around having a separate tunable if we >>>> don't just have >>>> something VERY simple like on/off. >>> >>> Yeah, not advocating that we add support for other values than 0/511, >>> really. >>> >>>> >>>> Also the mentioned issue sounds like something that needs to be >>>> fixed elsewhere >>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>> wrong - and >>>> happy to stand corrected if this is somehow inherent, but reallly >>>> feels that >>>> way). >>> >>> I think the creep is unavoidable for certain values. >>> >>> If you have the first two pages of a PMD area populated, and you >>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>> first a >>> order-2 folio, then and order-3 ... until you reached PMD order. >>> >>> So for now we really should just support 0 / 511 to say "don't >>> collapse if there are holes" vs. "always collapse if there is at >>> least one pte used". >> >> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >> no mTHP collapses would ever occur anyway, unless you have 2MB >> disabled and other mTHP sizes enabled. Technically, at 511, only the >> highest enabled order would ever be collapsed." > I didn't understand this statement. At 511, mTHP collapses will occur if > khugepaged cannot get a PMD folio. Our goal is to collapse to the > highest order folio. Yes, I’m not saying that it’s incorrect behavior when set to 511. What I mean is, as in the example I gave below, users may only want to allow a large order collapse when the number of present PTEs reaches half of the large folio, in order to avoid RSS bloat. So we might also need to consider whether 255 is a reasonable configuration for mTHP collapse. >> In other words, for the scenario you described, although there are >> only 2 PTEs present in a PMD, it would still get collapsed into a PMD- >> sized THP. In reality, what we probably need is just an order-2 mTHP >> collapse. >> >> If 'khugepaged_max_ptes_none' is set to 255, I think this would >> achieve the desired result: when there are only 2 PTEs present in a >> PMD, an order-2 mTHP collapse would be successed, but it wouldn’t >> creep up to an order-3 mTHP collapse. That’s because: >> When attempting an order-3 mTHP collapse, 'threshold_bits' = 1, while >> 'bits_set' = 1 (means only 1 chunk is present), so 'bits_set > >> threshold_bits' is false, then an order-3 mTHP collapse wouldn’t be >> attempted. No? >> >> So I have some concerns that if we only allow setting 0 or 511, it may >> not meet the goal we have for mTHP collapsing. >> >>>>> Because, as raised in the past, I'm afraid nobody on this earth has >>>>> a clue how >>>>> to set this parameter to values different to 0 (don't waste memory >>>>> with khugepaged) >>>>> and 511 (page fault behavior). >>>> >>>> Yup >>>> >>>>> >>>>> >>>>> If any other value is set, essentially >>>>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); >>>>> >>>>> for now and just disable it. 
>>>> >>>> Hmm but under what circumstances? I would just say unsupported value >>>> not mention >>>> mTHP or people who don't use mTHP might find that confusing. >>> >>> Well, we can check whether any mTHP size is enabled while the value >>> is set to something unexpected. We can then even print the >>> problematic sizes if we have to. >>> >>> We could also just just say that if the value is set to something >>> else than 511 (which is the default), it will be treated as being "0" >>> when collapsing mthp, instead of doing any scaling. >>> >>
On 29.08.25 03:55, Baolin Wang wrote: > > > On 2025/8/28 18:48, Dev Jain wrote: >> >> On 28/08/25 3:16 pm, Baolin Wang wrote: >>> (Sorry for chiming in late) >>> >>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>> but not sure >>>>>> if we have to add that for now. >>>>> >>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>> yes you >>>>> might add it to the docs, but people are going to be mightily >>>>> confused, esp if >>>>> it's a calculated value. >>>>> >>>>> I don't see any other way around having a separate tunable if we >>>>> don't just have >>>>> something VERY simple like on/off. >>>> >>>> Yeah, not advocating that we add support for other values than 0/511, >>>> really. >>>> >>>>> >>>>> Also the mentioned issue sounds like something that needs to be >>>>> fixed elsewhere >>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>> wrong - and >>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>> feels that >>>>> way). >>>> >>>> I think the creep is unavoidable for certain values. >>>> >>>> If you have the first two pages of a PMD area populated, and you >>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>> first a >>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>> >>>> So for now we really should just support 0 / 511 to say "don't >>>> collapse if there are holes" vs. "always collapse if there is at >>>> least one pte used". >>> >>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>> no mTHP collapses would ever occur anyway, unless you have 2MB >>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>> highest enabled order would ever be collapsed." >> I didn't understand this statement. At 511, mTHP collapses will occur if >> khugepaged cannot get a PMD folio. Our goal is to collapse to the >> highest order folio. > > Yes, I’m not saying that it’s incorrect behavior when set to 511. What I > mean is, as in the example I gave below, users may only want to allow a > large order collapse when the number of present PTEs reaches half of the > large folio, in order to avoid RSS bloat. How do these users control allocation at fault time where this parameter is completely ignored? -- Cheers David / dhildenb
On 2025/9/2 00:46, David Hildenbrand wrote: > On 29.08.25 03:55, Baolin Wang wrote: >> >> >> On 2025/8/28 18:48, Dev Jain wrote: >>> >>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>> (Sorry for chiming in late) >>>> >>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>> but not sure >>>>>>> if we have to add that for now. >>>>>> >>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>> yes you >>>>>> might add it to the docs, but people are going to be mightily >>>>>> confused, esp if >>>>>> it's a calculated value. >>>>>> >>>>>> I don't see any other way around having a separate tunable if we >>>>>> don't just have >>>>>> something VERY simple like on/off. >>>>> >>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>> really. >>>>> >>>>>> >>>>>> Also the mentioned issue sounds like something that needs to be >>>>>> fixed elsewhere >>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>> wrong - and >>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>> feels that >>>>>> way). >>>>> >>>>> I think the creep is unavoidable for certain values. >>>>> >>>>> If you have the first two pages of a PMD area populated, and you >>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>> first a >>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>> >>>>> So for now we really should just support 0 / 511 to say "don't >>>>> collapse if there are holes" vs. "always collapse if there is at >>>>> least one pte used". >>>> >>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>> highest enabled order would ever be collapsed." >>> I didn't understand this statement. At 511, mTHP collapses will occur if >>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>> highest order folio. >> >> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >> mean is, as in the example I gave below, users may only want to allow a >> large order collapse when the number of present PTEs reaches half of the >> large folio, in order to avoid RSS bloat. > > How do these users control allocation at fault time where this parameter > is completely ignored? Sorry, I did not get your point. Why does the 'max_pte_none' need to control allocation at fault time? Could you be more specific? Thanks.
On 02.09.25 04:28, Baolin Wang wrote: > > > On 2025/9/2 00:46, David Hildenbrand wrote: >> On 29.08.25 03:55, Baolin Wang wrote: >>> >>> >>> On 2025/8/28 18:48, Dev Jain wrote: >>>> >>>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>>> (Sorry for chiming in late) >>>>> >>>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>>> but not sure >>>>>>>> if we have to add that for now. >>>>>>> >>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>>> yes you >>>>>>> might add it to the docs, but people are going to be mightily >>>>>>> confused, esp if >>>>>>> it's a calculated value. >>>>>>> >>>>>>> I don't see any other way around having a separate tunable if we >>>>>>> don't just have >>>>>>> something VERY simple like on/off. >>>>>> >>>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>>> really. >>>>>> >>>>>>> >>>>>>> Also the mentioned issue sounds like something that needs to be >>>>>>> fixed elsewhere >>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>>> wrong - and >>>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>>> feels that >>>>>>> way). >>>>>> >>>>>> I think the creep is unavoidable for certain values. >>>>>> >>>>>> If you have the first two pages of a PMD area populated, and you >>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>>> first a >>>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>>> >>>>>> So for now we really should just support 0 / 511 to say "don't >>>>>> collapse if there are holes" vs. "always collapse if there is at >>>>>> least one pte used". >>>>> >>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>>> highest enabled order would ever be collapsed." >>>> I didn't understand this statement. At 511, mTHP collapses will occur if >>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>>> highest order folio. >>> >>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >>> mean is, as in the example I gave below, users may only want to allow a >>> large order collapse when the number of present PTEs reaches half of the >>> large folio, in order to avoid RSS bloat. >> >> How do these users control allocation at fault time where this parameter >> is completely ignored? > > Sorry, I did not get your point. Why does the 'max_pte_none' need to > control allocation at fault time? Could you be more specific? Thanks. The comment over khugepaged_max_ptes_none gives a hint: /* * default collapse hugepages if there is at least one pte mapped like * it would have happened if the vma was large enough during page * fault. * * Note that these are only respected if collapse was initiated by khugepaged. */ In the common case (for anything that really cares about RSS bloat) you will just a get a THP during page fault and consequently RSS bloat. As raised in my other reply, the only documented reason to set max_ptes_none=0 seems to be when an application later (after once possibly getting a THP already during page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using MADV_COLLAPSE. It's a questionable use case, that already got more problematic with mTHP and page table reclaim. 
Let me explain: Before mTHP, if someone would MADV_DONTNEED (resulting in a page table with at least one pte_none entry), there would have been no way we would get memory over-allocated afterwards with max_ptes_none=0. (1) Page faults would spot "there is a page table" and just fall back to order-0 pages. (2) khugepaged was told to not collapse through max_ptes_none=0. But now: (A) With mTHP during page-faults, we can just end up over-allocating memory in such an area again: page faults will simply spot a bunch of pte_nones around the fault area and install an mTHP. (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the page table. The next page fault will just try installing a PMD THP again, because there is no PTE table anymore. So I question the utility of max_ptes_none. If you can't tame page faults, then there is only limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. -- Cheers David / dhildenb
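From the application's side, the pattern being discussed looks roughly like this (userspace illustration only; whether the later fault installs an mTHP or a PMD THP depends on the enabled sizes and on page table reclaim, as described above):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2 * 1024 * 1024;           /* one PMD-sized region */
        char *p = aligned_alloc(len, len);

        if (!p)
                return 1;
        memset(p, 1, len);                      /* may fault in a PMD THP */
        madvise(p + 4096, len - 4096, MADV_DONTNEED);   /* give most of it back */
        p[2 * 4096] = 1;                        /* a later fault may pull in large folios again */
        free(p);
        return 0;
}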
On 02/09/2025 10:03, David Hildenbrand wrote: > On 02.09.25 04:28, Baolin Wang wrote: >> >> >> On 2025/9/2 00:46, David Hildenbrand wrote: >>> On 29.08.25 03:55, Baolin Wang wrote: >>>> >>>> >>>> On 2025/8/28 18:48, Dev Jain wrote: >>>>> >>>>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>>>> (Sorry for chiming in late) >>>>>> >>>>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>>>> but not sure >>>>>>>>> if we have to add that for now. >>>>>>>> >>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>>>> yes you >>>>>>>> might add it to the docs, but people are going to be mightily >>>>>>>> confused, esp if >>>>>>>> it's a calculated value. >>>>>>>> >>>>>>>> I don't see any other way around having a separate tunable if we >>>>>>>> don't just have >>>>>>>> something VERY simple like on/off. >>>>>>> >>>>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>>>> really. >>>>>>> >>>>>>>> >>>>>>>> Also the mentioned issue sounds like something that needs to be >>>>>>>> fixed elsewhere >>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>>>> wrong - and >>>>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>>>> feels that >>>>>>>> way). >>>>>>> >>>>>>> I think the creep is unavoidable for certain values. >>>>>>> >>>>>>> If you have the first two pages of a PMD area populated, and you >>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>>>> first a >>>>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>>>> >>>>>>> So for now we really should just support 0 / 511 to say "don't >>>>>>> collapse if there are holes" vs. "always collapse if there is at >>>>>>> least one pte used". >>>>>> >>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>>>> highest enabled order would ever be collapsed." >>>>> I didn't understand this statement. At 511, mTHP collapses will occur if >>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>>>> highest order folio. >>>> >>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >>>> mean is, as in the example I gave below, users may only want to allow a >>>> large order collapse when the number of present PTEs reaches half of the >>>> large folio, in order to avoid RSS bloat. >>> >>> How do these users control allocation at fault time where this parameter >>> is completely ignored? >> >> Sorry, I did not get your point. Why does the 'max_pte_none' need to >> control allocation at fault time? Could you be more specific? Thanks. > > The comment over khugepaged_max_ptes_none gives a hint: > > /* > * default collapse hugepages if there is at least one pte mapped like > * it would have happened if the vma was large enough during page > * fault. > * > * Note that these are only respected if collapse was initiated by khugepaged. > */ > > In the common case (for anything that really cares about RSS bloat) you will just a > get a THP during page fault and consequently RSS bloat. 
> > As raised in my other reply, the only documented reason to set max_ptes_none=0 seems > to be when an application later (after once possibly getting a THP already during > page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using > MADV_COLLAPSE. > > It's a questionable use case, that already got more problematic with mTHP and page > table reclaim. > > Let me explain: > > Before mTHP, if someone would MADV_DONTNEED (resulting in > a page table with at least one pte_none entry), there would have been no way we would > get memory over-allocated afterwards with max_ptes_none=0. > > (1) Page faults would spot "there is a page table" and just fallback to order-0 pages. > (2) khugepaged was told to not collapse through max_ptes_none=0. > > But now: > > (A) With mTHP during page-faults, we can just end up over-allocating memory in such > an area again: page faults will simply spot a bunch of pte_nones around the fault area > and install an mTHP. > > (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the > page table. The next page fault will just try installing a PMD THP again, because there is > no PTE table anymore. > > So I question the utility of max_ptes_none. If you can't tame page faults, then there is only > limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some > corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. > > For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and will break down those hugepages and free up zero-filled memory. I have seen in our prod workloads where the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits of THPs like lower TLB misses. I do agree that the value of max_ptes_none is magical and different workloads can react very differently to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean that the memory regression of using THP=always vs THP=madvise is halved.
On 02.09.25 12:34, Usama Arif wrote: > > > On 02/09/2025 10:03, David Hildenbrand wrote: >> On 02.09.25 04:28, Baolin Wang wrote: >>> >>> >>> On 2025/9/2 00:46, David Hildenbrand wrote: >>>> On 29.08.25 03:55, Baolin Wang wrote: >>>>> >>>>> >>>>> On 2025/8/28 18:48, Dev Jain wrote: >>>>>> >>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>>>>> (Sorry for chiming in late) >>>>>>> >>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>>>>> but not sure >>>>>>>>>> if we have to add that for now. >>>>>>>>> >>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>>>>> yes you >>>>>>>>> might add it to the docs, but people are going to be mightily >>>>>>>>> confused, esp if >>>>>>>>> it's a calculated value. >>>>>>>>> >>>>>>>>> I don't see any other way around having a separate tunable if we >>>>>>>>> don't just have >>>>>>>>> something VERY simple like on/off. >>>>>>>> >>>>>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>>>>> really. >>>>>>>> >>>>>>>>> >>>>>>>>> Also the mentioned issue sounds like something that needs to be >>>>>>>>> fixed elsewhere >>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>>>>> wrong - and >>>>>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>>>>> feels that >>>>>>>>> way). >>>>>>>> >>>>>>>> I think the creep is unavoidable for certain values. >>>>>>>> >>>>>>>> If you have the first two pages of a PMD area populated, and you >>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>>>>> first a >>>>>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>>>>> >>>>>>>> So for now we really should just support 0 / 511 to say "don't >>>>>>>> collapse if there are holes" vs. "always collapse if there is at >>>>>>>> least one pte used". >>>>>>> >>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>>>>> highest enabled order would ever be collapsed." >>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if >>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>>>>> highest order folio. >>>>> >>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >>>>> mean is, as in the example I gave below, users may only want to allow a >>>>> large order collapse when the number of present PTEs reaches half of the >>>>> large folio, in order to avoid RSS bloat. >>>> >>>> How do these users control allocation at fault time where this parameter >>>> is completely ignored? >>> >>> Sorry, I did not get your point. Why does the 'max_pte_none' need to >>> control allocation at fault time? Could you be more specific? Thanks. >> >> The comment over khugepaged_max_ptes_none gives a hint: >> >> /* >> * default collapse hugepages if there is at least one pte mapped like >> * it would have happened if the vma was large enough during page >> * fault. >> * >> * Note that these are only respected if collapse was initiated by khugepaged. >> */ >> >> In the common case (for anything that really cares about RSS bloat) you will just a >> get a THP during page fault and consequently RSS bloat. 
>> >> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems >> to be when an application later (after once possibly getting a THP already during >> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using >> MADV_COLLAPSE. >> >> It's a questionable use case, that already got more problematic with mTHP and page >> table reclaim. >> >> Let me explain: >> >> Before mTHP, if someone would MADV_DONTNEED (resulting in >> a page table with at least one pte_none entry), there would have been no way we would >> get memory over-allocated afterwards with max_ptes_none=0. >> >> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages. >> (2) khugepaged was told to not collapse through max_ptes_none=0. >> >> But now: >> >> (A) With mTHP during page-faults, we can just end up over-allocating memory in such >> an area again: page faults will simply spot a bunch of pte_nones around the fault area >> and install an mTHP. >> >> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the >> page table. The next page fault will just try installing a PMD THP again, because there is >> no PTE table anymore. >> >> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. >> >> > > For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter > memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and > will break down those hugepages and free up zero-filled memory. You are not really taming page faults, though, you are undoing what page faults might have messed up :) I have seen in our prod workloads where > the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, > the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits > of THPs like lower TLB misses. Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes zero pages. > > I do agree that the value of max_ptes_none is magical and different workloads can react very differently > to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean > that the memory regression of using THP=always vs THP=madvise is halved. To which value would you set it? Just 510? 0? -- Cheers David / dhildenb
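The anti-thrashing relationship referred to here can be sketched as follows (helper name illustrative; the comparison mirrors how the underused-THP shrinker reuses the same tunable, with the exact form assumed):

static bool thp_worth_splitting(unsigned int nr_zero_filled)
{
        /* at the default (511) the shrinker leaves collapsed THPs alone */
        if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
                return false;

        /*
         * Split only when more subpages are zero-filled than khugepaged
         * would tolerate as "none" for a collapse, so khugepaged will not
         * simply re-collapse what the shrinker just split.
         */
        return nr_zero_filled > khugepaged_max_ptes_none;
}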
On 02/09/2025 12:03, David Hildenbrand wrote: > On 02.09.25 12:34, Usama Arif wrote: >> >> >> On 02/09/2025 10:03, David Hildenbrand wrote: >>> On 02.09.25 04:28, Baolin Wang wrote: >>>> >>>> >>>> On 2025/9/2 00:46, David Hildenbrand wrote: >>>>> On 29.08.25 03:55, Baolin Wang wrote: >>>>>> >>>>>> >>>>>> On 2025/8/28 18:48, Dev Jain wrote: >>>>>>> >>>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>>>>>> (Sorry for chiming in late) >>>>>>>> >>>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>>>>>> but not sure >>>>>>>>>>> if we have to add that for now. >>>>>>>>>> >>>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>>>>>> yes you >>>>>>>>>> might add it to the docs, but people are going to be mightily >>>>>>>>>> confused, esp if >>>>>>>>>> it's a calculated value. >>>>>>>>>> >>>>>>>>>> I don't see any other way around having a separate tunable if we >>>>>>>>>> don't just have >>>>>>>>>> something VERY simple like on/off. >>>>>>>>> >>>>>>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>>>>>> really. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Also the mentioned issue sounds like something that needs to be >>>>>>>>>> fixed elsewhere >>>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>>>>>> wrong - and >>>>>>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>>>>>> feels that >>>>>>>>>> way). >>>>>>>>> >>>>>>>>> I think the creep is unavoidable for certain values. >>>>>>>>> >>>>>>>>> If you have the first two pages of a PMD area populated, and you >>>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>>>>>> first a >>>>>>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>>>>>> >>>>>>>>> So for now we really should just support 0 / 511 to say "don't >>>>>>>>> collapse if there are holes" vs. "always collapse if there is at >>>>>>>>> least one pte used". >>>>>>>> >>>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>>>>>> highest enabled order would ever be collapsed." >>>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if >>>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>>>>>> highest order folio. >>>>>> >>>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >>>>>> mean is, as in the example I gave below, users may only want to allow a >>>>>> large order collapse when the number of present PTEs reaches half of the >>>>>> large folio, in order to avoid RSS bloat. >>>>> >>>>> How do these users control allocation at fault time where this parameter >>>>> is completely ignored? >>>> >>>> Sorry, I did not get your point. Why does the 'max_pte_none' need to >>>> control allocation at fault time? Could you be more specific? Thanks. >>> >>> The comment over khugepaged_max_ptes_none gives a hint: >>> >>> /* >>> * default collapse hugepages if there is at least one pte mapped like >>> * it would have happened if the vma was large enough during page >>> * fault. >>> * >>> * Note that these are only respected if collapse was initiated by khugepaged. >>> */ >>> >>> In the common case (for anything that really cares about RSS bloat) you will just a >>> get a THP during page fault and consequently RSS bloat. 
>>> >>> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems >>> to be when an application later (after once possibly getting a THP already during >>> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using >>> MADV_COLLAPSE. >>> >>> It's a questionable use case, that already got more problematic with mTHP and page >>> table reclaim. >>> >>> Let me explain: >>> >>> Before mTHP, if someone would MADV_DONTNEED (resulting in >>> a page table with at least one pte_none entry), there would have been no way we would >>> get memory over-allocated afterwards with max_ptes_none=0. >>> >>> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages. >>> (2) khugepaged was told to not collapse through max_ptes_none=0. >>> >>> But now: >>> >>> (A) With mTHP during page-faults, we can just end up over-allocating memory in such >>> an area again: page faults will simply spot a bunch of pte_nones around the fault area >>> and install an mTHP. >>> >>> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the >>> page table. The next page fault will just try installing a PMD THP again, because there is >>> no PTE table anymore. >>> >>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. >>> >>> >> >> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter >> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and >> will break down those hugepages and free up zero-filled memory. > > You are not really taming page faults, though, you are undoing what page faults might have messed up :) > > I have seen in our prod workloads where >> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, >> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits >> of THPs like lower TLB misses. > > Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. > Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. > There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes > zero pages. > >> >> I do agree that the value of max_ptes_none is magical and different workloads can react very differently >> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean >> that the memory regression of using THP=always vs THP=madvise is halved. > > To which value would you set it? Just 510? 0? > There are some very large workloads in the meta fleet that I experimented with and found that having a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal comprimise in terms of application metrics improving, having an acceptable amount of memory regression and improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out there for these workloads, but not possible to experiment with every value. 
In terms of wider rollout across the fleet, we are going to target 0 (or a very very small value) when moving from THP=madvise to always. Mainly because it is the least likely to cause a memory regression, as the THP shrinker will deal with page faults faulting in mostly zero-filled pages and khugepaged won't collapse pages that are dominated by 4K zero-filled chunks.
On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 02/09/2025 12:03, David Hildenbrand wrote: > > On 02.09.25 12:34, Usama Arif wrote: > >> > >> > >> On 02/09/2025 10:03, David Hildenbrand wrote: > >>> On 02.09.25 04:28, Baolin Wang wrote: > >>>> > >>>> > >>>> On 2025/9/2 00:46, David Hildenbrand wrote: > >>>>> On 29.08.25 03:55, Baolin Wang wrote: > >>>>>> > >>>>>> > >>>>>> On 2025/8/28 18:48, Dev Jain wrote: > >>>>>>> > >>>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote: > >>>>>>>> (Sorry for chiming in late) > >>>>>>>> > >>>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote: > >>>>>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), > >>>>>>>>>>> but not sure > >>>>>>>>>>> if we have to add that for now. > >>>>>>>>>> > >>>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and > >>>>>>>>>> yes you > >>>>>>>>>> might add it to the docs, but people are going to be mightily > >>>>>>>>>> confused, esp if > >>>>>>>>>> it's a calculated value. > >>>>>>>>>> > >>>>>>>>>> I don't see any other way around having a separate tunable if we > >>>>>>>>>> don't just have > >>>>>>>>>> something VERY simple like on/off. > >>>>>>>>> > >>>>>>>>> Yeah, not advocating that we add support for other values than 0/511, > >>>>>>>>> really. > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Also the mentioned issue sounds like something that needs to be > >>>>>>>>>> fixed elsewhere > >>>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be > >>>>>>>>>> wrong - and > >>>>>>>>>> happy to stand corrected if this is somehow inherent, but reallly > >>>>>>>>>> feels that > >>>>>>>>>> way). > >>>>>>>>> > >>>>>>>>> I think the creep is unavoidable for certain values. > >>>>>>>>> > >>>>>>>>> If you have the first two pages of a PMD area populated, and you > >>>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse > >>>>>>>>> first a > >>>>>>>>> order-2 folio, then and order-3 ... until you reached PMD order. > >>>>>>>>> > >>>>>>>>> So for now we really should just support 0 / 511 to say "don't > >>>>>>>>> collapse if there are holes" vs. "always collapse if there is at > >>>>>>>>> least one pte used". > >>>>>>>> > >>>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, > >>>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB > >>>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the > >>>>>>>> highest enabled order would ever be collapsed." > >>>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if > >>>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the > >>>>>>> highest order folio. > >>>>>> > >>>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I > >>>>>> mean is, as in the example I gave below, users may only want to allow a > >>>>>> large order collapse when the number of present PTEs reaches half of the > >>>>>> large folio, in order to avoid RSS bloat. > >>>>> > >>>>> How do these users control allocation at fault time where this parameter > >>>>> is completely ignored? > >>>> > >>>> Sorry, I did not get your point. Why does the 'max_pte_none' need to > >>>> control allocation at fault time? Could you be more specific? Thanks. > >>> > >>> The comment over khugepaged_max_ptes_none gives a hint: > >>> > >>> /* > >>> * default collapse hugepages if there is at least one pte mapped like > >>> * it would have happened if the vma was large enough during page > >>> * fault. 
> >>> * > >>> * Note that these are only respected if collapse was initiated by khugepaged. > >>> */ > >>> > >>> In the common case (for anything that really cares about RSS bloat) you will just a > >>> get a THP during page fault and consequently RSS bloat. > >>> > >>> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems > >>> to be when an application later (after once possibly getting a THP already during > >>> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using > >>> MADV_COLLAPSE. > >>> > >>> It's a questionable use case, that already got more problematic with mTHP and page > >>> table reclaim. > >>> > >>> Let me explain: > >>> > >>> Before mTHP, if someone would MADV_DONTNEED (resulting in > >>> a page table with at least one pte_none entry), there would have been no way we would > >>> get memory over-allocated afterwards with max_ptes_none=0. > >>> > >>> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages. > >>> (2) khugepaged was told to not collapse through max_ptes_none=0. > >>> > >>> But now: > >>> > >>> (A) With mTHP during page-faults, we can just end up over-allocating memory in such > >>> an area again: page faults will simply spot a bunch of pte_nones around the fault area > >>> and install an mTHP. > >>> > >>> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the > >>> page table. The next page fault will just try installing a PMD THP again, because there is > >>> no PTE table anymore. > >>> > >>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only > >>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some > >>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. > >>> > >>> > >> > >> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter > >> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and > >> will break down those hugepages and free up zero-filled memory. > > > > You are not really taming page faults, though, you are undoing what page faults might have messed up :) > > > > I have seen in our prod workloads where > >> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, > >> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits > >> of THPs like lower TLB misses. > > > > Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. > > > > Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. I believe with mTHP support in khugepaged, the max_ptes_none value in the shrinker must also leverage the 'order' scaling to properly prevent thrashing. I've been testing a patch for this that I might include in the V11. > > > There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes > > zero pages. > > > >> > >> I do agree that the value of max_ptes_none is magical and different workloads can react very differently > >> to it. The relationship is definitely not linear. i.e. 
if I use max_ptes_none = 256, it does not mean > >> that the memory regression of using THP=always vs THP=madvise is halved. > > > > To which value would you set it? Just 510? 0? > > > > There are some very large workloads in the meta fleet that I experimented with and found that having > a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal > comprimise in terms of application metrics improving, having an acceptable amount of memory regression and > improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out > there for these workloads, but not possible to experiment with every value. > > In terms of wider rollout across the fleet, we are going to target 0 (or a very very small value) > when moving from THP=madvise to always. Mainly because it is the least likely to cause a memory regression as > THP shrinker will deal with page faults faulting in mostly zero-filled pages and khugepaged wont collapse > pages that are dominated by 4K zero-filled chunks. >
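What Nico describes testing would presumably look something like the sketch below, reusing the same order scaling on the shrinker side (entirely hypothetical, not taken from any posted patch):

static bool mthp_worth_splitting(struct folio *folio,
                                 unsigned int nr_zero_filled)
{
        unsigned int order = folio_order(folio);
        unsigned int max_none = khugepaged_max_ptes_none;

        if (max_none == HPAGE_PMD_NR - 1)
                return false;

        /* scale the PMD-relative tunable to this folio's order */
        if (order != HPAGE_PMD_ORDER)
                max_none >>= HPAGE_PMD_ORDER - order;

        return nr_zero_filled > max_none;
}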
On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote: > On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@gmail.com> wrote: > > >>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only > > >>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some > > >>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. > > >>> > > >>> > > >> > > >> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter > > >> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and > > >> will break down those hugepages and free up zero-filled memory. > > > > > > You are not really taming page faults, though, you are undoing what page faults might have messed up :) > > > > > > I have seen in our prod workloads where > > >> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, > > >> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits > > >> of THPs like lower TLB misses. > > > > > > Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. > > > > > > > Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. > I believe with mTHP support in khugepaged, the max_ptes_none value in > the shrinker must also leverage the 'order' scaling to properly > prevent thrashing. No please do not extend this 'scalling' stuff somewhere else, it's really horrid. We have to find an alternative to that, it's extremely confusing in what is already extremely confusing THP code. As I said before, if we can't have a boolean we need another interface, which makes most sense to be a ratio or in practice, a percentage sysctl. Speaking with David off-list, maybe the answer - if we must have this - is to add a new percentage interface and have this in lock-step with the existing max_ptes_none interface. One updates the other, but internally we're just using the percentage value. > I've been testing a patch for this that I might include in the V11. > > > > > There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes > > > zero pages. > > > > > >> > > >> I do agree that the value of max_ptes_none is magical and different workloads can react very differently > > >> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean > > >> that the memory regression of using THP=always vs THP=madvise is halved. > > > > > > To which value would you set it? Just 510? 0? > > > > > > > There are some very large workloads in the meta fleet that I experimented with and found that having > > a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal > > comprimise in terms of application metrics improving, having an acceptable amount of memory regression and > > improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out > > there for these workloads, but not possible to experiment with every value. (->Usama) It's a pity that such workloads exist. But then the percentage solution should work. 
> > > > In terms of wider rollout across the fleet, we are going to target 0 (or a very very small value) > > when moving from THP=madvise to always. Mainly because it is the least likely to cause a memory regression as > > THP shrinker will deal with page faults faulting in mostly zero-filled pages and khugepaged won't collapse > > pages that are dominated by 4K zero-filled chunks. > > (->Usama) Interesting though that you've decided against doing this fleetwide... I wonder then again whether we truly need non-boolean values. But the fact that workloads might theoretically exist where it's useful does make me think we have to have this, sadly. Cheers, Lorenzo
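To make the lock-step proposal concrete, here is a minimal standalone sketch of the conversion it implies. The idea of a percentage knob mirrored onto max_ptes_none comes from the mail above; the helper names and rounding choices below are illustrative assumptions, not code from the series.

#include <stdio.h>

#define HPAGE_PMD_NR 512	/* PTEs per PMD on x86-64 with 4K pages */

/*
 * Hypothetical lock-step conversion: writing either knob would update the
 * other, but internally only the percentage is used for decisions.
 */
static unsigned int ptes_to_percent(unsigned int max_ptes_none)
{
	return max_ptes_none * 100 / (HPAGE_PMD_NR - 1);
}

static unsigned int percent_to_ptes(unsigned int percent)
{
	return percent * (HPAGE_PMD_NR - 1) / 100;
}

int main(void)
{
	unsigned int vals[] = { 0, 51, 255, 256, 511 };

	for (unsigned int i = 0; i < 5; i++)
		printf("max_ptes_none=%3u -> %3u%%\n",
		       vals[i], ptes_to_percent(vals[i]));
	printf("50%% -> max_ptes_none=%u\n", percent_to_ptes(50));
	return 0;
}

With integer rounding, 50% maps back to 255 (the HPAGE_PMD_NR / 2 - 1 value discussed elsewhere in the thread) and 511 maps to 100%, so the two views stay consistent at the values people actually argue about.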
On 05.09.25 13:48, Lorenzo Stoakes wrote: > On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote: >> On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >>>>>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >>>>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. >>>>>> >>>>>> >>>>> >>>>> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter >>>>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and >>>>> will break down those hugepages and free up zero-filled memory. >>>> >>>> You are not really taming page faults, though, you are undoing what page faults might have messed up :) >>>> >>>> I have seen in our prod workloads where >>>>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, >>>>> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits >>>>> of THPs like lower TLB misses. >>>> >>>> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. >>>> >>> >>> Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. >> I believe with mTHP support in khugepaged, the max_ptes_none value in >> the shrinker must also leverage the 'order' scaling to properly >> prevent thrashing. > > No please do not extend this 'scalling' stuff somewhere else, it's really horrid. > > We have to find an alternative to that, it's extremely confusing in what is > already extremely confusing THP code. > > As I said before, if we can't have a boolean we need another interface, which > makes most sense to be a ratio or in practice, a percentage sysctl. > > Speaking with David off-list, maybe the answer - if we must have this - is to > add a new percentage interface and have this in lock-step with the existing > max_ptes_none interface. One updates the other, but internally we're just using > the percentage value. Yes, I'll try hacking something up and sending it as an RFC. > >> I've been testing a patch for this that I might include in the V11. >>> >>>> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes >>>> zero pages. >>>> >>>>> >>>>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently >>>>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean >>>>> that the memory regression of using THP=always vs THP=madvise is halved. >>>> >>>> To which value would you set it? Just 510? 0? Sorry, I missed Usama's reply. Thanks Usama! >>>> >>> >>> There are some very large workloads in the meta fleet that I experimented with and found that having >>> a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal >>> comprimise in terms of application metrics improving, having an acceptable amount of memory regression and >>> improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out >>> there for these workloads, but not possible to experiment with every value. 
> > (->Usama) It's a pity that such workloads exist. But then the percentage solution should work. Good. So if there is no strong case for > 255, that's already valuable for mTHP. -- Cheers David / dhildenb
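For readers following along, the 'order scaling' being objected to works roughly as follows. The shift-by-order-delta formula is my reading of the cover letter ("max_ptes_none will be scaled by the attempted collapse order"), so treat this standalone sketch as illustrative rather than a copy of the patches.

#include <stdio.h>

#define HPAGE_PMD_ORDER 9	/* 512 PTEs per PMD */

/*
 * Scale the PMD-level knob down to a smaller collapse order so the
 * permitted "empty" fraction stays roughly constant.
 */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	const unsigned int max_ptes_none = 511;	/* current default */

	for (unsigned int order = 2; order <= HPAGE_PMD_ORDER; order++)
		printf("order %u (%3u pages): up to %3u none/zero PTEs allowed\n",
		       order, 1u << order,
		       scaled_max_ptes_none(max_ptes_none, order));
	return 0;
}

It is exactly this derived per-order threshold that the thread argues is hard to reason about compared with plain 0/511 or a percentage.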
On 05/09/2025 12:55, David Hildenbrand wrote: > On 05.09.25 13:48, Lorenzo Stoakes wrote: >> On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote: >>> On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >>>>>>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >>>>>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. >>>>>>> >>>>>>> >>>>>> >>>>>> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter >>>>>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and >>>>>> will break down those hugepages and free up zero-filled memory. >>>>> >>>>> You are not really taming page faults, though, you are undoing what page faults might have messed up :) >>>>> >>>>> I have seen in our prod workloads where >>>>>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, >>>>>> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits >>>>>> of THPs like lower TLB misses. >>>>> >>>>> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. >>>>> >>>> >>>> Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. >>> I believe with mTHP support in khugepaged, the max_ptes_none value in >>> the shrinker must also leverage the 'order' scaling to properly >>> prevent thrashing. >> >> No please do not extend this 'scalling' stuff somewhere else, it's really horrid. >> >> We have to find an alternative to that, it's extremely confusing in what is >> already extremely confusing THP code. >> >> As I said before, if we can't have a boolean we need another interface, which >> makes most sense to be a ratio or in practice, a percentage sysctl. >> >> Speaking with David off-list, maybe the answer - if we must have this - is to >> add a new percentage interface and have this in lock-step with the existing >> max_ptes_none interface. One updates the other, but internally we're just using >> the percentage value. > > Yes, I'll try hacking something up and sending it as an RFC. > >> >>> I've been testing a patch for this that I might include in the V11. >>>> >>>>> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes >>>>> zero pages. >>>>> >>>>>> >>>>>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently >>>>>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean >>>>>> that the memory regression of using THP=always vs THP=madvise is halved. >>>>> >>>>> To which value would you set it? Just 510? 0? > > Sorry, I missed Usama's reply. Thanks Usama! > >>>>> >>>> >>>> There are some very large workloads in the meta fleet that I experimented with and found that having >>>> a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal >>>> comprimise in terms of application metrics improving, having an acceptable amount of memory regression and >>>> improved system level metrics (lower TLB misses, lower page faults). 
I am sure there was a better value out >>>> there for these workloads, but not possible to experiment with every value. >> >> (->Usama) It's a pity that such workloads exist. But then the percentage solution should work. > > Good. So if there is no strong case for > 255, that's already valuable for mTHP. tbh the default value of 511 is horrible. I have thought about sending a patch to change it to 0 as default in upstream for some time, but it might mean that people who upgrade their kernel might suddenly see their memory not getting hugified and it could be confusing for them?
On 05.09.25 14:31, Usama Arif wrote: > > > On 05/09/2025 12:55, David Hildenbrand wrote: >> On 05.09.25 13:48, Lorenzo Stoakes wrote: >>> On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote: >>>> On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >>>>>>>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >>>>>>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter >>>>>>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and >>>>>>> will break down those hugepages and free up zero-filled memory. >>>>>> >>>>>> You are not really taming page faults, though, you are undoing what page faults might have messed up :) >>>>>> >>>>>> I have seen in our prod workloads where >>>>>>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, >>>>>>> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits >>>>>>> of THPs like lower TLB misses. >>>>>> >>>>>> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. >>>>>> >>>>> >>>>> Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. >>>> I believe with mTHP support in khugepaged, the max_ptes_none value in >>>> the shrinker must also leverage the 'order' scaling to properly >>>> prevent thrashing. >>> >>> No please do not extend this 'scalling' stuff somewhere else, it's really horrid. >>> >>> We have to find an alternative to that, it's extremely confusing in what is >>> already extremely confusing THP code. >>> >>> As I said before, if we can't have a boolean we need another interface, which >>> makes most sense to be a ratio or in practice, a percentage sysctl. >>> >>> Speaking with David off-list, maybe the answer - if we must have this - is to >>> add a new percentage interface and have this in lock-step with the existing >>> max_ptes_none interface. One updates the other, but internally we're just using >>> the percentage value. >> >> Yes, I'll try hacking something up and sending it as an RFC. >> >>> >>>> I've been testing a patch for this that I might include in the V11. >>>>> >>>>>> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes >>>>>> zero pages. >>>>>> >>>>>>> >>>>>>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently >>>>>>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean >>>>>>> that the memory regression of using THP=always vs THP=madvise is halved. >>>>>> >>>>>> To which value would you set it? Just 510? 0? >> >> Sorry, I missed Usama's reply. Thanks Usama! >> >>>>>> >>>>> >>>>> There are some very large workloads in the meta fleet that I experimented with and found that having >>>>> a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 
51 was found to be an optimal >>>>> compromise in terms of application metrics improving, having an acceptable amount of memory regression and >>>>> improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out >>>>> there for these workloads, but not possible to experiment with every value. >>> >>> (->Usama) It's a pity that such workloads exist. But then the percentage solution should work. >> >> Good. So if there is no strong case for > 255, that's already valuable for mTHP. >> > > tbh the default value of 511 is horrible. I have thought about sending a patch to change it to 0 as default > in upstream for some time, but it might mean that people who upgrade their kernel might suddenly see > their memory not getting hugified and it could be confusing for them? 511 is just what a page fault would have done, so I think it makes perfect sense. More than anything else, actually. It's just not optimal in many cases. -- Cheers David / dhildenb
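The "511 is just what a page fault would have done" point can be seen from a simplified model of the khugepaged scan. The real check lives in mm/khugepaged.c and counts none and zero-page PTEs; this standalone version only models present/absent, so take it as a sketch of the semantics, not the implementation.

#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_NR 512

/*
 * Simplified eligibility check: refuse to collapse once the number of
 * empty PTEs exceeds max_ptes_none.  511 means "one present PTE is
 * enough", which is what THP=always would have produced at fault time;
 * 0 means "never fill holes".
 */
static bool collapse_allowed(const bool *pte_present, unsigned int max_ptes_none)
{
	unsigned int none_or_zero = 0;

	for (unsigned int i = 0; i < HPAGE_PMD_NR; i++) {
		if (!pte_present[i] && ++none_or_zero > max_ptes_none)
			return false;
	}
	return true;
}

int main(void)
{
	bool ptes[HPAGE_PMD_NR] = { [0] = true };	/* a single faulted-in page */

	printf("max_ptes_none=511: %s\n", collapse_allowed(ptes, 511) ? "collapse" : "skip");
	printf("max_ptes_none=0:   %s\n", collapse_allowed(ptes, 0) ? "collapse" : "skip");
	return 0;
}

The knob itself is the documented sysfs file /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none, so a fleet rollout like the one described above amounts to writing 0 (or a small value) there.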
On 2025/9/3 04:23, Usama Arif wrote: > > > On 02/09/2025 12:03, David Hildenbrand wrote: >> On 02.09.25 12:34, Usama Arif wrote: >>> >>> >>> On 02/09/2025 10:03, David Hildenbrand wrote: >>>> On 02.09.25 04:28, Baolin Wang wrote: >>>>> >>>>> >>>>> On 2025/9/2 00:46, David Hildenbrand wrote: >>>>>> On 29.08.25 03:55, Baolin Wang wrote: >>>>>>> >>>>>>> >>>>>>> On 2025/8/28 18:48, Dev Jain wrote: >>>>>>>> >>>>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote: >>>>>>>>> (Sorry for chiming in late) >>>>>>>>> >>>>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote: >>>>>>>>>>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), >>>>>>>>>>>> but not sure >>>>>>>>>>>> if we have to add that for now. >>>>>>>>>>> >>>>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and >>>>>>>>>>> yes you >>>>>>>>>>> might add it to the docs, but people are going to be mightily >>>>>>>>>>> confused, esp if >>>>>>>>>>> it's a calculated value. >>>>>>>>>>> >>>>>>>>>>> I don't see any other way around having a separate tunable if we >>>>>>>>>>> don't just have >>>>>>>>>>> something VERY simple like on/off. >>>>>>>>>> >>>>>>>>>> Yeah, not advocating that we add support for other values than 0/511, >>>>>>>>>> really. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Also the mentioned issue sounds like something that needs to be >>>>>>>>>>> fixed elsewhere >>>>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be >>>>>>>>>>> wrong - and >>>>>>>>>>> happy to stand corrected if this is somehow inherent, but reallly >>>>>>>>>>> feels that >>>>>>>>>>> way). >>>>>>>>>> >>>>>>>>>> I think the creep is unavoidable for certain values. >>>>>>>>>> >>>>>>>>>> If you have the first two pages of a PMD area populated, and you >>>>>>>>>> allow for at least half of the #PTEs to be non/zero, you'd collapse >>>>>>>>>> first a >>>>>>>>>> order-2 folio, then and order-3 ... until you reached PMD order. >>>>>>>>>> >>>>>>>>>> So for now we really should just support 0 / 511 to say "don't >>>>>>>>>> collapse if there are holes" vs. "always collapse if there is at >>>>>>>>>> least one pte used". >>>>>>>>> >>>>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511, >>>>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB >>>>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the >>>>>>>>> highest enabled order would ever be collapsed." >>>>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if >>>>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the >>>>>>>> highest order folio. >>>>>>> >>>>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I >>>>>>> mean is, as in the example I gave below, users may only want to allow a >>>>>>> large order collapse when the number of present PTEs reaches half of the >>>>>>> large folio, in order to avoid RSS bloat. >>>>>> >>>>>> How do these users control allocation at fault time where this parameter >>>>>> is completely ignored? >>>>> >>>>> Sorry, I did not get your point. Why does the 'max_pte_none' need to >>>>> control allocation at fault time? Could you be more specific? Thanks. >>>> >>>> The comment over khugepaged_max_ptes_none gives a hint: >>>> >>>> /* >>>> * default collapse hugepages if there is at least one pte mapped like >>>> * it would have happened if the vma was large enough during page >>>> * fault. >>>> * >>>> * Note that these are only respected if collapse was initiated by khugepaged. 
>>>> */ >>>> >>>> In the common case (for anything that really cares about RSS bloat) you will just a >>>> get a THP during page fault and consequently RSS bloat. >>>> >>>> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems >>>> to be when an application later (after once possibly getting a THP already during >>>> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using >>>> MADV_COLLAPSE. >>>> >>>> It's a questionable use case, that already got more problematic with mTHP and page >>>> table reclaim. >>>> >>>> Let me explain: >>>> >>>> Before mTHP, if someone would MADV_DONTNEED (resulting in >>>> a page table with at least one pte_none entry), there would have been no way we would >>>> get memory over-allocated afterwards with max_ptes_none=0. >>>> >>>> (1) Page faults would spot "there is a page table" and just fallback to order-0 pages. >>>> (2) khugepaged was told to not collapse through max_ptes_none=0. >>>> >>>> But now: >>>> >>>> (A) With mTHP during page-faults, we can just end up over-allocating memory in such >>>> an area again: page faults will simply spot a bunch of pte_nones around the fault area >>>> and install an mTHP. >>>> >>>> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the >>>> page table. The next page fault will just try installing a PMD THP again, because there is >>>> no PTE table anymore. >>>> >>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only >>>> limited sense in taming khugepaged. I think there is vale in setting max_ptes_none=0 for some >>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense. Thanks David for your explanation. I see your point now. >>> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter >>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and >>> will break down those hugepages and free up zero-filled memory. >> >> You are not really taming page faults, though, you are undoing what page faults might have messed up :) >> >> I have seen in our prod workloads where >>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure, >>> the memory usage is lower compared to with max_ptes_none = 511, while still still keeping the benefits >>> of THPs like lower TLB misses. >> >> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split. >> > > Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other. > >> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes >> zero pages. >> >>> >>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently >>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean >>> that the memory regression of using THP=always vs THP=madvise is halved. >> >> To which value would you set it? Just 510? 0? >> > > There are some very large workloads in the meta fleet that I experimented with and found that having > a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 
51 was found to be an optimal > compromise in terms of application metrics improving, having an acceptable amount of memory regression and > improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out > there for these workloads, but not possible to experiment with every value. > > In terms of wider rollout across the fleet, we are going to target 0 (or a very very small value) > when moving from THP=madvise to always. Mainly because it is the least likely to cause a memory regression as > THP shrinker will deal with page faults faulting in mostly zero-filled pages and khugepaged won't collapse > pages that are dominated by 4K zero-filled chunks. Thanks for sharing this. We're also investigating what max_ptes_none should be set to in order to use the THP shrinker properly, and currently, our customers always set max_ptes_none to its default value: 511, which is not good. If 0 is better, it seems like there isn't much conflict with the values expected by mTHP collapse (0 and 511). Sounds good to me.
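The thrash-avoidance invariant the thread keeps returning to can be stated compactly: khugepaged collapses when the empty-PTE count is at or below max_ptes_none, and the THP shrinker only splits a PMD THP once its zero-filled subpages exceed that same value. The split-side threshold below is my reading of the current "underused" shrinker logic, so treat this as a sketch of the relationship rather than a quote of either implementation.

#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_NR 512

/* Collapse side: allowed if empty PTEs do not exceed the knob. */
static bool khugepaged_would_collapse(unsigned int none_or_zero,
				      unsigned int max_ptes_none)
{
	return none_or_zero <= max_ptes_none;
}

/*
 * Split side: considered underused once zero-filled pages exceed the
 * same knob; at 511 the shrinker effectively never splits.
 */
static bool shrinker_would_split(unsigned int zero_filled,
				 unsigned int max_ptes_none)
{
	if (max_ptes_none == HPAGE_PMD_NR - 1)
		return false;
	return zero_filled > max_ptes_none;
}

int main(void)
{
	const unsigned int max_ptes_none = 51;	/* ~10%, the value cited above */
	const unsigned int empty = 40;		/* empty PTEs before collapse */

	if (khugepaged_would_collapse(empty, max_ptes_none))
		printf("collapsed; immediate re-split: %s\n",
		       shrinker_would_split(empty, max_ptes_none) ? "yes" : "no");
	return 0;
}

Because the two checks are complements of each other, a folio khugepaged just collapsed can never immediately be judged underused, which is why scaling one check per order without the other reopens the thrashing question.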
On Fri, Aug 22, 2025 at 04:10:35PM +0200, David Hildenbrand wrote: > > > Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure > > > if we have to add that for now. > > > > Yeah not so sure about this, this is a 'just have to know' too, and yes you > > might add it to the docs, but people are going to be mightily confused, esp if > > it's a calculated value. > > > > I don't see any other way around having a separate tunable if we don't just have > > something VERY simple like on/off. > > Yeah, not advocating that we add support for other values than 0/511, > really. Yeah I'm fine with 0/511. > > > > > Also the mentioned issue sounds like something that needs to be fixed elsewhere > > honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and > > happy to stand corrected if this is somehow inherent, but reallly feels that > > way). > > I think the creep is unavoidable for certain values. > > If you have the first two pages of a PMD area populated, and you allow for > at least half of the #PTEs to be non/zero, you'd collapse first a > order-2 folio, then and order-3 ... until you reached PMD order. Feels like we should be looking at this in reverse? What's the largest, then next largest, then etc.? Surely this is the sensible way of doing it? > > So for now we really should just support 0 / 511 to say "don't collapse if > there are holes" vs. "always collapse if there is at least one pte used". Yes. > > > > > > > > > Because, as raised in the past, I'm afraid nobody on this earth has a clue how > > > to set this parameter to values different to 0 (don't waste memory with khugepaged) > > > and 511 (page fault behavior). > > > > Yup > > > > > > > > > > > If any other value is set, essentially > > > pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); > > > > > > for now and just disable it. > > > > Hmm but under what circumstances? I would just say unsupported value not mention > > mTHP or people who don't use mTHP might find that confusing. > > Well, we can check whether any mTHP size is enabled while the value is set > to something unexpected. We can then even print the problematic sizes if we > have to. Ack > > We could also just just say that if the value is set to something else than > 511 (which is the default), it will be treated as being "0" when collapsing > mthp, instead of doing any scaling. Or we could make it an error to set anything but 0, 511, but on the other hand that's likely to break userspace so yeah probably not. Maybe have a warning saying 'this is no longer supported and will be ignored' then set the value to 0 for anything but 511 or 0. Then can remove the warning later. By having 0/511 we can really simplify the 'scaling' logic too which would be fantastic! :) Cheers, Lorenzo
On 22/08/25 8:19 pm, Lorenzo Stoakes wrote: > On Fri, Aug 22, 2025 at 04:10:35PM +0200, David Hildenbrand wrote: >>>> Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure >>>> if we have to add that for now. >>> Yeah not so sure about this, this is a 'just have to know' too, and yes you >>> might add it to the docs, but people are going to be mightily confused, esp if >>> it's a calculated value. >>> >>> I don't see any other way around having a separate tunable if we don't just have >>> something VERY simple like on/off. >> Yeah, not advocating that we add support for other values than 0/511, >> really. > Yeah I'm fine with 0/511. > >>> Also the mentioned issue sounds like something that needs to be fixed elsewhere >>> honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and >>> happy to stand corrected if this is somehow inherent, but reallly feels that >>> way). >> I think the creep is unavoidable for certain values. >> >> If you have the first two pages of a PMD area populated, and you allow for >> at least half of the #PTEs to be non/zero, you'd collapse first a >> order-2 folio, then and order-3 ... until you reached PMD order. > Feels like we should be looking at this in reverse? What's the largest, then > next largest, then etc.? > > Surely this is the sensible way of doing it? What David means to say is, for example, suppose all orders are enabled, and we fail to collapse for order-9, then order-8, then order-7, and so on, *only* because the distribution of ptes did not obey the scaled max_ptes_none. Let order-4 collapse succeed. Next time, khugepaged comes and tries for order-9, fails, then order-8, fails and so on. Then it checks for order-5, and it comes under the scaled max_ptes_none constraint only because the previous cycle's order-4 collapse changed the ptes' distribution. > >> So for now we really should just support 0 / 511 to say "don't collapse if >> there are holes" vs. "always collapse if there is at least one pte used". > Yes. > >>>> Because, as raised in the past, I'm afraid nobody on this earth has a clue how >>>> to set this parameter to values different to 0 (don't waste memory with khugepaged) >>>> and 511 (page fault behavior). >>> Yup >>> >>>> >>>> If any other value is set, essentially >>>> pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); >>>> >>>> for now and just disable it. >>> Hmm but under what circumstances? I would just say unsupported value not mention >>> mTHP or people who don't use mTHP might find that confusing. >> Well, we can check whether any mTHP size is enabled while the value is set >> to something unexpected. We can then even print the problematic sizes if we >> have to. > Ack > >> We could also just just say that if the value is set to something else than >> 511 (which is the default), it will be treated as being "0" when collapsing >> mthp, instead of doing any scaling. > Or we could make it an error to set anything but 0, 511, but on the other hand > that's likely to break userspace so yeah probably not. > > Maybe have a warning saying 'this is no longer supported and will be ignored' > then set the value to 0 for anything but 511 or 0. > > Then can remove the warning later. > > By having 0/511 we can really simplify the 'scaling' logic too which would be > fantastic! :) FWIW here was my implementation of this thing, for ease of everyone: https://lore.kernel.org/all/20250211111326.14295-17-dev.jain@arm.com/ > > Cheers, Lorenzo
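Dev's description of the creep can be reproduced with a few lines of arithmetic. The sketch below assumes the shift-based per-order scaling discussed earlier and walks orders from low to high purely to mimic successive khugepaged passes (each real pass tries the highest enabled order first; the promotion happens across passes), so it is an illustration of the effect, not the collapse algorithm itself.

#include <stdio.h>

#define HPAGE_PMD_ORDER 9

static unsigned int scaled(unsigned int max_ptes_none, unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	const unsigned int caps[] = { 256, 255 };

	for (int c = 0; c < 2; c++) {
		unsigned int max_ptes_none = caps[c];
		unsigned int present = 2;	/* two faulted-in base pages */

		printf("max_ptes_none=%u:\n", max_ptes_none);
		for (unsigned int order = 2; order <= HPAGE_PMD_ORDER; order++) {
			unsigned int pages = 1u << order;
			unsigned int none = pages - present;

			if (none > scaled(max_ptes_none, order)) {
				printf("  order %u: %u empty > %u allowed, stop\n",
				       order, none, scaled(max_ptes_none, order));
				break;
			}
			printf("  order %u: collapse (%u empty <= %u allowed)\n",
			       order, none, scaled(max_ptes_none, order));
			present = pages;	/* the new folio now fills the range */
		}
	}
	return 0;
}

With 256 (half the PTEs allowed to be empty) the two populated pages creep all the way up to a PMD; with 255 the very first promotion is already refused, which is why the >255 boundary keeps coming up in this thread.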
On Fri, Aug 22, 2025 at 09:03:41PM +0530, Dev Jain wrote: > > On 22/08/25 8:19 pm, Lorenzo Stoakes wrote: > > On Fri, Aug 22, 2025 at 04:10:35PM +0200, David Hildenbrand wrote: > > > > > Once could also easily support the value 255 (HPAGE_PMD_NR / 2- 1), but not sure > > > > > if we have to add that for now. > > > > Yeah not so sure about this, this is a 'just have to know' too, and yes you > > > > might add it to the docs, but people are going to be mightily confused, esp if > > > > it's a calculated value. > > > > > > > > I don't see any other way around having a separate tunable if we don't just have > > > > something VERY simple like on/off. > > > Yeah, not advocating that we add support for other values than 0/511, > > > really. > > Yeah I'm fine with 0/511. > > > > > > Also the mentioned issue sounds like something that needs to be fixed elsewhere > > > > honestly in the algorithm used to figure out mTHP ranges (I may be wrong - and > > > > happy to stand corrected if this is somehow inherent, but reallly feels that > > > > way). > > > I think the creep is unavoidable for certain values. > > > > > > If you have the first two pages of a PMD area populated, and you allow for > > > at least half of the #PTEs to be non/zero, you'd collapse first a > > > order-2 folio, then and order-3 ... until you reached PMD order. > > Feels like we should be looking at this in reverse? What's the largest, then > > next largest, then etc.? > > > > Surely this is the sensible way of doing it? > > What David means to say is, for example, suppose all orders are enabled, > and we fail to collapse for order-9, then order-8, then order-7, and so on, > *only* because the distribution of ptes did not obey the scaled max_ptes_none. > Let order-4 collapse succeed. Ah so it is the overhead of this that's the problem? All roads lead to David's suggestion imo. > > By having 0/511 we can really simplify the 'scaling' logic too which would be > > fantastic! :) > > FWIW here was my implementation of this thing, for ease of everyone: > https://lore.kernel.org/all/20250211111326.14295-17-dev.jain@arm.com/ That's fine, but I really think we should just replace all this stuff with a boolean, and change the interface to max_ptes to set boolean if 511, or clear if 0. Cheers, Lorenzo
On 21.08.25 18:54, Lorenzo Stoakes wrote: > On Thu, Aug 21, 2025 at 10:46:18AM -0600, Nico Pache wrote: >>>>>> Thanks and I"ll have a look, but this series is unmergeable with a broken >>>>>> default in >>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio >>>>>> sorry. >>>>>> >>>>>> We need to have a new tunable as far as I can tell. I also find the use of >>>>>> this PMD-specific value as an arbitrary way of expressing a ratio pretty >>>>>> gross. >>>>> The first thing that comes to mind is that we can pin max_ptes_none to >>>>> 255 if it exceeds 255. It's worth noting that the issue occurs only >>>>> for adjacently enabled mTHP sizes. >>> >>> No! Presumably the default of 511 (for PMDs with 512 entries) is set for a >>> reason, arbitrarily changing this to suit a specific case seems crazy no? >> We wouldn't be changing it for PMD collapse, just for the new >> behavior. At 511, no mTHP collapses would ever occur anyways, unless >> you have 2MB disabled and other mTHP sizes enabled. Technically at 511 >> only the highest enabled order always gets collapsed. >> >> Ive also argued in the past that 511 is a terrible default for >> anything other than thp.enabled=always, but that's a whole other can >> of worms we dont need to discuss now. >> >> with this cap of 255, the PMD scan/collapse would work as intended, >> then in mTHP collapses we would never introduce this undesired >> behavior. We've discussed before that this would be a hard problem to >> solve without introducing some expensive way of tracking what has >> already been through a collapse, and that doesnt even consider what >> happens if things change or are unmapped, and rescanning that section >> would be helpful. So having a strictly enforced limit of 255 actually >> seems like a good idea to me, as it completely avoids the undesired >> behavior and does not require the admins to be aware of such an issue. >> >> Another thought similar to what (IIRC) Dev has mentioned before, if we >> have max_ptes_none > 255 then we only consider collapses to the >> largest enabled order, that way no creep to the largest enabled order >> would occur in the first place, and we would get there straight away. >> >> To me one of these two solutions seem sane in the context of what we >> are dealing with. >>> >>>>> >>>>> ie) >>>>> if order!=HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255 >>>>> temp_max_ptes_none = 255; >>>> Oh and my second point, introducing a new tunable to control mTHP >>>> collapse may become exceedingly complex from a tuning and code >>>> management standpoint. >>> >>> Umm right now you hve a ratio expressed in PTES per mTHP * ((PTEs per PMD) / >>> PMD) 'except please don't set to the usual default when using mTHP' and it's >>> currently default-broken. >>> >>> I'm really not sure how that is simpler than a seprate tunable that can be >>> expressed as a ratio (e.g. percentage) that actually makes some kind of sense? >> I agree that the current tunable wasn't designed for this, but we >> tried to come up with something that leverages the tunable we have to >> avoid new tunables and added complexity. >>> >>> And we can make anything workable from a code management point of view by >>> refactoring/developing appropriately. >> What happens if max_ptes_none = 0 and the ratio is 50% - 1 pte >> (ideally the max number)? seems like we would be saying we want no new >> none pages, but also to allow new none pages. 
To me that seems equally >> broken and more confusing than just taking a scale of the current >> number (now with a cap). >> >> > > The one thing we absolutely cannot have is a default that causes this > 'creeping' behaviour. This feels like shipping something that is broken and > alluding to it in the documentation. > > I spoke to David off-list and he gave some insight into this and perhaps > some reasonable means of avoiding an additional tunable. > > I don't want to rehash what he said as I think it's more productive for him > to reply when he has time but broadly I think how we handle this needs > careful consideration. > > To me it's clear that some sense of ratio is just immediately very very > confusing, but then again this interface is already confusing, as with much > of THP. > > Anyway I'll let David respond here so we don't loop around before he has a > chance to add his input. I've been summoned. As raised in the past, I would initially only support specific values here, like 0 : Never collapse with any pte_none/zeropage; 511 (HPAGE_PMD_NR - 1) / default : Always collapse, ignoring pte_none/zeropage. One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1), but not sure if we have to add that for now. Because, as raised in the past, I'm afraid nobody on this earth has a clue how to set this parameter to values different from 0 (don't waste memory with khugepaged) and 511 (page fault behavior). If any other value is set, essentially pr_warn("Unsupported 'max_ptes_none' value for mTHP collapse"); for now and just disable it. -- Cheers David / dhildenb
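A standalone sketch of the policy proposed here; the helper name and the placement of the warning are assumptions, only the 0/511 semantics and the pr_warn idea come from the mail above.

#include <stdio.h>

#define HPAGE_PMD_NR 512

/*
 * Hypothetical policy: when collapsing to mTHP orders, honour only 0 and
 * HPAGE_PMD_NR - 1; warn about anything else and fall back to 0
 * ("don't fill holes"), leaving PMD-order collapse behaviour untouched.
 */
static unsigned int mthp_effective_max_ptes_none(unsigned int max_ptes_none,
						 int mthp_enabled)
{
	if (!mthp_enabled)
		return max_ptes_none;	/* legacy PMD-only behaviour */

	if (max_ptes_none == 0 || max_ptes_none == HPAGE_PMD_NR - 1)
		return max_ptes_none;

	fprintf(stderr, "Unsupported 'max_ptes_none' value %u for mTHP collapse, treating as 0\n",
		max_ptes_none);
	return 0;
}

int main(void)
{
	printf("511 -> %u\n", mthp_effective_max_ptes_none(511, 1));
	printf("256 -> %u\n", mthp_effective_max_ptes_none(256, 1));
	printf("0   -> %u\n", mthp_effective_max_ptes_none(0, 1));
	return 0;
}

Whether such a warning should fire at write time (when an mTHP size is enabled) or at collapse time is exactly the open question in the exchange above.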