The following series provides khugepaged with the capability to collapse
anonymous memory regions to mTHPs.
To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
pages that are occupied (!none/zero). After the PMD scan is done, we use
the bitmap to find the optimal mTHP sizes for the PMD range. The
restriction on max_ptes_none is removed during the scan, to make sure we
account for the whole PMD range in the bitmap. When no mTHP size is
enabled, the legacy behavior of khugepaged is maintained.
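As a rough illustration of the scan-then-select flow described above, the following user-space C model tracks occupied PTEs in a 512-bit bitmap and picks the largest enabled order that fits at a given offset. It is only a sketch under stated assumptions: `struct pte_bitmap`, `bitmap_set_pte()`, `bitmap_count()` and `best_order_at()` are hypothetical names rather than functions from mm/khugepaged.c, 4K base pages (PMD order 9) are assumed, and the series itself may walk the bitmap quite differently. The threshold follows the rule this cover letter gives for mTHP collapse (only max_ptes_none values of 0 or 511 are honored).

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PMD_ORDER 9			/* assumption: 4K base pages */
#define MODEL_PMD_NR    (1u << MODEL_PMD_ORDER)	/* 512 PTEs per PMD */
#define MODEL_MIN_ORDER 2			/* assumption: smallest anon mTHP order */

/* One bit per PTE in the PMD range; a set bit means "occupied" (!none/zero). */
struct pte_bitmap {
	uint64_t bits[MODEL_PMD_NR / 64];
};

static void bitmap_set_pte(struct pte_bitmap *bm, unsigned int idx)
{
	assert(idx < MODEL_PMD_NR);
	bm->bits[idx / 64] |= UINT64_C(1) << (idx % 64);
}

/* Count occupied PTEs in [start, start + nr). */
static unsigned int bitmap_count(const struct pte_bitmap *bm,
				 unsigned int start, unsigned int nr)
{
	unsigned int i, count = 0;

	assert(start + nr <= MODEL_PMD_NR);
	for (i = start; i < start + nr; i++)
		count += (unsigned int)((bm->bits[i / 64] >> (i % 64)) & 1);
	return count;
}

/*
 * Return the largest enabled order that can be collapsed at @offset,
 * or -1 if no enabled order qualifies.
 */
static int best_order_at(const struct pte_bitmap *bm, unsigned int offset,
			 unsigned long enabled_orders,
			 unsigned int max_ptes_none)
{
	int order;

	assert(offset < MODEL_PMD_NR);
	for (order = MODEL_PMD_ORDER; order >= MODEL_MIN_ORDER; order--) {
		unsigned int nr = 1u << order;
		unsigned int max_none, none;

		if (!(enabled_orders & (1ul << order)))
			continue;	/* this order is not enabled */
		if (offset & (nr - 1))
			continue;	/* region would not be naturally aligned */

		/* Only 0 and 511 are honored; anything else is treated as 0. */
		max_none = (max_ptes_none == MODEL_PMD_NR - 1) ? nr - 1 : 0;
		none = nr - bitmap_count(bm, offset, nr);
		if (none <= max_none)
			return order;
	}
	return -1;
}
```

With the first 16 PTEs occupied and orders 9, 4 and 2 enabled, this model selects order 4 at offset 0 when max_ptes_none is 0, and order 9 when max_ptes_none is 511.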
We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
(i.e. 511). If any other value is specified, the kernel will emit a warning
and mTHP collapse will default to max_ptes_none=0. If an mTHP collapse is
attempted but the range contains swapped-out or shared pages, we don't
perform the collapse.
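The fallback rule above could be modeled roughly as follows. This is a user-space sketch, not the series' collapse_max_ptes_none(): `model_max_ptes_none()` is a hypothetical name, the `(1 << order) - 1` scaling for the 511 case and the "PMD order uses the tunable unchanged" branch are assumptions, and the `fprintf()` merely stands in for the kernel warning.

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_PMD_ORDER 9			/* assumption: 4K base pages */
#define MODEL_PMD_NR    (1u << MODEL_PMD_ORDER)

/*
 * Return the number of none/zero PTEs tolerated when collapsing to @order,
 * given the configured max_ptes_none tunable.
 */
static unsigned int model_max_ptes_none(unsigned int configured,
					unsigned int order)
{
	assert(order <= MODEL_PMD_ORDER);

	/* Assumption: PMD-order collapse keeps the legacy behavior. */
	if (order == MODEL_PMD_ORDER)
		return configured;

	if (configured == 0)
		return 0;
	if (configured == MODEL_PMD_NR - 1)
		return (1u << order) - 1;	/* assumption: "all but one" scales per order */

	/* Unsupported value: warn and fall back to 0 for mTHP collapse. */
	fprintf(stderr,
		"model: max_ptes_none=%u unsupported for mTHP collapse, using 0\n",
		configured);
	return 0;
}
```

Returning an unsigned value with a fallback of 0 mirrors the V18 change log entry that replaced the -EINVAL return, so callers in this model need no negative-value check.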
It is now also possible to collapse to mTHPs without requiring the PMD THP
size to be enabled. The limitations above are there to prevent collapse
"creep" behavior, i.e. constantly promoting mTHPs to the next available
size, which would occur because a collapse introduces more non-zero pages
that would satisfy the promotion condition on subsequent scans.
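To make the "creep" effect concrete, here is a small user-space simulation. It is a hypothetical model, not kernel code: `simulate_creep()` is an invented name, the occupied pages are assumed to sit contiguously at the start of the PMD range, and the per-order threshold is assumed to be max_ptes_none shifted right by (PMD order - order), which is one plausible way an intermediate value could be scaled.

```c
#include <assert.h>

#define MODEL_PMD_ORDER 9			/* assumption: 4K base pages */
#define MODEL_PMD_NR    (1u << MODEL_PMD_ORDER)
#define MODEL_MIN_ORDER 2			/* assumption: smallest anon mTHP order */

/*
 * Repeatedly "scan" a PMD range whose first @occupied pages are populated.
 * Each scan collapses to the largest order whose none-PTE count fits the
 * scaled threshold. Returns the number of collapses performed and stores
 * the final populated page count in *final.
 */
static unsigned int simulate_creep(unsigned int occupied,
				   unsigned int max_ptes_none,
				   unsigned int *final)
{
	unsigned int collapses = 0;
	int collapsed;

	assert(occupied >= 1 && occupied <= MODEL_PMD_NR);
	do {
		int order;

		collapsed = 0;
		for (order = MODEL_PMD_ORDER; order >= MODEL_MIN_ORDER; order--) {
			unsigned int nr = 1u << order;
			unsigned int max_none =
				max_ptes_none >> (MODEL_PMD_ORDER - order);

			if (nr <= occupied)
				break;		/* nothing larger left to promote to */
			if (nr - occupied <= max_none) {
				occupied = nr;	/* the collapse fills the holes */
				collapses++;
				collapsed = 1;
				break;
			}
		}
	} while (collapsed);

	*final = occupied;
	return collapses;
}
```

In this model, starting from 8 populated pages, max_ptes_none=256 produces six successive collapses (16, 32, 64, 128, 256, then 512 pages), which is the creep the restriction avoids. A value of 0 never grows the region, and 511 goes to the PMD size in a single step.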
Patch 1-2: Generalize hugepage_vma_revalidate and alloc_charge_folio
for arbitrary orders.
Patch 3: Rework max_ptes_* handling into helper functions
Patch 4: Generalize __collapse_huge_page_* for mTHP support
Patch 5: Require collapse_huge_page to enter/exit with the lock dropped
Patch 6: Generalize collapse_huge_page for mTHP collapse
Patch 7: Skip collapsing mTHP to smaller orders
Patch 8-9: Add per-order mTHP statistics and tracepoints
Patch 10: Introduce collapse_allowable_orders helper function
Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
Patch 14: Documentation
Testing:
- Built for x86_64, aarch64, ppc64le, and s390x
- ran all arches on test suites provided by the kernel-tests project
- internal testing suites: functional testing and performance testing
- selftests mm
- I created a test script that I used to push khugepaged to its limits
while monitoring a number of stats and tracepoints. The code is
available here[1] (Run in legacy mode for these changes and set mthp
sizes to inherit)
The summary from my testing was that no significant regression was
noticed through this test. In some cases my changes had better collapse
latencies and were able to scan more pages in the same amount of
time/work, but for the most part the results were consistent.
- redis testing. I did some testing with these changes along with my defer
changes (see followup [2] post for more details). We've decided to get
the mTHP changes merged first before attempting the defer series.
- some basic testing on 64k page size.
- lots of general use.
[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/
V18 Changes:
- Added RBs/Acks
- [patch 02] Guard count_memcg_folio_events with is_pmd_order() to keep
THP_COLLAPSE_ALLOC PMD-only (Usama, Lance)
- [patch 03] Convert C++ comments to C-style; fix "none-page" terminology
to "empty PTEs or PTEs mapping the shared zeropage"; drop unnecessary
userfaultfd comment; add const to local max_ptes_* variables; fix
"repect" typo (Lance, David)
- [patch 04] collapse_max_ptes_none() now returns 0 instead of -EINVAL for
unsupported values; remove SCAN_INVALID_PTES_NONE; change return type
from int to unsigned int and propagate to all callers; add comment above
__collapse_huge_page_swapin explaining mTHP swap bail-out (David,
Lorenzo, Lance, Wei Yang, Usama)
- [patch 05] Rewrite collapse_huge_page lock comment to David's suggested
wording (David)
- [patch 11] Propagate unsigned int return type for max_ptes_none; remove
the now-unnecessary negative return check (consequence of patch 04);
Add optimization to the next_order goto that will prevent unnecessary
iterations if there are no lower orders enabled (Vernon); update locking
comment; pass VMA to mthp_collapse to improve uffd-armed detection, and
prevent unnecessary work. (Wei)
- [patch 14] Update documentation to reflect fallback-to-0 behavior
V17: https://lore.kernel.org/all/20260511185817.686831-1-npache@redhat.com
V16: https://lore.kernel.org/all/20260419185750.260784-1-npache@redhat.com
V15: https://lore.kernel.org/all/20260226031741.230674-1-npache@redhat.com
V14: https://lore.kernel.org/all/20260122192841.128719-1-npache@redhat.com
V13: https://lore.kernel.org/all/20251201174627.23295-1-npache@redhat.com
V12: https://lore.kernel.org/all/20251022183717.70829-1-npache@redhat.com
V11: https://lore.kernel.org/all/20250912032810.197475-1-npache@redhat.com
V10: https://lore.kernel.org/all/20250819134205.622806-1-npache@redhat.com
V9 : https://lore.kernel.org/all/20250714003207.113275-1-npache@redhat.com
V8 : https://lore.kernel.org/all/20250702055742.102808-1-npache@redhat.com
V7 : https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com
V6 : https://lore.kernel.org/all/20250515030312.125567-1-npache@redhat.com
V5 : https://lore.kernel.org/all/20250428181218.85925-1-npache@redhat.com
V4 : https://lore.kernel.org/all/20250417000238.74567-1-npache@redhat.com
V3 : https://lore.kernel.org/all/20250414220557.35388-1-npache@redhat.com
V2 : https://lore.kernel.org/all/20250211003028.213461-1-npache@redhat.com
V1 : https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com
Baolin Wang (1):
mm/khugepaged: run khugepaged for all orders
Dev Jain (1):
mm/khugepaged: generalize alloc_charge_folio()
Nico Pache (12):
mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support
mm/khugepaged: rework max_ptes_* handling with helper functions
mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
mm/khugepaged: require collapse_huge_page to enter/exit with the lock
dropped
mm/khugepaged: generalize collapse_huge_page for mTHP collapse
mm/khugepaged: skip collapsing mTHP to smaller orders
mm/khugepaged: add per-order mTHP collapse failure statistics
mm/khugepaged: improve tracepoints for mTHP orders
mm/khugepaged: introduce collapse_allowable_orders helper function
mm/khugepaged: Introduce mTHP collapse support
mm/khugepaged: avoid unnecessary mTHP collapse attempts
Documentation: mm: update the admin guide for mTHP collapse
Documentation/admin-guide/mm/transhuge.rst | 72 ++-
include/linux/huge_mm.h | 5 +
include/trace/events/huge_memory.h | 34 +-
mm/huge_memory.c | 11 +
mm/khugepaged.c | 634 ++++++++++++++++-----
5 files changed, 584 insertions(+), 172 deletions(-)
base-commit: 6c8cb505a5634594b3ea159fd1c71bce2acf3346
--
2.54.0
NAK.

This is not a hotfixes candidate, Nico. Don't send a massive series like
this with that tag please.

Thanks, Lorenzo
On Fri, May 22, 2026 at 8:59 AM Nico Pache <npache@redhat.com> wrote:
> Documentation/admin-guide/mm/transhuge.rst | 72 ++-
> include/linux/huge_mm.h | 5 +
> include/trace/events/huge_memory.h | 34 +-
> mm/huge_memory.c | 11 +
> mm/khugepaged.c | 634 ++++++++++++++++-----
> 5 files changed, 584 insertions(+), 172 deletions(-)
>
>
> base-commit: 6c8cb505a5634594b3ea159fd1c71bce2acf3346

Whoops, I manually changed the cover letter subject to reflect that this
is on mm-hotfixes-unstable but never updated the others...

Hopefully that is ok. Just a small mistake. Base commit is referenced here.

-- Nico

> --
> 2.54.0
>
On Fri, May 22, 2026 at 09:07:29AM -0600, Nico Pache wrote:
> Whoops, I manually changed the cover letter subject to reflect that this
> is on mm-hotfixes-unstable but never updated the others...
>
> Hopefully that is ok. Just a small mistake. Base commit is referenced here.

It's not ok, this isn't suitable for a hotfix in any way, shape or form.

As you know, because we told you :) May has been difficult because of
conferences, holidays (and in my case burnout recovery).

And unfortunately the series seems to have needed quite a bit of review again
(my suggestion to you would be to ensure you don't make major changes, only
small incremental ones on the basis of review feedback).

So this isn't viable for 7.2, and we'll have to target 7.3. Therefore there
was no rush.

Also please don't spring a respin on this series on us without discussion
first, with people away and (frankly) the amount of work involved here,
you're going to have to accept the pace that workload/availability permits.

Adding spurious hotfixes tags doesn't help anything :) please don't do that
again.

Thanks, Lorenzo
On Fri, May 22, 2026 at 9:17 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 09:07:29AM -0600, Nico Pache wrote:
> > Whoops, I manually changed the cover letter subject to reflect that this
> > is on mm-hotfixes-unstable but never updated the others...
> >
> > Hopefully that is ok. Just a small mistake. Base commit is referenced here.
>
> It's not ok, this isn't suitable for a hotfix in any way, shape or form.
>
> As you know, because we told you :) May has been difficult because of
> conferences, holidays (and in my case burnout recovery).
>
> And unfortunately the series seems to have needed quite a bit of review again
> (my suggestion to you would be to ensure you don't make major changes, only
> small incremental ones on the basis of review feedback).
>
> So this isn't viable for 7.2, and we'll have to target 7.3. Therefore there
> was no rush.
>
> Also please don't spring a respin on this series on us without discussion
> first, with people away and (frankly) the amount of work involved here,
> you're going to have to accept the pace that workload/availability permits.
>
> Adding spurious hotfixes tags doesn't help anything :) please don't do that
> again.

Hi,

Sorry for the confusion but Andrew and I spoke about this before I
sent it, and he confirmed that I should send it against this tree to
prevent merge conflicts.

Because Zi's series depends on this, and this is already in the mm
tree, choosing a candidate before my commits was best to prevent merge
conflicts.

The intent wasn't that this is a hotfix, just that this was the
closest base before the v17 that is already in the tree.

Sorry for the confusion, hopefully Andrew can still apply it to the
correct tree.

-- Nico

>
> Thanks, Lorenzo
>
On Fri, May 22, 2026 at 10:08:19AM -0600, Nico Pache wrote:
> Hi,
>
> Sorry for the confusion but Andrew and I spoke about this before I
> sent it, and he confirmed that I should send it against this tree to
> prevent merge conflicts.
>
> Because Zi's series depends on this, and this is already in the mm
> tree, choosing a candidate before my commits was best to prevent merge
> conflicts.

There's some kind of confusion here.

This series isn't suited for 7.2.

Sorry but Zi's series, unless it depends on functionality here, will have
to be rebased.

People have been at conferences, people have been on leave, I've had to
pace myself for health reasons and it seems there's been more than simply
review comment-based changes happening here.

(Again I strongly encourage, at this stage, to ONLY be making changes based
on review, not adding ANYTHING else or changing ANYTHING else to avoid
delays :)

Also - shouldn't mm-unstable already have mm-hotfixes-unstable in it?

I think in mm-next we will have a stable branch, that everything is
based on, where things go once review is complete and things are mergeable.

And a separate hotfixes branch based on Linus's tree.

That would avoid issues like this :)

> The intent wasn't that this is a hotfix, just that this was the
> closest base before the v17 that is already in the tree.

The convention is that [PATCH ... <branch>] indicates the target of the
changes. Putting the hotfixes branch there implies it's a hotfix.

So please be careful with that in future :)

> Sorry for the confusion, hopefully Andrew can still apply it to the
> correct tree.

I'm not even sure what's best for that at this stage given we have
conflicts and this has to be delayed until 7.3.

I wonder if given that we should not have this in mm-unstable at all and
just wait it out until the next cycle begins? Review can happen
concurrently.

Thanks, Lorenzo
On Fri, May 22, 2026 at 10:20 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> There's some kind of confusion here.
>
> This series isn't suited for 7.2.
>
> Sorry but Zi's series, unless it depends on functionality here, will have
> to be rebased.
>
> People have been at conferences, people have been on leave, I've had to
> pace myself for health reasons and it seems there's been more than simply
> review comment-based changes happening here.
>
> (Again I strongly encourage, at this stage, to ONLY be making changes based
> on review, not adding ANYTHING else or changing ANYTHING else to avoid
> delays :)

All the changes are based on review points. Very small changes in this
version; the largest being the one that you specifically agreed to.

> Also - shouldn't mm-unstable already have mm-hotfixes-unstable in it?
>
> I think in mm-next we will have a stable branch, that everything is
> based on, where things go once review is complete and things are mergeable.
>
> And a separate hotfixes branch based on Linus's tree.
>
> That would avoid issues like this :)

I'm sorry, I'm new to this, but I really don't think this tiny error, and
something that I'd confirmed with Andrew beforehand, deserves NAKing
and deferring it. I've worked through my PTO to clean up some of these
review nits just to get it in 7.2. I even ran this through my
rounds of testing today before resending.

> > The intent wasn't that this is a hotfix, just that this was the
> > closest base before the v17 that is already in the tree.
>
> The convention is that [PATCH ... <branch>] indicates the target of the
> changes. Putting the hotfixes branch there implies it's a hotfix.

Sorry, I thought the <branch> was what base you used.

> So please be careful with that in future :)

Yes, will do for sure.

> > Sorry for the confusion, hopefully Andrew can still apply it to the
> > correct tree.
>
> I'm not even sure what's best for that at this stage given we have
> conflicts and this has to be delayed until 7.3.
>
> I wonder if given that we should not have this in mm-unstable at all and
> just wait it out until the next cycle begins? Review can happen
> concurrently.

I still don't see why this has to be deferred, I was working with
Andrew to prevent merge headaches.

-- Nico
On Fri, May 22, 2026 at 10:31:41AM -0600, Nico Pache wrote:
> On Fri, May 22, 2026 at 10:20 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > There's some kind of confusion here.
> >
> > This series isn't suited for 7.2.
> >
> > Sorry but Zi's series, unless it depends on functionality here, will have
> > to be rebased.
> >
> > People have been at conferences, people have been on leave, I've had to
> > pace myself for health reasons and it seems there's been more than simply
> > review comment-based changes happening here.
> >
> > (Again I strongly encourage, at this stage, to ONLY be making changes based
> > on review, not adding ANYTHING else or changing ANYTHING else to avoid
> > delays :)
>
> All the changes are based on review points. Very small changes in this
> version; the largest being the one that you specifically agreed to.

16->17

 Documentation/admin-guide/mm/transhuge.rst | 24 +++++-------------
 include/linux/khugepaged.h | 7 ++---
 include/trace/events/huge_memory.h | 3 ++-
 mm/huge_memory.c | 2 +-
 mm/khugepaged.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------------------------
 mm/vma.c | 6 ++---
 tools/testing/vma/include/stubs.h | 3 ++-
 7 files changed, 103 insertions(+), 110 deletions(-)

17->18

 Documentation/admin-guide/mm/transhuge.rst | 5 +++--
 include/trace/events/huge_memory.h | 3 +--
 mm/khugepaged.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------
 3 files changed, 66 insertions(+), 63 deletions(-)

These are not 'very small changes'.

We're nearly at rc-5, and this is a major, invasive, dangerous change that
we have to get right.

You've also made changes unrelated to review, repeatedly, throughout this
process, which as I've told you, is causing delays.

You've also throughout the review of this series done stuff like make MAJOR
changes to things and _kept review tags_.

You're forcing us to use git range-diff etc. to forensically check that the
series is what is claimed.

Dude I mean you switched to using // comment style which is not used in mm
anywhere for instance?

Don't do things like that and complain about delays. Honestly.

Also, again, LSF happened. Other conferences happened. Bandwidth is
reduced.

So again, I'm sorry, but you've been hit with some bad luck here. I really
wanted this in for 7.2, and I feel bad that we couldn't make it, but you're
also doing things that are making it difficult for us.

I've spent double-digit hours on your series, and I've also had work
pushed out because of that, leading me to work evenings and weekends as a
result. And I'm not even going to get any credit for it :))

So while I sympathise, really, please have empathy and realise it goes both
ways, please.

I'm not being mean for the sake of it, I'm pushing back because I feel this
is not at a stage where I'd feel confident in this being merged at this
time.

And it's very much a regret, as I _really_ wanted us to have it in this
time. But life and circumstances and the issues mentioned above have
intervened, sadly.

> > Also - shouldn't mm-unstable already have mm-hotfixes-unstable in it?
> >
> > I think in mm-next we will have a stable branch, that everything is
> > based on, where things go once review is complete and things are mergeable.
> >
> > And a separate hotfixes branch based on Linus's tree.
> >
> > That would avoid issues like this :)
>
> I'm sorry, I'm new to this, but I really don't think this tiny error, and
> something that I'd confirmed with Andrew beforehand, deserves NAKing
> and deferring it. I've worked through my PTO to clean up some of these
> review nits just to get it in 7.2. I even ran this through my
> rounds of testing today before resending.

The issue wasn't the error (though it wasn't tiny...!), it's the state of
review. There were fresh review comments from a few days ago, and there are
big diffs between revisions.

You've also made unrelated changes as you have done throughout the series.

As I said above, I'm sorry that you spent time in your PTO on this, but we
cannot rush this in when things are not clearly ready yet, and I am not
confident in this being ready at this stage.

> > > The intent wasn't that this is a hotfix, just that this was the
> > > closest base before the v17 that is already in the tree.
> >
> > The convention is that [PATCH ... <branch>] indicates the target of the
> > changes. Putting the hotfixes branch there implies it's a hotfix.
>
> Sorry, I thought the <branch> was what base you used.

I mean, sure there's clearly confusion here as you sent [PATCH 7.2 v16 ...]
(against an unreleased kernel version) then a branch specifier then the
hotfixes one...

Anyway sure, it's fine, I've made vastly more dumb mistakes than that
myself, nobody minds, but it's concerning as by convention
[PATCH ... <mm->hotfixes<whatever>] generally is taken to mean 'please rush
this to hotfixes!' :)

So be careful with that please!

> > So please be careful with that in future :)
>
> Yes, will do for sure.

Thanks!

> > > Sorry for the confusion, hopefully Andrew can still apply it to the
> > > correct tree.
> >
> > I'm not even sure what's best for that at this stage given we have
> > conflicts and this has to be delayed until 7.3.
> >
> > I wonder if given that we should not have this in mm-unstable at all and
> > just wait it out until the next cycle begins? Review can happen
> > concurrently.
>
> I still don't see why this has to be deferred, I was working with
> Andrew to prevent merge headaches.

I've explained the why above, and David and I co-maintain THP so I feel
that ultimately given the blood, sweat and tears we've put into THP review
we ought to have some input on this :)

Thanks, Lorenzo
On 5/22/26 17:07, Nico Pache wrote:
> On Fri, May 22, 2026 at 8:59 AM Nico Pache <npache@redhat.com> wrote:
>> include/trace/events/huge_memory.h | 34 +-
>> mm/huge_memory.c | 11 +
>> mm/khugepaged.c | 634 ++++++++++++++++-----
>> 5 files changed, 584 insertions(+), 172 deletions(-)
>>
>>
>> base-commit: 6c8cb505a5634594b3ea159fd1c71bce2acf3346
>
> Whoops, I manually changed the cover letter subject to reflect that this
> is on mm-hotfixes-unstable but never updated the others...

But why? That branch is for hotfixes that would go to the current 7.1-rcX
series. mm-unstable would be the correct one for this, AFAICT.

> Hopefully that is ok. Just a small mistake. Base commit is referenced here.
>
> -- Nico
>
>
>> --
>> 2.54.0
>>
>
On Fri, May 22, 2026 at 9:13 AM Vlastimil Babka (SUSE) <vbabka@kernel.org> wrote:
>
> On 5/22/26 17:07, Nico Pache wrote:
> > On Fri, May 22, 2026 at 8:59 AM Nico Pache <npache@redhat.com> wrote:
> >> include/trace/events/huge_memory.h | 34 +-
> >> mm/huge_memory.c | 11 +
> >> mm/khugepaged.c | 634 ++++++++++++++++-----
> >> 5 files changed, 584 insertions(+), 172 deletions(-)
> >>
> >>
> >> base-commit: 6c8cb505a5634594b3ea159fd1c71bce2acf3346
> >
> > Whoops, I manually changed the cover letter subject to reflect that this
> > is on mm-hotfixes-unstable but never updated the others...
>
> But why? That branch is for hotfixes that would go to the current 7.1-rcX
> series. mm-unstable would be the correct one for this, AFAICT.

Sorry, this was a misunderstanding. The goal here was to base this off
the closest base commit behind where my v17 already lies in the tree.

That just happened to be the hotfixes tree (previously it was
mm-unstable, but that seems to have moved).

Sorry...

-- Nico

>
> > Hopefully that is ok. Just a small mistake. Base commit is referenced here.
> >
> > -- Nico
> >
> >
> >> --
> >> 2.54.0
> >>
> >
>
On 5/22/26 18:11, Nico Pache wrote:
> On Fri, May 22, 2026 at 9:13 AM Vlastimil Babka (SUSE)
> <vbabka@kernel.org> wrote:
>>
>> On 5/22/26 17:07, Nico Pache wrote:
>>>
>>> Whoops I manually changed the coverletter subject to reflect that this
>>> in on mm-hotfixes-unstable but never updated the others...
>>
>> But why? That branch is for hotfixes that would go to the current 7.1-rcX
>> series. mm-unstable would be the correct one for this, AFAICT.
>
> Sorry this was a misunderstanding. The goal here was to base this off
> the closest base commit behind where my v17 already lies in the tree.

Ah, I guess this is a problem of "v17 is already in mm-unstable, so against
what to base v18".

Yeah, we touched on that problem in the LSF/MM process discussion ...

--
Cheers,

David
On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.
Thanks, I've updated mm.git's mm-unstable branch to this version.
It sounds like I might be dropping it soon, haven't started looking at
that yet. But let's at least eyeball the latest version at this time.
Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
well, thanks. The AI checking made a few allegations:
https://sashiko.dev/#/patchset/20260522150009.121603-1-npache@redhat.com
> V18 Changes:
> - Added RBs/Acks
> - [patch 02] Guard count_memcg_folio_events with is_pmd_order() to keep
> THP_COLLAPSE_ALLOC PMD-only (Usama, Lance)
> - [patch 03] Convert C++ comments to C-style; fix "none-page" terminology
> to "empty PTEs or PTEs mapping the shared zeropage"; drop unnecessary
> userfaultfd comment; add const to local max_ptes_* variables; fix
> "repect" typo (Lance, David)
> - [patch 04] collapse_max_ptes_none() now returns 0 instead of -EINVAL for
> unsupported values; remove SCAN_INVALID_PTES_NONE; change return type
> from int to unsigned int and propagate to all callers; add comment above
> __collapse_huge_page_swapin explaining mTHP swap bail-out (David,
> Lorenzo, Lance, Wei Yang, Usama)
> - [patch 05] Rewrite collapse_huge_page lock comment to David's suggested
> wording (David)
> - [patch 11] Propagate unsigned int return type for max_ptes_none; remove
> the now-unnecessary negative return check (consequence of patch 04);
> Add optimization to the next_order goto that will prevent unnecessary
> iterations if there are no lower orders enabled (Vernon); update locking
> comment; pass VMA to mthp_collapse to improve uffd-armed detection, and
> prevent unnecessary work. (Wei)
> - [patch 14] Update documentation to reflect fallback-to-0 behavior
>
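For reference, the [patch 04] fallback behaviour listed above can be modelled
in userspace roughly as follows. This is only a sketch, not the kernel code:
model_max_ptes_none() and its boolean parameters are hypothetical stand-ins
for collapse_max_ptes_none(), cc->is_khugepaged and userfaultfd_armed(vma),
and the constants assume 4KiB base pages with a 2MiB PMD (HPAGE_PMD_ORDER of
9, so KHUGEPAGED_MAX_PTES_LIMIT is taken to be 511).

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed values for 4KiB base pages and a 2MiB PMD. */
#define HPAGE_PMD_ORDER			9
#define HPAGE_PMD_NR			(1u << HPAGE_PMD_ORDER)
#define KHUGEPAGED_MAX_PTES_LIMIT	(HPAGE_PMD_NR - 1)

/*
 * Hypothetical userspace model of the v18 decision order.
 * sysctl_val stands in for khugepaged_max_ptes_none.
 */
static unsigned int model_max_ptes_none(unsigned int sysctl_val,
					bool is_khugepaged, bool uffd_armed,
					unsigned int order)
{
	static bool warned;

	/* uffd-armed VMAs never tolerate empty/zeropage PTEs */
	if (uffd_armed)
		return 0;
	/* MADV_COLLAPSE is unrestricted */
	if (!is_khugepaged)
		return HPAGE_PMD_NR;
	/* PMD collapse respects the user defined maximum */
	if (order == HPAGE_PMD_ORDER)
		return sysctl_val;
	/* mTHP at the limit: scale to the order being collapsed */
	if (sysctl_val == KHUGEPAGED_MAX_PTES_LIMIT)
		return (1u << order) - 1;
	if (!sysctl_val)
		return 0;
	/* any intermediate value: warn once and fall back to 0 */
	if (!warned) {
		warned = true;
		fprintf(stderr, "unsupported max_ptes_none, defaulting to 0\n");
	}
	return 0;
}
```

Because the unsupported case now yields 0 rather than -EINVAL, the return type
can be unsigned and callers no longer need a negative-value check.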
Below is how v18 altered mm.git.
Quite a lot of it seems to be replacement of "//"-style comments. It's
unfortunate that this work isn't separated from the substantive
changes. We could have done that with a few followup fixes rather than
a wholesale replacement of the series.
Documentation/admin-guide/mm/transhuge.rst | 5
include/trace/events/huge_memory.h | 3
mm/khugepaged.c | 121 +++++++++----------
3 files changed, 66 insertions(+), 63 deletions(-)
--- a/Documentation/admin-guide/mm/transhuge.rst~b
+++ a/Documentation/admin-guide/mm/transhuge.rst
@@ -312,8 +312,9 @@ when collapsing a group of small pages i
For PMD-sized THP collapse, this directly limits the number of empty pages
allowed in the 2MB region.
-For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other value
-will emit a warning and no mTHP collapse will be attempted.
+For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. At
+HPAGE_PMD_NR - 1, we collapse to the highest possible order. Any intermediate
+value will emit a warning and mTHP collapse will default to max_ptes_none=0.
A higher value allows more empty pages, potentially leading to more memory
usage but better THP performance. A lower value is more conservative and
--- a/include/trace/events/huge_memory.h~b
+++ a/include/trace/events/huge_memory.h
@@ -39,8 +39,7 @@
EM( SCAN_STORE_FAILED, "store_failed") \
EM( SCAN_COPY_MC, "copy_poisoned_page") \
EM( SCAN_PAGE_FILLED, "page_filled") \
- EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") \
- EMe(SCAN_INVALID_PTES_NONE, "invalid_ptes_none")
+ EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
#undef EM
#undef EMe
--- a/mm/khugepaged.c~b
+++ a/mm/khugepaged.c
@@ -61,7 +61,6 @@ enum scan_result {
SCAN_COPY_MC,
SCAN_PAGE_FILLED,
SCAN_PAGE_DIRTY_OR_WRITEBACK,
- SCAN_INVALID_PTES_NONE,
};
#define CREATE_TRACE_POINTS
@@ -380,41 +379,43 @@ static bool pte_none_or_zero(pte_t pte)
}
/**
- * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
- * PTEs for the given collapse operation.
+ * collapse_max_ptes_none - Calculate maximum allowed empty PTEs or PTEs mapping
+ * the shared zeropage for the given collapse operation.
* @cc: The collapse control struct
* @vma: The vma to check for userfaultfd
* @order: The folio order being collapsed to
*
- * Return: Maximum number of none-page or zero-page PTEs allowed for the
- * collapse operation.
+ * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
*/
-static int collapse_max_ptes_none(struct collapse_control *cc,
+static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
struct vm_area_struct *vma, unsigned int order)
{
unsigned int max_ptes_none = khugepaged_max_ptes_none;
- // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
+
if (vma && userfaultfd_armed(vma))
return 0;
- // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
+ /* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
if (!cc->is_khugepaged)
return HPAGE_PMD_NR;
- // for PMD collapse, respect the user defined maximum.
+ /* for PMD collapse, respect the user defined maximum */
if (is_pmd_order(order))
return max_ptes_none;
- /* Zero/non-present collapse disabled. */
- if (!max_ptes_none)
- return 0;
- // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
- // scale the maximum number of PTEs to the order of the collapse.
+ /*
+ * for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
+ * scale the maximum number of PTEs to the order of the collapse.
+ */
if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
return (1 << order) - 1;
-
- // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
- // Emit a warning and return -EINVAL.
- pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
- KHUGEPAGED_MAX_PTES_LIMIT);
- return -EINVAL;
+ if (!max_ptes_none)
+ return 0;
+ /*
+ * For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT,
+ * emit a warning and return 0.
+ */
+ pr_warn_once("mTHP collapse does not support max_ptes_none values"
+ " other than 0 or %u, defaulting to 0.\n",
+ KHUGEPAGED_MAX_PTES_LIMIT);
+ return 0;
}
/**
@@ -429,15 +430,19 @@ static int collapse_max_ptes_none(struct
static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
unsigned int order)
{
- // for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
- // anonymous pages.
+ /*
+ * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared
+ * anonymous pages.
+ */
if (!cc->is_khugepaged)
return HPAGE_PMD_NR;
- // for mTHP collapse do not allow collapsing anonymous memory pages that
- // are shared between processes.
+ /*
+ * for mTHP collapse do not allow collapsing anonymous memory pages that
+ * are shared between processes.
+ */
if (!is_pmd_order(order))
return 0;
- // for PMD collapse, respect the user defined maximum.
+ /* for PMD collapse, respect the user defined maximum */
return khugepaged_max_ptes_shared;
}
@@ -453,14 +458,16 @@ static unsigned int collapse_max_ptes_sh
static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
unsigned int order)
{
- // for MADV_COLLAPSE, do not restrict the number PTEs entries or
- // pagecache entries that are non-present.
+ /*
+ * For MADV_COLLAPSE, do not restrict the number PTEs entries or
+ * pagecache entries that are non-present.
+ */
if (!cc->is_khugepaged)
return HPAGE_PMD_NR;
- // for mTHP collapse do not allow any non-present PTEs or pagecache entries.
+ /* for mTHP collapse do not allow any non-present PTEs or pagecache entries */
if (!is_pmd_order(order))
return 0;
- // for PMD collapse, respect the user defined maximum.
+ /* for PMD collapse, respect the user defined maximum */
return khugepaged_max_ptes_swap;
}
@@ -593,9 +600,8 @@ static unsigned long collapse_allowable_
void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
- if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
- collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED) &&
- hugepage_enabled())
+ if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) && hugepage_enabled()
+ && collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
__khugepaged_enter(vma->vm_mm);
}
@@ -670,6 +676,8 @@ static enum scan_result __collapse_huge_
unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
unsigned int order, struct list_head *compound_pagelist)
{
+ const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
+ const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
const unsigned long nr_pages = 1UL << order;
struct page *page = NULL;
struct folio *folio = NULL;
@@ -677,11 +685,6 @@ static enum scan_result __collapse_huge_
pte_t *_pte;
int none_or_zero = 0, shared = 0, referenced = 0;
enum scan_result result = SCAN_FAIL;
- int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
- unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
-
- if (max_ptes_none < 0)
- return SCAN_INVALID_PTES_NONE;
for (_pte = pte; _pte < pte + nr_pages;
_pte++, addr += PAGE_SIZE) {
@@ -1136,6 +1139,10 @@ static enum scan_result check_pmd_still_
* Bring missing pages in from swap, to complete THP collapse.
* Only done if khugepaged_scan_pmd believes it is worthwhile.
*
+ * For mTHP orders the function bails on the first swap entry, because
+ * faulting pages back in during collapse could re-populate PTEs that
+ * push a later scan over the threshold for a higher-order collapse.
+ *
* Called and returns without pte mapped or spinlocks held.
* Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
*/
@@ -1257,19 +1264,18 @@ static enum scan_result alloc_charge_fol
return SCAN_CGROUP_CHARGE_FAIL;
}
- count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
+ if (is_pmd_order(order))
+ count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
*foliop = folio;
return SCAN_SUCCEED;
}
/*
- * collapse_huge_page expects the mmap_read_lock to be dropped before
- * entering this function. The function will also always return with the lock
- * dropped. The function starts by allocation a folio, which can potentially
- * take a long time if it involves sync compaction, and we do not need to hold
- * the mmap_lock during that. We must recheck the vma after taking it again in
- * write mode.
+ * collapse_huge_page expects the mmap_lock to be unlocked before entering and
+ * will always return with the lock unlocked, to avoid holding the mmap_lock
+ * while allocating a THP, as that could trigger direct reclaim/compaction.
+ * Note that the VMA must be rechecked after grabbing the mmap_lock again.
*/
static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
int referenced, int unmapped, struct collapse_control *cc,
@@ -1500,12 +1506,12 @@ static unsigned int collapse_mthp_count_
* If a collapse is permitted, we attempt to collapse the PTE range into a
* mTHP.
*/
-static int mthp_collapse(struct mm_struct *mm, unsigned long address,
- int referenced, int unmapped, struct collapse_control *cc,
- unsigned long enabled_orders)
+static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int referenced, int unmapped,
+ struct collapse_control *cc, unsigned long enabled_orders)
{
- unsigned int nr_occupied_ptes, nr_ptes;
- int max_ptes_none, collapsed = 0, stack_size = 0;
+ unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
+ int collapsed = 0, stack_size = 0;
unsigned long collapse_address;
struct mthp_range range;
u16 offset;
@@ -1522,10 +1528,7 @@ static int mthp_collapse(struct mm_struc
if (!test_bit(order, &enabled_orders))
goto next_order;
- max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
-
- if (max_ptes_none < 0)
- return collapsed;
+ max_ptes_none = collapse_max_ptes_none(cc, vma, order);
nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
nr_ptes);
@@ -1565,7 +1568,7 @@ static int mthp_collapse(struct mm_struc
}
next_order:
- if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
+ if ((BIT(order) - 1) & enabled_orders) {
const u8 next_order = order - 1;
const u16 mid_offset = offset + (nr_ptes / 2);
@@ -1582,9 +1585,9 @@ static enum scan_result collapse_scan_pm
struct vm_area_struct *vma, unsigned long start_addr,
bool *lock_dropped, struct collapse_control *cc)
{
- int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
+ unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
pmd_t *pmd;
pte_t *pte, *_pte, pteval;
@@ -1772,9 +1775,9 @@ out_unmap:
if (result == SCAN_SUCCEED) {
/* collapse_huge_page expects the lock to be dropped before calling */
mmap_read_unlock(mm);
- nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
- cc, enabled_orders);
- /* collapse_huge_page will return with the mmap_lock released */
+ nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
+ unmapped, cc, enabled_orders);
+ /* mmap_lock was released above, set lock_dropped */
*lock_dropped = true;
result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
}
@@ -2665,7 +2668,7 @@ static enum scan_result collapse_scan_fi
unsigned long addr, struct file *file, pgoff_t start,
struct collapse_control *cc)
{
- const int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
+ const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
struct folio *folio = NULL;
struct address_space *mapping = file->f_mapping;
_
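The next_order change in mthp_collapse() relies on a small bitmask identity:
(BIT(order) - 1) has every bit below `order` set, so ANDing it with
enabled_orders is nonzero only if some lower order is enabled. A minimal
userspace sketch of just that test follows; has_lower_enabled_order() is a
hypothetical helper name used for illustration, and BIT() is redefined
locally rather than taken from the kernel headers.

```c
#include <stdbool.h>

/* Local stand-in for the kernel's BIT() macro. */
#define BIT(n)	(1UL << (n))

/*
 * Hypothetical helper: true if any order strictly below 'order' is set
 * in the enabled_orders bitmap.
 */
static bool has_lower_enabled_order(unsigned long enabled_orders,
				    unsigned int order)
{
	/* BIT(order) - 1 masks bits 0..order-1 */
	return ((BIT(order) - 1) & enabled_orders) != 0;
}
```

With only order 4 enabled, for example, the check is true at order 9 but false
at order 4, so the walk stops descending once no smaller enabled order remains.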