Now that we can collapse to mTHPs, let's update the admin guide to
reflect these changes and provide proper guidance on how to use them.
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
Documentation/admin-guide/mm/transhuge.rst | 53 ++++++++++++----------
1 file changed, 30 insertions(+), 23 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 7c71cda8aea1..2569a92fd96c 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,8 @@ often.
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages of either PMD size
+or mTHP sizes, if the system is configured to do so.
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls.
@@ -212,17 +213,17 @@ PMD-mappable transparent hugepage::
All THPs at fault and collapse time will be added to _deferred_list,
and will therefore be split under memory presure if they are considered
"underused". A THP is underused if the number of zero-filled pages in
-the THP is above max_ptes_none (see below). It is possible to disable
-this behaviour by writing 0 to shrink_underused, and enable it by writing
-1 to it::
+the THP is above max_ptes_none (see below) scaled by the THP order. It is
+possible to disable this behaviour by writing 0 to shrink_underused, and enable
+it by writing 1 to it::
echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
-khugepaged will be automatically started when PMD-sized THP is enabled
+khugepaged will be automatically started when any THP size is enabled
(either of the per-size anon control or the top-level control are set
to "always" or "madvise"), and it'll be automatically shutdown when
-PMD-sized THP is disabled (when both the per-size anon control and the
+all THP sizes are disabled (when both the per-size anon control and the
top-level control are "never")
process THP controls
@@ -264,11 +265,6 @@ support the following arguments::
Khugepaged controls
-------------------
-.. note::
- khugepaged currently only searches for opportunities to collapse to
- PMD-sized THP and no attempt is made to collapse to other THP
- sizes.
-
khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
@@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
-being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
-one 2M hugepage. Each may happen independently, or together, depending on
-the type of memory and the failures that occur. As such, this value should
-be interpreted roughly as a sign of progress, and counters in /proc/vmstat
-consulted for more accurate accounting)::
+being replaced by a PMD mapping, or (2) physical pages replaced by one
+hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
+or together, depending on the type of memory and the failures that occur.
+As such, this value should be interpreted roughly as a sign of progress,
+and counters in /proc/vmstat consulted for more accurate accounting)::
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
@@ -308,16 +304,18 @@ for each pass::
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
-``max_ptes_none`` specifies how many extra small pages (that are
-not already mapped) can be allocated when collapsing a group
-of small pages into one large page::
+``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
+when collapsing a group of small pages into one large page::
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
-A higher value leads to use additional memory for programs.
-A lower value leads to gain less thp performance. Value of
-max_ptes_none can waste cpu time very little, you can
-ignore it.
+For PMD-sized THP collapse, this directly limits the number of empty pages
+allowed in the 2MB region. For mTHP collapse, the kernel might use a more
+conservative value when determining eligibility.
+
+A higher value allows more empty pages, potentially leading to more memory
+usage but better THP performance. A lower value is more conservative and
+may result in fewer THP collapses.
``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::
@@ -337,6 +335,15 @@ that THP is shared. Exceeding the number would block the collapse::
A higher value may increase memory footprint for some workloads.
+.. note::
+ For mTHP collapse, khugepaged does not support collapsing regions that
+ contain shared or swapped out pages, as this could lead to continuous
+ promotion to higher orders. The collapse will fail if any shared or
+ swapped PTEs are encountered during the scan.
+
+ Currently, madvise_collapse only supports collapsing to PMD-sized THPs
+ and does not attempt mTHP collapses.
+
Boot parameters
===============
--
2.51.0
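The documentation text above says the limit is "scaled by the THP order" without spelling out the arithmetic. A minimal sketch of one plausible reading follows, assuming the scaling is a right shift by the difference between the PMD order and the target order, and assuming 4K base pages with a 2M PMD (order 9). The helper name ``scaled_max_ptes_none`` and the ``PMD_ORDER`` constant are illustrative only and are not taken from the patch:

```c
#include <assert.h>

/* Assumption: 4K base pages and a 2M PMD, so the PMD order is 9. */
#define PMD_ORDER 9

/*
 * Hypothetical helper: scale a PMD-level max_ptes_none value down to a
 * smaller THP order by shifting right by the order difference.
 * This is one plausible reading of "scaled by the THP order", not a
 * statement of what the kernel does.
 */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	assert(order <= PMD_ORDER);
	return max_ptes_none >> (PMD_ORDER - order);
}
```

Under these assumptions, a ``max_ptes_none`` of 511 would allow up to 15 empty PTEs in an order-4 (64K) region, since 511 >> 5 is 15.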
On Wed, 22 Oct 2025, Nico Pache wrote:

> + Currently, madvise_collapse only supports collapsing to PMD-sized THPs
> + and does not attempt mTHP collapses.
> +

madvise collapse is frequently used as far as I can tell from the THP
loads being tested. Could we support madvise collapse for mTHP?
On 22.10.25 21:52, Christoph Lameter (Ampere) wrote:
> On Wed, 22 Oct 2025, Nico Pache wrote:
>
>> + Currently, madvise_collapse only supports collapsing to PMD-sized THPs
>> + and does not attempt mTHP collapses.
>> +
>
> madvise collapse is frequently used as far as I can tell from the THP
> loads being tested. Could we support madvise collapse for mTHP?

The big question is still how user space can communicate the desired order,
and how we can not break existing users.

So I guess there will definitely be some support to trigger collapse to mTHP
in the future, the big question is through which interface. So it will happen
after this series.

Maybe through process_madvise() where we have an additional parameter, I
think that was what people discussed in the past.

--
Cheers

David / dhildenb
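For context on the interface being discussed, here is a minimal sketch of how user space can request a PMD-sized collapse today with plain madvise(2). It assumes a 2M PMD size and the asm-generic value of ``MADV_COLLAPSE``. The helpers ``pmd_floor``, ``pmd_ceil`` and ``collapse_pmd_range`` are hypothetical names for illustration; they only make explicit which part of a range can be covered by whole PMD-sized regions:

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* asm-generic UAPI value; some arches differ */
#endif

/* Assumption: 2M PMD-sized THP (e.g. x86-64 with 4K base pages). */
#define PMD_SIZE ((uintptr_t)2 << 20)

/* Round an address down to a PMD boundary. */
static uintptr_t pmd_floor(uintptr_t addr)
{
	return addr & ~(PMD_SIZE - 1);
}

/* Round an address up to a PMD boundary. */
static uintptr_t pmd_ceil(uintptr_t addr)
{
	return (addr + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
}

/*
 * Hypothetical wrapper: ask the kernel to collapse every PMD-sized region
 * fully contained in [addr, addr + len). Returns 0 if no such region
 * exists, otherwise the madvise() result (-1 with errno set on failure).
 */
static int collapse_pmd_range(void *addr, size_t len)
{
	uintptr_t start = pmd_ceil((uintptr_t)addr);
	uintptr_t end = pmd_floor((uintptr_t)addr + len);

	if (start >= end)
		return 0;	/* range too small to hold one PMD region */
	return madvise((void *)start, end - start, MADV_COLLAPSE);
}
```

Note that nothing in this call lets the caller name a target order, which is the gap the thread is debating.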
On Wed, 22 Oct 2025, David Hildenbrand wrote:
> The big question is still how user space can communicate the desired order,
> and how we can not break existing users.
>
> So I guess there will definitely be some support to trigger collapse to mTHP
> in the future, the big question is through which interface. So it will happen
> after this series.
Well we have a possibility of a memory policy for each VMA and we can set
memory policies for arbitrary memory ranges as well as per process through
the existing APIs from user space.
Extending the memory policies by a parameter to allow setting a preferred
order would allow us to use this mechanism.
Memory policies can already be used to control NUMA balancing and
migration. The ability to specify page sizes is similar I think.
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 8fbbe613611a..429117bbd2f4 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -31,6 +31,7 @@ enum {
#define MPOL_F_STATIC_NODES (1 << 15)
#define MPOL_F_RELATIVE_NODES (1 << 14)
#define MPOL_F_NUMA_BALANCING (1 << 13) /* Optimize with NUMA balancing if possible */
+#define MPOL_F_PAGE_ORDER (1 << 12)
/*
* MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
@@ -56,6 +57,9 @@ enum {
MPOL_MF_MOVE | \
MPOL_MF_MOVE_ALL)
+#define MPOL_MF_PAGE_ORDER (1<<5) /* Set preferred page order */
+
+
/*
* Internal flags that share the struct mempolicy flags word with
* "mode flags". These flags are allocated from bit 0 up, as they
On Wed, Oct 22, 2025 at 10:22:08PM +0200, David Hildenbrand wrote:
> On 22.10.25 21:52, Christoph Lameter (Ampere) wrote:
> > On Wed, 22 Oct 2025, Nico Pache wrote:
> >
> > > + Currently, madvise_collapse only supports collapsing to PMD-sized THPs
> > > + and does not attempt mTHP collapses.
> > > +
> >
> > madvise collapse is frequently used as far as I can tell from the THP
> > loads being tested. Could we support madvise collapse for mTHP?
>
> The big question is still how user space can communicate the desired order,
> and how we can not break existing users.

Yes, and let's go one step at a time, this series still needs careful scrutiny
and we need to ensure the _fundamentals_ are in place for khugepaged before we
get into MADV_COLLAPSE :)

>
> So I guess there will definitely be some support to trigger collapse to mTHP
> in the future, the big question is through which interface. So it will
> happen after this series.

Yes.

>
> Maybe through process_madvise() where we have an additional parameter, I
> think that was what people discussed in the past.

I wouldn't absolutely love us doing that, given it is a general parameter so
would seem applicable to any madvise() option and could lead to confusion, also
process_madvise() was originally for cross-process madvise vector operations.

I expanded this to make it applicable to the current process (and introduced
PIDFD_SELF to make that more sane), and SJ has optimised it across vector
operations (thanks SJ! :), but in general - it seems very weird to have
madvise() provide an operation that process_madvise() provides another version
of that has an extra parameter.

As usual we've painted ourselves into a corner with an API... :)

Perhaps we'll have to accept the process_madvise() compromise and add
MADV_COLLAPSE_MHTP that only works with it or something.

Of course adding a new syscall isn't impossible... madvise2() not very appealing
however...

TL;DR I guess we'll deal with that when we come to it :)

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo
On Thu, Oct 23, 2025 at 09:00:10AM +0100, Lorenzo Stoakes wrote:
> On Wed, Oct 22, 2025 at 10:22:08PM +0200, David Hildenbrand wrote:
> > On 22.10.25 21:52, Christoph Lameter (Ampere) wrote:
> > > On Wed, 22 Oct 2025, Nico Pache wrote:
> > >
> > > > + Currently, madvise_collapse only supports collapsing to PMD-sized THPs
> > > > + and does not attempt mTHP collapses.
> > > > +
> > >
> > > madvise collapse is frequently used as far as I can tell from the THP
> > > loads being tested. Could we support madvise collapse for mTHP?
> >
> > The big question is still how user space can communicate the desired order,
> > and how we can not break existing users.
>

Do we want to let userspace communicate order? It seems like an extremely
specific thing to do. A more simple&sane semantic could be something like:
"MADV_COLLAPSE collapses a given [addr, addr+len] range into the highest
order THP it can/thinks it should.". The implementation details of PMD or
contpte or <...> are lost by the time we get to userspace.

The man page itself is pretty vaguely written to allow us to do whatever
we want. It sounds to me that allowing userspace to create arbitrary order
mTHPs would be another pandora's box we shouldn't get into.

> Yes, and let's go one step at a time, this series still needs careful scrutiny
> and we need to ensure the _fundamentals_ are in place for khugepaged before we
> get into MADV_COLLAPSE :)
>
> >
> > So I guess there will definitely be some support to trigger collapse to mTHP
> > in the future, the big question is through which interface. So it will
> > happen after this series.
>
> Yes.
>
> >
> > Maybe through process_madvise() where we have an additional parameter, I
> > think that was what people discussed in the past.
>
> I wouldn't absolutely love us doing that, given it is a general parameter so
> would seem applicable to any madvise() option and could lead to confusion, also
> process_madvise() was originally for cross-process madvise vector operations.

For what it's worth, it would probably not be too hard to devise a generic
separation there between "generic flags" and "behavior-specific flags".
And then stuff the desired THP order into MADV_COLLAPSE-specific flags.

>
> I expanded this to make it applicable to the current process (and introduced
> PIDFD_SELF to make that more sane), and SJ has optimised it across vector
> operations (thanks SJ! :), but in general - it seems very weird to have
> madvise() provide an operation that process_madvise() provides another version
> of that has an extra parameter.
>
> As usual we've painted ourselves into a corner with an API... :)

But yes, I agree it would feel weird.

>
> Perhaps we'll have to accept the process_madvise() compromise and add
> MADV_COLLAPSE_MHTP that only works with it or something.
>
> Of course adding a new syscall isn't impossible... madvise2() not very appealing
> however...

It is my impression that process_madvise() is already madvise2(), but
poorly named.

>
> TL;DR I guess we'll deal with that when we come to it :)

Amen :)

--
Pedro
On Thu, Oct 23, 2025 at 1:44 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Thu, Oct 23, 2025 at 09:00:10AM +0100, Lorenzo Stoakes wrote:
> > On Wed, Oct 22, 2025 at 10:22:08PM +0200, David Hildenbrand wrote:
> > > On 22.10.25 21:52, Christoph Lameter (Ampere) wrote:
> > > > On Wed, 22 Oct 2025, Nico Pache wrote:
> > > >
> > > > > + Currently, madvise_collapse only supports collapsing to PMD-sized THPs
> > > > > + and does not attempt mTHP collapses.
> > > > > +
> > > >
> > > > madvise collapse is frequently used as far as I can tell from the THP
> > > > loads being tested. Could we support madvise collapse for mTHP?
> > >
> > > The big question is still how user space can communicate the desired order,
> > > and how we can not break existing users.
> >
>
> Do we want to let userspace communicate order? It seems like an extremely
> specific thing to do. A more simple&sane semantic could be something like:
> "MADV_COLLAPSE collapses a given [addr, addr+len] range into the highest
> order THP it can/thinks it should.". The implementation details of PMD or
> contpte or <...> are lost by the time we get to userspace.
>
> The man page itself is pretty vaguely written to allow us to do whatever
> we want. It sounds to me that allowing userspace to create arbitrary order
> mTHPs would be another pandora's box we shouldn't get into.
>
> > Yes, and let's go one step at a time, this series still needs careful scrutiny
> > and we need to ensure the _fundamentals_ are in place for khugepaged before we
> > get into MADV_COLLAPSE :)
> >
> > >
> > > So I guess there will definitely be some support to trigger collapse to mTHP
> > > in the future, the big question is through which interface. So it will
> > > happen after this series.
> >
> > Yes.
> >
> > >
> > > Maybe through process_madvise() where we have an additional parameter, I
> > > think that was what people discussed in the past.
> >
> > I wouldn't absolutely love us doing that, given it is a general parameter so
> > would seem applicable to any madvise() option and could lead to confusion, also
> > process_madvise() was originally for cross-process madvise vector operations.
>
> For what it's worth, it would probably not be too hard to devise a generic
> separation there between "generic flags" and "behavior-specific flags".
> And then stuff the desired THP order into MADV_COLLAPSE-specific flags.

Yeah, this is how I envisioned the flags to be leveraged; reserve some
number of bits for generic, and overload the others for
advice-specific. I suspect once the seal is broken on this, more
advice-specific flags will promptly follow.

>
> >
> > I expanded this to make it applicable to the current process (and introduced
> > PIDFD_SELF to make that more sane), and SJ has optimised it across vector
> > operations (thanks SJ! :), but in general - it seems very weird to have
> > madvise() provide an operation that process_madvise() provides another version
> > of that has an extra parameter.
> >
> > As usual we've painted ourselves into a corner with an API... :)
>
> But yes, I agree it would feel weird.
>
> >
> > Perhaps we'll have to accept the process_madvise() compromise and add
> > MADV_COLLAPSE_MHTP that only works with it or something.
> >
> > Of course adding a new syscall isn't impossible... madvise2() not very appealing
> > however...
>
> It is my impression that process_madvise() is already madvise2(), but
> poorly named.

+1

>
> >
> > TL;DR I guess we'll deal with that when we come to it :)
>
> Amen :)
>
> --
> Pedro
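To make the "generic flags" versus "advice-specific flags" idea concrete, here is a purely hypothetical sketch of how a single flags word could be split. Nothing here is an existing or proposed kernel ABI: the 16/16 bit split, the 5-bit order field, and all names (``FLAGS_GENERIC_MASK``, ``FLAGS_ADVICE_SHIFT``, ``COLLAPSE_ORDER_MASK``, ``collapse_flags_pack`` and friends) are assumptions made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 16 bits generic, high 16 bits advice-specific. */
#define FLAGS_GENERIC_MASK	0x0000ffffu
#define FLAGS_ADVICE_SHIFT	16

/*
 * Hypothetical MADV_COLLAPSE-specific field: desired THP order in the
 * lowest 5 advice-specific bits, with 0 meaning "let the kernel decide".
 */
#define COLLAPSE_ORDER_MASK	0x1fu

/* Combine generic flags with a desired collapse order into one word. */
static uint32_t collapse_flags_pack(uint32_t generic, unsigned int order)
{
	assert((generic & ~FLAGS_GENERIC_MASK) == 0);
	assert(order <= COLLAPSE_ORDER_MASK);
	return generic | ((uint32_t)order << FLAGS_ADVICE_SHIFT);
}

/* Extract the desired collapse order from a packed flags word. */
static unsigned int collapse_flags_order(uint32_t flags)
{
	return (flags >> FLAGS_ADVICE_SHIFT) & COLLAPSE_ORDER_MASK;
}

/* Extract only the generic part of a packed flags word. */
static uint32_t flags_generic(uint32_t flags)
{
	return flags & FLAGS_GENERIC_MASK;
}
```

With a split like this, an unset order field could preserve the "highest order the kernel thinks it should use" semantic suggested above, while still leaving room for callers that want to name an order.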