[RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Usama Arif 5 days, 2 hours ago
This is an RFC series to implement 1GB PUD-level THPs, allowing
applications to benefit from reduced TLB pressure without requiring
hugetlbfs. The patches are based on top of
f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).

Motivation: Why 1GB THP over hugetlbfs?
=======================================

While hugetlbfs provides 1GB huge pages today, it has significant limitations
that make it unsuitable for many workloads:

1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
   or runtime, taking that memory away from the rest of the system. This
   requires capacity planning and administrative overhead, and makes workload
   orchestration much more complex, especially when colocating with workloads
   that don't use hugetlbfs.

2. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
   rather than falling back to smaller pages. This makes it fragile under
   memory pressure.

3. No Splitting: hugetlbfs pages cannot be split when only partial access
   is needed, leading to memory waste and preventing partial reclaim.

4. Memory Accounting: hugetlbfs memory is accounted separately and cannot
   be easily shared with regular memory pools.

PUD THP solves these limitations by integrating 1GB pages into the existing
THP infrastructure.

Performance Results
===================

Benchmark results of these patches on Intel Xeon Platinum 8321HC:

Test: True Random Memory Access [1] test of a 4GB memory region with a
pointer-chasing workload (4M random pointer dereferences through memory):

| Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
|-------------------|---------------|---------------|--------------|
| Memory access     | 88 ms         | 134 ms        | 34% faster   |
| Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |

Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
For long-running workloads this will be a one-off cost, and the 34%
improvement in access latency provides significant benefit.

ARM with 64K PAGE_SIZE supports 512M PMD THPs. At Meta, we have a CPU-bound
workload running on a large number of ARM servers (256G). I set the 512M THP
setting to always on 100 servers in production (didn't really have high
expectations :)). The average memory used by the workload increased from 217G
to 233G. The amount of memory backed by 512M pages was 68G! dTLB misses went
down by 26% and the PID multiplier increased input by 5.9% (a very significant
improvement in workload performance). A significant number of these THPs were
faulted in at application start and were present across different VMAs. Of
course, getting these 512M pages is easier on ARM due to the bigger PAGE_SIZE
and pageblock order.

I am hoping that these patches for 1G THP can be used to provide similar
benefits for x86. I expect workloads to fault them in at start time when there
is plenty of free memory available.


Previous attempt by Zi Yan
==========================

Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
significant changes in the kernel since then, including the folio conversion,
the mTHP framework, ptdesc, rmap changes, etc. I found it easier to use the
current PMD code as a reference for making 1G PUD THP work. I am hoping Zi can
provide guidance on these patches!

Major Design Decisions
======================

1. No shared 1G zero page: The memory cost would be quite significant!

2. Page Table Pre-deposit Strategy
   PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
   page tables (one for each potential PMD entry after split).
   We allocate a PMD page table and use its pmd_huge_pte list to store
   the deposited PTE tables. This ensures split operations don't fail due
   to page table allocation failures, at the cost of 2M per PUD THP. A
   rough sketch is shown after this list.

3. Split to Base Pages
   When a PUD THP must be split (COW, partial unmap, mprotect), we split
   directly to base pages (262,144 PTEs). The ideal thing would be to split
   to 2M pages and then to 4K pages if needed. However, this would require
   significant rmap and mapcount tracking changes.

4. COW and fork handling via split
   Copy-on-write and fork for a PUD THP trigger a split to base pages, then
   use the existing PTE-level COW infrastructure. Getting another 1G region
   is hard and could fail, and if only a single 4K page is written, copying
   1G is a waste. Perhaps this should only be done on CoW and not on fork?

5. Migration via split
   Split the PUD to PTEs and migrate the individual pages. It is going to be
   difficult to find 1G of contiguous memory to migrate to. Maybe it's better
   to not allow migration of PUDs at all? I am more tempted to not allow
   migration, but have kept splitting in this RFC.
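
As referenced in (2) above, here is a rough sketch of the pre-deposit step.
It is illustrative only: the helper names marked "hypothetical" are not the
ones the series adds, and locking and full error unwinding are omitted.

  /*
   * Sketch: allocate one PMD page table plus PTRS_PER_PMD PTE page tables
   * up front and chain the PTE tables on the PMD page's pmd_huge_pte list,
   * so that a later PUD split cannot fail due to allocation failure.
   */
  static pmd_t *pud_thp_prealloc_page_tables(struct mm_struct *mm,
                                             unsigned long addr)
  {
          pmd_t *pmd_table = pmd_alloc_one(mm, addr);
          int i;

          if (!pmd_table)
                  return NULL;

          for (i = 0; i < PTRS_PER_PMD; i++) {
                  pgtable_t pte_table = pte_alloc_one(mm);

                  if (!pte_table) {
                          /* hypothetical: free the PMD table and any PTE
                           * tables deposited so far */
                          pud_thp_free_deposited(mm, pmd_table);
                          return NULL;
                  }
                  /* hypothetical: chain on the pmd_huge_pte list */
                  pud_thp_deposit_pte(mm, pmd_table, pte_table);
          }
          return pmd_table;
  }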


Reviewers guide
===============

Most of the code is written by adapting the existing PMD code. For example,
the PUD page fault path is very similar to the PMD one; the differences are
the lack of a shared zero page and the page table deposit strategy. I think
the easiest way to review this series is to compare it with the PMD code.
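
To orient reviewers, the rough shape of the PUD anonymous fault path is
outlined below. This is an outline only, not code from the series; the
numbered comments are the point and the function body is a placeholder.

  /*
   * Outline only -- mirrors do_huge_pmd_anonymous_page(); the differences
   * are the absence of a shared zero page path and the larger deposit.
   */
  static vm_fault_t do_huge_pud_anonymous_page_outline(struct vm_fault *vmf)
  {
          /* 1. Check that the VMA covers a PUD_SIZE-aligned range around
           *    vmf->address and that PUD THP is allowed for this VMA;
           *    otherwise return VM_FAULT_FALLBACK. */

          /* 2. Allocate and memcg-charge a HPAGE_PUD_ORDER folio; on
           *    failure fall back so smaller THP sizes or base pages are
           *    used instead. */

          /* 3. Pre-deposit one PMD table plus 512 PTE tables (design
           *    decision 2 above) so a later split cannot fail on
           *    allocation. */

          /* 4. Take the PUD lock, recheck pud_none(*vmf->pud), install the
           *    PUD entry, add the folio to the anon rmap and the LRU, and
           *    update the fault counters. */

          return 0;
  }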

Test results
============

  1..7
  # Starting 7 tests from 1 test cases.
  #  RUN           pud_thp.basic_allocation ...
  # pud_thp_test.c:169:basic_allocation:PUD THP allocated (anon_fault_alloc: 0 -> 1)
  #            OK  pud_thp.basic_allocation
  ok 1 pud_thp.basic_allocation
  #  RUN           pud_thp.read_write_access ...
  #            OK  pud_thp.read_write_access
  ok 2 pud_thp.read_write_access
  #  RUN           pud_thp.fork_cow ...
  # pud_thp_test.c:236:fork_cow:Fork COW completed (thp_split_pud: 0 -> 1)
  #            OK  pud_thp.fork_cow
  ok 3 pud_thp.fork_cow
  #  RUN           pud_thp.partial_munmap ...
  # pud_thp_test.c:267:partial_munmap:Partial munmap completed (thp_split_pud: 1 -> 2)
  #            OK  pud_thp.partial_munmap
  ok 4 pud_thp.partial_munmap
  #  RUN           pud_thp.mprotect_split ...
  # pud_thp_test.c:293:mprotect_split:mprotect split completed (thp_split_pud: 2 -> 3)
  #            OK  pud_thp.mprotect_split
  ok 5 pud_thp.mprotect_split
  #  RUN           pud_thp.reclaim_pageout ...
  # pud_thp_test.c:322:reclaim_pageout:Reclaim completed (thp_split_pud: 3 -> 4)
  #            OK  pud_thp.reclaim_pageout
  ok 6 pud_thp.reclaim_pageout
  #  RUN           pud_thp.migration_mbind ...
  # pud_thp_test.c:356:migration_mbind:Migration completed (thp_split_pud: 4 -> 5)
  #            OK  pud_thp.migration_mbind
  ok 7 pud_thp.migration_mbind
  # PASSED: 7 / 7 tests passed.
  # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0

[1] https://gist.github.com/uarif1/bf279b2a01a536cda945ff9f40196a26
[2] https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/

Signed-off-by: Usama Arif <usamaarif642@gmail.com>

Usama Arif (12):
  mm: add PUD THP ptdesc and rmap support
  mm/thp: add mTHP stats infrastructure for PUD THP
  mm: thp: add PUD THP allocation and fault handling
  mm: thp: implement PUD THP split to PTE level
  mm: thp: add reclaim and migration support for PUD THP
  selftests/mm: add PUD THP basic allocation test
  selftests/mm: add PUD THP read/write access test
  selftests/mm: add PUD THP fork COW test
  selftests/mm: add PUD THP partial munmap test
  selftests/mm: add PUD THP mprotect split test
  selftests/mm: add PUD THP reclaim test
  selftests/mm: add PUD THP migration test

 include/linux/huge_mm.h                   |  60 ++-
 include/linux/mm.h                        |  19 +
 include/linux/mm_types.h                  |   5 +-
 include/linux/pgtable.h                   |   8 +
 include/linux/rmap.h                      |   7 +-
 mm/huge_memory.c                          | 535 +++++++++++++++++++++-
 mm/internal.h                             |   3 +
 mm/memory.c                               |   8 +-
 mm/migrate.c                              |  17 +
 mm/page_vma_mapped.c                      |  35 ++
 mm/pgtable-generic.c                      |  83 ++++
 mm/rmap.c                                 |  96 +++-
 mm/vmscan.c                               |   2 +
 tools/testing/selftests/mm/Makefile       |   1 +
 tools/testing/selftests/mm/pud_thp_test.c | 360 +++++++++++++++
 15 files changed, 1197 insertions(+), 42 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pud_thp_test.c

-- 
2.47.3
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Lorenzo Stoakes 4 days, 16 hours ago
OK so this is somewhat unexpected :)

It would have been nice to discuss it in the THP cabal or at a conference
etc. so we could discuss approaches ahead of time. Communication is important,
especially with major changes like this.

And PUD THP is especially problematic in that it requires pages that the page
allocator can't give us, presumably you're doing something with CMA and... it's
a whole kettle of fish.

It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
but it's kinda a weird sorta special case that we need to keep supporting.

There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
mTHP (and really I want to see Nico's series land before we really consider
this).

So overall, I want to be very cautious and SLOW here. So let's please not drop
the RFC tag until David and I are ok with that?

Also the THP code base is in _dire_ need of rework, and I don't really want to
add major new features without us paying down some technical debt, to be honest.

So let's proceed with caution, and treat this as a very early bit of
experimental code.

Thanks, Lorenzo
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Usama Arif 3 days, 2 hours ago

On 02/02/2026 03:20, Lorenzo Stoakes wrote:
> OK so this is somewhat unexpected :)
> 
> It would have been nice to discuss it in the THP cabal or at a conference
> etc. so we could discuss approaches ahead of time. Communication is important,
> especially with major changes like this.

Makes sense!

> 
> And PUD THP is especially problematic in that it requires pages that the page
> allocator can't give us, presumably you're doing something with CMA and... it's
> a whole kettle of fish.

So we don't need CMA. It helps of course, but we don't *need* it.
It's summarized in the first reply I gave to Zi in [1].

> 
> It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
> but it's kinda a weird sorta special case that we need to keep supporting.
> 
> There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
> mTHP (and really I want to see Nico's series land before we really consider
> this).


So I have numbers and experiments for page faults, which are in the cover letter,
but not for khugepaged. I would be very surprised (although pleasantly :)) if
khugepaged by some magic finds 262144 pages that meet all the khugepaged
requirements to collapse the page. In the basic infrastructure support which this
series is adding, I want to keep khugepaged collapse disabled for 1G pages. This
is also the initial approach that was taken for other mTHP sizes. We should go
slow with 1G THPs.

> 
> So overall, I want to be very cautious and SLOW here. So let's please not drop
> the RFC tag until David and I are ok with that?
> 
> Also the THP code base is in _dire_ need of rework, and I don't really want to
> add major new features without us paying down some technical debt, to be honest.
> 
> So let's proceed with caution, and treat this as a very early bit of
> experimental code.
> 
> Thanks, Lorenzo

Ack, yeah so this is mainly an RFC to discuss what the major design choices will be.
I got a kernel with selftests for allocation, memory integrity, fork, partial munmap,
mprotect, reclaim and migration passing, and am running them with DEBUG_VM to make sure
we don't get VM bugs/warnings and that the numbers are good, so I just wanted to share
it upstream and get your opinions! Basically to try and trigger a discussion similar to
what Zi asked in [2], and also to see if someone could point out something fundamental
we are missing in this series.

Thanks for the reviews! Really do appreciate it!

[1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/#t
[2] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Lorenzo Stoakes 2 days, 16 hours ago
On Tue, Feb 03, 2026 at 05:00:10PM -0800, Usama Arif wrote:
>
>
> On 02/02/2026 03:20, Lorenzo Stoakes wrote:
> > OK so this is somewhat unexpected :)
> >
> > It would have been nice to discuss it in the THP cabal or at a conference
> > etc. so we could discuss approaches ahead of time. Communication is important,
> > especially with major changes like this.
>
> Makes sense!
>
> >
> > And PUD THP is especially problematic in that it requires pages that the page
> > allocator can't give us, presumably you're doing something with CMA and... it's
> > a whole kettle of fish.
>
> So we dont need CMA. It helps ofcourse, but we don't *need* it.
> Its summarized in the first reply I gave to Zi in [1]:
>
> >
> > It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
> > but it's kinda a weird sorta special case that we need to keep supporting.
> >
> > There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
> > mTHP (and really I want to see Nico's series land before we really consider
> > this).
>
>
> So I have numbers and experiments for page faults which are in the cover letter,
> but not for khugepaged. I would be very surprised (although pleasently :)) if
> khugepaged by some magic finds 262144 pages that meets all the khugepaged requirements
> to collapse the page. In the basic infrastructure support which this series is adding,
> I want to keep khugepaged collapse disabled for 1G pages. This is also the initial
> approach that was taken in other mTHP sizes. We should go slow with 1G THPs.

Yes we definitely want to limit to page faults for now.

But keep in mind that for that to be viable you'd surely need to update who gets
appropriate alignment in __get_unmapped_area()... I've not read through the series far
enough to see, so not sure if you update that though!

I guess that'd be the sanest place to start, if an allocation _size_ is aligned
1 GB, then align the unmapped area _address_ to 1 GB for maximum chance of 1 GB
fault-in.
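
Something like the below, purely as an illustrative sketch (helper names and
signatures will differ from what __get_unmapped_area() actually wants; this
just mirrors how __thp_get_unmapped_area() pads for PMD_SIZE today):

        /* If the request is a whole number of PUDs, search for a hole one
         * PUD_SIZE larger and round the result up, so the mapping starts
         * on a 1 GB boundary. */
        if (len >= PUD_SIZE && IS_ALIGNED(len, PUD_SIZE)) {
                unsigned long padded;

                padded = get_unmapped_area(filp, addr, len + PUD_SIZE,
                                           pgoff, flags);
                if (!IS_ERR_VALUE(padded))
                        return ALIGN(padded, PUD_SIZE);
        }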

Oh by the way I made some rough THP notes at
https://publish.obsidian.md/mm/Transparent+Huge+Pages+(THP) which are helpful
for reminding me about what does what where, useful for a top-down view of how
things are now.

>
> >
> > So overall, I want to be very cautious and SLOW here. So let's please not drop
> > the RFC tag until David and I are ok with that?
> >
> > Also the THP code base is in _dire_ need of rework, and I don't really want to
> > add major new features without us paying down some technical debt, to be honest.
> >
> > So let's proceed with caution, and treat this as a very early bit of
> > experimental code.
> >
> > Thanks, Lorenzo
>
> Ack, yeah so this is mainly an RFC to discuss what the major design choices will be.
> I got a kernel with selftests for allocation, memory integrity, fork, partial munmap,
> mprotect, reclaim and migration passing and am running them with DEBUG_VM to make sure
> we dont get the VM bugs/warnings and the numbers are good, so just wanted to share it
> upstream and get your opinions! Basically try and trigger a discussion similar to what
> Zi asked in [2]! And also if someone could point out if there is something fundamental
> we are missing in this series.

Well that's fair enough :)

But do come to a THP cabal so we can chat, face-to-face (ok, digital face to
digital face ;). It's usually a force-multiplier I find, esp. if multiple people
have input which I think is the case here. We're friendly :)

In any case, conversations are already kicking off so that's definitely positive!

I think we will definitely get there with this at _some point_ but I would urge
patience and also I really want to underline my desire for us in THP to start
paying down some of this technical debt.

I know people are already making efforts (Vernon, Luiz), and sorry that I've not
been great at review recently (should be gradually increasing over time), but I
feel that for large features to be added like this now we really do require some
refactoring work before we take it.

We definitely need to rebase this once Nico's series lands (should do next
cycle) and think about how it plays with this. I'm not sure if arm64 supports
mTHP between PMD and PUD size (Dev? Do you know?) so maybe that one is moot, but
in general I want to make sure it plays nice.

>
> Thanks for the reviews! Really do apprecaite it!

No worries! :)

>
> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/#t
> [2] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/

Cheers, Lorenzo
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Usama Arif 1 day, 21 hours ago

On 04/02/2026 03:08, Lorenzo Stoakes wrote:
> On Tue, Feb 03, 2026 at 05:00:10PM -0800, Usama Arif wrote:
>>
>>
>> On 02/02/2026 03:20, Lorenzo Stoakes wrote:
>>> OK so this is somewhat unexpected :)
>>>
>>> It would have been nice to discuss it in the THP cabal or at a conference
>>> etc. so we could discuss approaches ahead of time. Communication is important,
>>> especially with major changes like this.
>>
>> Makes sense!
>>
>>>
>>> And PUD THP is especially problematic in that it requires pages that the page
>>> allocator can't give us, presumably you're doing something with CMA and... it's
>>> a whole kettle of fish.
>>
>> So we dont need CMA. It helps ofcourse, but we don't *need* it.
>> Its summarized in the first reply I gave to Zi in [1]:
>>
>>>
>>> It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
>>> but it's kinda a weird sorta special case that we need to keep supporting.
>>>
>>> There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
>>> mTHP (and really I want to see Nico's series land before we really consider
>>> this).
>>
>>
>> So I have numbers and experiments for page faults which are in the cover letter,
>> but not for khugepaged. I would be very surprised (although pleasently :)) if
>> khugepaged by some magic finds 262144 pages that meets all the khugepaged requirements
>> to collapse the page. In the basic infrastructure support which this series is adding,
>> I want to keep khugepaged collapse disabled for 1G pages. This is also the initial
>> approach that was taken in other mTHP sizes. We should go slow with 1G THPs.
> 
> Yes we definitely want to limit to page faults for now.
> 
> But keep in mind for that to be viable you'd surely need to update who gets
> appropriate alignment in __get_unmapped_area()... not read through series far
> enough to see so not sure if you update that though!
> 
> I guess that'd be the sanest place to start, if an allocation _size_ is aligned
> 1 GB, then align the unmapped area _address_ to 1 GB for maximum chance of 1 GB
> fault-in.


Yeah this was definitely missing. I was manually aligning the fault address in the
selftests and benchmarks with the trick used in other selftests:
(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1))
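
As a standalone sketch (not the exact selftest code), the helper looks
roughly like this:

        #define _GNU_SOURCE
        #include <sys/mman.h>

        /* Over-allocate by PUD_SIZE, then round the pointer up to a 1G
         * boundary so the first fault can be served by a PUD mapping. */
        static void *alloc_pud_aligned(size_t len)
        {
                const size_t pud_size = 1UL << 30; /* PUD_SIZE, 4K base pages */
                void *addr = mmap(NULL, len + pud_size, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (addr == MAP_FAILED)
                        return NULL;
                return (void *)(((unsigned long)addr + pud_size - 1) &
                                ~(pud_size - 1));
        }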

Thanks for pointing this out! This is basically what I wanted from the RFC: to find out
what I am missing and not testing. Will look into VFIO and DAX as you mentioned as well.

> 
> Oh by the way I made some rough THP notes at
> https://publish.obsidian.md/mm/Transparent+Huge+Pages+(THP) which are helpful
> for reminding me about what does what where, useful for a top-down view of how
> things are now.
> 

Thanks!

>>
>>>
>>> So overall, I want to be very cautious and SLOW here. So let's please not drop
>>> the RFC tag until David and I are ok with that?
>>>
>>> Also the THP code base is in _dire_ need of rework, and I don't really want to
>>> add major new features without us paying down some technical debt, to be honest.
>>>
>>> So let's proceed with caution, and treat this as a very early bit of
>>> experimental code.
>>>
>>> Thanks, Lorenzo
>>
>> Ack, yeah so this is mainly an RFC to discuss what the major design choices will be.
>> I got a kernel with selftests for allocation, memory integrity, fork, partial munmap,
>> mprotect, reclaim and migration passing and am running them with DEBUG_VM to make sure
>> we dont get the VM bugs/warnings and the numbers are good, so just wanted to share it
>> upstream and get your opinions! Basically try and trigger a discussion similar to what
>> Zi asked in [2]! And also if someone could point out if there is something fundamental
>> we are missing in this series.
> 
> Well that's fair enough :)
> 
> But do come to a THP cabal so we can chat, face-to-face (ok, digital face to
> digital face ;). It's usually a force-multiplier I find, esp. if multiple people
> have input which I think is the case here. We're friendly :)


Yes, thanks for this! It would be really helpful to discuss this in a call. I didn't
know there was a meeting, but I have requested details (date/time) in another thread.

> 
> In any case, conversations are already kicking off so that's definitely positive!
> 
> I think we will definitely get there with this at _some point_ but I would urge
> patience and also I really want to underline my desire for us in THP to start
> paying down some of this technical debt.
> 
> I know people are already making efforts (Vernon, Luiz), and sorry that I've not
> been great at review recently (should be gradually increasing over time), but I
> feel that for large features to be added like this now we really do require some
> refactoring work before we take it.
> 

Yes, agreed! I will definitely need your and others' guidance on what needs to be
properly refactored so that this fits well with the current code.

> We definitely need to rebase this once Nico's series lands (should do next
> cycle) and think about how it plays with this, I'm not sure if arm64 supports
> mTHP between PMD and PUD size (Dev? Do you know?) so maybe that one is moot, but
> in general want to make sure it plays nice.
> 

Will do!


>>
>> Thanks for the reviews! Really do apprecaite it!
> 
> No worries! :)
> 
>>
>> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/#t
>> [2] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/
> 
> Cheers, Lorenzo
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Dev Jain 2 days, 15 hours ago
On 04/02/26 4:38 pm, Lorenzo Stoakes wrote:
> On Tue, Feb 03, 2026 at 05:00:10PM -0800, Usama Arif wrote:
>>
>> On 02/02/2026 03:20, Lorenzo Stoakes wrote:
>>> OK so this is somewhat unexpected :)
>>>
>>> It would have been nice to discuss it in the THP cabal or at a conference
>>> etc. so we could discuss approaches ahead of time. Communication is important,
>>> especially with major changes like this.
>> Makes sense!
>>
>>> And PUD THP is especially problematic in that it requires pages that the page
>>> allocator can't give us, presumably you're doing something with CMA and... it's
>>> a whole kettle of fish.
>> So we dont need CMA. It helps ofcourse, but we don't *need* it.
>> Its summarized in the first reply I gave to Zi in [1]:
>>
>>> It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
>>> but it's kinda a weird sorta special case that we need to keep supporting.
>>>
>>> There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
>>> mTHP (and really I want to see Nico's series land before we really consider
>>> this).
>>
>> So I have numbers and experiments for page faults which are in the cover letter,
>> but not for khugepaged. I would be very surprised (although pleasently :)) if
>> khugepaged by some magic finds 262144 pages that meets all the khugepaged requirements
>> to collapse the page. In the basic infrastructure support which this series is adding,
>> I want to keep khugepaged collapse disabled for 1G pages. This is also the initial
>> approach that was taken in other mTHP sizes. We should go slow with 1G THPs.
> Yes we definitely want to limit to page faults for now.
>
> But keep in mind for that to be viable you'd surely need to update who gets
> appropriate alignment in __get_unmapped_area()... not read through series far
> enough to see so not sure if you update that though!
>
> I guess that'd be the sanest place to start, if an allocation _size_ is aligned
> 1 GB, then align the unmapped area _address_ to 1 GB for maximum chance of 1 GB
> fault-in.
>
> Oh by the way I made some rough THP notes at
> https://publish.obsidian.md/mm/Transparent+Huge+Pages+(THP) which are helpful
> for reminding me about what does what where, useful for a top-down view of how
> things are now.
>
>>> So overall, I want to be very cautious and SLOW here. So let's please not drop
>>> the RFC tag until David and I are ok with that?
>>>
>>> Also the THP code base is in _dire_ need of rework, and I don't really want to
>>> add major new features without us paying down some technical debt, to be honest.
>>>
>>> So let's proceed with caution, and treat this as a very early bit of
>>> experimental code.
>>>
>>> Thanks, Lorenzo
>> Ack, yeah so this is mainly an RFC to discuss what the major design choices will be.
>> I got a kernel with selftests for allocation, memory integrity, fork, partial munmap,
>> mprotect, reclaim and migration passing and am running them with DEBUG_VM to make sure
>> we dont get the VM bugs/warnings and the numbers are good, so just wanted to share it
>> upstream and get your opinions! Basically try and trigger a discussion similar to what
>> Zi asked in [2]! And also if someone could point out if there is something fundamental
>> we are missing in this series.
> Well that's fair enough :)
>
> But do come to a THP cabal so we can chat, face-to-face (ok, digital face to
> digital face ;). It's usually a force-multiplier I find, esp. if multiple people
> have input which I think is the case here. We're friendly :)
>
> In any case, conversations are already kicking off so that's definitely positive!
>
> I think we will definitely get there with this at _some point_ but I would urge
> patience and also I really want to underline my desire for us in THP to start
> paying down some of this technical debt.
>
> I know people are already making efforts (Vernon, Luiz), and sorry that I've not
> been great at review recently (should be gradually increasing over time), but I
> feel that for large features to be added like this now we really do require some
> refactoring work before we take it.
>
> We definitely need to rebase this once Nico's series lands (should do next
> cycle) and think about how it plays with this, I'm not sure if arm64 supports
> mTHP between PMD and PUD size (Dev? Do you know?) so maybe that one is moot, but

arm64 does support cont mappings at the PMD level. Currently, they are supported
for kernel pagetables and hugetlb pages; you may search around for "CONT_PMD" in
the codebase. So arm64 only supports cont PMD in the "static" case; there is
no dynamic folding/unfolding of the cont bit at the PMD level, which mTHP requires.

I see that this patchset splits the PUD all the way down to PTEs. If we were to
split it down to PMDs, and add arm64 support for dynamic cont mappings at the PMD
level, it would be nicer. But I guess there is some mapcount/rmap stuff involved
here stopping us from doing that :(

> in general want to make sure it plays nice.
>
>> Thanks for the reviews! Really do apprecaite it!
> No worries! :)
>
>> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/#t
>> [2] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/
> Cheers, Lorenzo
>
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Dev Jain 2 days, 15 hours ago
On 04/02/26 5:20 pm, Dev Jain wrote:
> On 04/02/26 4:38 pm, Lorenzo Stoakes wrote:
>> On Tue, Feb 03, 2026 at 05:00:10PM -0800, Usama Arif wrote:
>>> On 02/02/2026 03:20, Lorenzo Stoakes wrote:
>>>> OK so this is somewhat unexpected :)
>>>>
>>>> It would have been nice to discuss it in the THP cabal or at a conference
>>>> etc. so we could discuss approaches ahead of time. Communication is important,
>>>> especially with major changes like this.
>>> Makes sense!
>>>
>>>> And PUD THP is especially problematic in that it requires pages that the page
>>>> allocator can't give us, presumably you're doing something with CMA and... it's
>>>> a whole kettle of fish.
>>> So we dont need CMA. It helps ofcourse, but we don't *need* it.
>>> Its summarized in the first reply I gave to Zi in [1]:
>>>
>>>> It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
>>>> but it's kinda a weird sorta special case that we need to keep supporting.
>>>>
>>>> There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
>>>> mTHP (and really I want to see Nico's series land before we really consider
>>>> this).
>>> So I have numbers and experiments for page faults which are in the cover letter,
>>> but not for khugepaged. I would be very surprised (although pleasently :)) if
>>> khugepaged by some magic finds 262144 pages that meets all the khugepaged requirements
>>> to collapse the page. In the basic infrastructure support which this series is adding,
>>> I want to keep khugepaged collapse disabled for 1G pages. This is also the initial
>>> approach that was taken in other mTHP sizes. We should go slow with 1G THPs.
>> Yes we definitely want to limit to page faults for now.
>>
>> But keep in mind for that to be viable you'd surely need to update who gets
>> appropriate alignment in __get_unmapped_area()... not read through series far
>> enough to see so not sure if you update that though!
>>
>> I guess that'd be the sanest place to start, if an allocation _size_ is aligned
>> 1 GB, then align the unmapped area _address_ to 1 GB for maximum chance of 1 GB
>> fault-in.
>>
>> Oh by the way I made some rough THP notes at
>> https://publish.obsidian.md/mm/Transparent+Huge+Pages+(THP) which are helpful
>> for reminding me about what does what where, useful for a top-down view of how
>> things are now.
>>
>>>> So overall, I want to be very cautious and SLOW here. So let's please not drop
>>>> the RFC tag until David and I are ok with that?
>>>>
>>>> Also the THP code base is in _dire_ need of rework, and I don't really want to
>>>> add major new features without us paying down some technical debt, to be honest.
>>>>
>>>> So let's proceed with caution, and treat this as a very early bit of
>>>> experimental code.
>>>>
>>>> Thanks, Lorenzo
>>> Ack, yeah so this is mainly an RFC to discuss what the major design choices will be.
>>> I got a kernel with selftests for allocation, memory integrity, fork, partial munmap,
>>> mprotect, reclaim and migration passing and am running them with DEBUG_VM to make sure
>>> we dont get the VM bugs/warnings and the numbers are good, so just wanted to share it
>>> upstream and get your opinions! Basically try and trigger a discussion similar to what
>>> Zi asked in [2]! And also if someone could point out if there is something fundamental
>>> we are missing in this series.
>> Well that's fair enough :)
>>
>> But do come to a THP cabal so we can chat, face-to-face (ok, digital face to
>> digital face ;). It's usually a force-multiplier I find, esp. if multiple people
>> have input which I think is the case here. We're friendly :)
>>
>> In any case, conversations are already kicking off so that's definitely positive!
>>
>> I think we will definitely get there with this at _some point_ but I would urge
>> patience and also I really want to underline my desire for us in THP to start
>> paying down some of this technical debt.
>>
>> I know people are already making efforts (Vernon, Luiz), and sorry that I've not
>> been great at review recently (should be gradually increasing over time), but I
>> feel that for large features to be added like this now we really do require some
>> refactoring work before we take it.
>>
>> We definitely need to rebase this once Nico's series lands (should do next
>> cycle) and think about how it plays with this, I'm not sure if arm64 supports
>> mTHP between PMD and PUD size (Dev? Do you know?) so maybe that one is moot, but
> arm64 does support cont mappings at the PMD level. Currently, they are supported
> for kernel pagetables, and hugetlbpages. You may search around for "CONT_PMD" in
> the codebase. Hence it only supports cont PMD in the "static" case, there is
> no dynamic folding/unfolding of the cont bit at the PMD level, which mTHP requires.
>
> I see that this patchset splits PUD all the way down to PTEs. If we were to split
> it down to PMD, and add arm64 support for dynamic cont mappings at the PMD level,
> it will be nicer. But I guess there is some mapcount/rmap stuff involved
> here stopping us from doing that :(

Hmm, this won't make a difference w.r.t. cont PMD. If we were to split a PUD folio
down to PMD folios, we wouldn't get cont PMD. But yes, in general PMD mappings
are nicer.

>
>> in general want to make sure it plays nice.
>>
>>> Thanks for the reviews! Really do apprecaite it!
>> No worries! :)
>>
>>> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/#t
>>> [2] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/
>> Cheers, Lorenzo
>>
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Matthew Wilcox 4 days, 23 hours ago
On Sun, Feb 01, 2026 at 04:50:17PM -0800, Usama Arif wrote:
> This is an RFC series to implement 1GB PUD-level THPs, allowing
> applications to benefit from reduced TLB pressure without requiring
> hugetlbfs. The patches are based on top of
> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).

I suggest this has not had enough testing.  There are dozens of places
in the MM which assume that if a folio is at least PMD size then it is
exactly PMD size.  Everywhere that calls folio_test_pmd_mappable() needs
to be audited to make sure that it will work properly if the folio is
larger than PMD size.

zap_pmd_range() for example.  Or finish_fault():

                page = vmf->page;
(can be any page within the folio)
        folio = page_folio(page);
        if (pmd_none(*vmf->pmd)) {
                if (!needs_fallback && folio_test_pmd_mappable(folio)) {
                        ret = do_set_pmd(vmf, folio, page);

then do_set_pmd() does:

        if (folio_order(folio) != HPAGE_PMD_ORDER)
                return ret;
        page = &folio->page;

so that check needs to be changed, and then we need to select the
appropriate page within the folio rather than just the first page
of the folio.  And then after the call:

        entry = folio_mk_pmd(folio, vma->vm_page_prot);

we need to adjust entry to point to the appropriate PMD-sized range
within the folio.
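
In sketch form those adjustments might look something like the below
(illustrative only, not tested and not from the series):

        if (folio_order(folio) < HPAGE_PMD_ORDER)
                return ret;
        /* use the PMD-aligned page within the folio that covers vmf->page,
         * rather than always the first page of the folio */
        page = folio_page(folio, round_down(folio_page_idx(folio, vmf->page),
                                            HPAGE_PMD_NR));

and the PMD entry then needs to be built from that page rather than from the
folio head, so that it maps the correct PMD-sized range within the folio.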
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by David Hildenbrand (arm) 4 days, 18 hours ago
On 2/2/26 05:00, Matthew Wilcox wrote:
> On Sun, Feb 01, 2026 at 04:50:17PM -0800, Usama Arif wrote:
>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>> applications to benefit from reduced TLB pressure without requiring
>> hugetlbfs. The patches are based on top of
>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
> 
> I suggest this has not had enough testing.  There are dozens of places
> in the MM which assume that if a folio is at leaast PMD size then it is
> exactly PMD size.  Everywhere that calls folio_test_pmd_mappable() needs
> to be audited to make sure that it will work properly if the folio is
> larger than PMD size.

I think the hack (ehm trick) in this patch set is to do it just like dax 
PUDs: only map through a PUD or through PTEs, not through PMDs.

That also avoids dealing with mapcounts until I've sorted that out.

-- 
Cheers

David
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Usama Arif 3 days, 6 hours ago

On 02/02/2026 01:06, David Hildenbrand (arm) wrote:
> On 2/2/26 05:00, Matthew Wilcox wrote:
>> On Sun, Feb 01, 2026 at 04:50:17PM -0800, Usama Arif wrote:
>>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>>> applications to benefit from reduced TLB pressure without requiring
>>> hugetlbfs. The patches are based on top of
>>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
>>
>> I suggest this has not had enough testing.  There are dozens of places
>> in the MM which assume that if a folio is at leaast PMD size then it is
>> exactly PMD size.  Everywhere that calls folio_test_pmd_mappable() needs
>> to be audited to make sure that it will work properly if the folio is
>> larger than PMD size.
> 
> I think the hack (ehm trick) in this patch set is to do it just like dax PUDs: only map through a PUD or through PTEs, not through PMDs.
> 
> That also avoids dealing with mapcounts until I sorted that out.
> 


Hello!

Thanks for the review! So it's as David said: currently, for the PUD THP case, we
won't run into those paths.
A PUD is split via TTU_SPLIT_HUGE_PUD, which calls __split_huge_pud_locked().
This splits the PUD to PTEs directly (not PMDs), so we never have a PUD folio
going through do_set_pmd(). The anonymous fault path uses
do_huge_pud_anonymous_page(), so we won't go through finish_fault().

When I started working on this, I was really hoping that we could split PUDs to PMDs,
but very quickly realised that's a separate and much more complicated mapcount problem
(which is probably why David is dealing with it, as he mentioned in his reply :P)
and should not be dealt with in this series.

In terms of more testing, I would definitely like to add more.
I have added selftests for allocation, memory integrity, fork, partial munmap, mprotect,
reclaim and migration, and am running them with DEBUG_VM to make sure we don't get VM
bugs/warnings, but I am sure I am missing paths. I will try to think of more,
but please let me know if there are more cases we can come up with.




Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Rik van Riel 5 days ago
On Sun, 2026-02-01 at 16:50 -0800, Usama Arif wrote:
> 
> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages
> at boot
>    or runtime, taking memory away. This requires capacity planning,
>    administrative overhead, and makes workload orchastration much
> much more
>    complex, especially colocating with workloads that don't use
> hugetlbfs.
> 
To address the obvious objection "but how could we
possibly allocate 1GB huge pages while the workload
is running?", I am planning to pick up the CMA balancing 
patch series (thank you, Frank) and get that in an 
upstream ready shape soon.

https://lkml.org/2025/9/15/1735

That patch set looks like another case where no
amount of internal testing will find every single
corner case, and we'll probably just want to
merge it upstream, deploy it experimentally, and
aggressively deal with anything that might pop up.

With CMA balancing, it would be possible to just
have half (or even more) of system memory for
movable allocations only, which would make it possible
to allocate 1GB huge pages dynamically.

-- 
All Rights Reversed.
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Lorenzo Stoakes 4 days, 15 hours ago
On Sun, Feb 01, 2026 at 09:44:12PM -0500, Rik van Riel wrote:
> On Sun, 2026-02-01 at 16:50 -0800, Usama Arif wrote:
> >
> > 1. Static Reservation: hugetlbfs requires pre-allocating huge pages
> > at boot
> >    or runtime, taking memory away. This requires capacity planning,
> >    administrative overhead, and makes workload orchastration much
> > much more
> >    complex, especially colocating with workloads that don't use
> > hugetlbfs.
> >
> To address the obvious objection "but how could we
> possibly allocate 1GB huge pages while the workload
> is running?", I am planning to pick up the CMA balancing
> patch series (thank you, Frank) and get that in an
> upstream ready shape soon.
>
> https://lkml.org/2025/9/15/1735

That link doesn't work?

Did a quick search for CMA balancing on lore, couldn't find anything, could you
provide a lore link?

>
> That patch set looks like another case where no
> amount of internal testing will find every single
> corner case, and we'll probably just want to
> merge it upstream, deploy it experimentally, and
> aggressively deal with anything that might pop up.

I'm not really in favour of this kind of approach. There's plenty of things that
were considered 'temporary' upstream that became rather permanent :)

Maybe we can't cover all corner-cases, but we need to make sure whatever we do
send upstream is maintainable, conceptually sensible and doesn't paint us into
any corners, etc.

>
> With CMA balancing, it would be possibly to just
> have half (or even more) of system memory for
> movable allocations only, which would make it possible
> to allocate 1GB huge pages dynamically.

Could you expand on that?

>
> --
> All Rights Reversed.

Thanks, Lorenzo
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Zi Yan 4 days, 11 hours ago
On 2 Feb 2026, at 6:30, Lorenzo Stoakes wrote:

> On Sun, Feb 01, 2026 at 09:44:12PM -0500, Rik van Riel wrote:
>> On Sun, 2026-02-01 at 16:50 -0800, Usama Arif wrote:
>>>
>>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages
>>> at boot
>>>    or runtime, taking memory away. This requires capacity planning,
>>>    administrative overhead, and makes workload orchastration much
>>> much more
>>>    complex, especially colocating with workloads that don't use
>>> hugetlbfs.
>>>
>> To address the obvious objection "but how could we
>> possibly allocate 1GB huge pages while the workload
>> is running?", I am planning to pick up the CMA balancing
>> patch series (thank you, Frank) and get that in an
>> upstream ready shape soon.
>>
>> https://lkml.org/2025/9/15/1735
>
> That link doesn't work?
>
> Did a quick search for CMA balancing on lore, couldn't find anything, could you
> provide a lore link?

https://lwn.net/Articles/1038263/

>
>>
>> That patch set looks like another case where no
>> amount of internal testing will find every single
>> corner case, and we'll probably just want to
>> merge it upstream, deploy it experimentally, and
>> aggressively deal with anything that might pop up.
>
> I'm not really in favour of this kind of approach. There's plenty of things that
> were considered 'temporary' upstream that became rather permanent :)
>
> Maybe we can't cover all corner-cases, but we need to make sure whatever we do
> send upstream is maintainable, conceptually sensible and doesn't paint us into
> any corners, etc.
>
>>
>> With CMA balancing, it would be possibly to just
>> have half (or even more) of system memory for
>> movable allocations only, which would make it possible
>> to allocate 1GB huge pages dynamically.
>
> Could you expand on that?

I also would like to hear David’s opinion on using CMA for 1GB THP.
He did not like it[1] when I posted my patch back in 2020, but it has
been more than 5 years. :)

The other direction I explored is to get 1GB THPs from the buddy allocator.
That means we need to:
1. bump MAX_PAGE_ORDER to 18 or make it a runtime variable so that only 1GB
   THP users need to bump it,
2. handle cross memory section PFN merge in buddy allocator,
3. improve anti-fragmentation mechanism for 1GB range compaction.

1 is easier-ish[2]. I have not looked into 2 and 3 much yet.

[1] https://lore.kernel.org/all/52bc2d5d-eb8a-83de-1c93-abd329132d58@redhat.com/
[2] https://lore.kernel.org/all/20210805190253.2795604-1-zi.yan@sent.com/


Best Regards,
Yan, Zi
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by David Hildenbrand (arm) 1 day, 16 hours ago
On 2/2/26 16:50, Zi Yan wrote:
> On 2 Feb 2026, at 6:30, Lorenzo Stoakes wrote:
> 
>> On Sun, Feb 01, 2026 at 09:44:12PM -0500, Rik van Riel wrote:
>>> To address the obvious objection "but how could we
>>> possibly allocate 1GB huge pages while the workload
>>> is running?", I am planning to pick up the CMA balancing
>>> patch series (thank you, Frank) and get that in an
>>> upstream ready shape soon.
>>>
>>> https://lkml.org/2025/9/15/1735
>>
>> That link doesn't work?
>>
>> Did a quick search for CMA balancing on lore, couldn't find anything, could you
>> provide a lore link?
> 
> https://lwn.net/Articles/1038263/
> 
>>
>>>
>>> That patch set looks like another case where no
>>> amount of internal testing will find every single
>>> corner case, and we'll probably just want to
>>> merge it upstream, deploy it experimentally, and
>>> aggressively deal with anything that might pop up.
>>
>> I'm not really in favour of this kind of approach. There's plenty of things that
>> were considered 'temporary' upstream that became rather permanent :)
>>
>> Maybe we can't cover all corner-cases, but we need to make sure whatever we do
>> send upstream is maintainable, conceptually sensible and doesn't paint us into
>> any corners, etc.
>>
>>>
>>> With CMA balancing, it would be possibly to just
>>> have half (or even more) of system memory for
>>> movable allocations only, which would make it possible
>>> to allocate 1GB huge pages dynamically.
>>
>> Could you expand on that?
> 
> I also would like to hear David’s opinion on using CMA for 1GB THP.
> He did not like it[1] when I posted my patch back in 2020, but it has
> been more than 5 years. :)

Hehe, not particularly excited about that.

We really have to avoid short-term hacks by any means. We have enough of 
that in THP land already.

We talked about challenges in the past like:
* Controlling who gets to allocate them.
* Having a reasonable swap/migration mechanism
* Reliably allocating them without hacks, while being future-proof
* Long-term pinning them when they are actually on ZONE_MOVABLE or CMA
   (the latter could be made to work but requires thought)

I agree with Lorenzo that this RFC is a bit surprising, because I assume 
none of the real challenges were tackled.

Having that said, it will take me some time to come back to this RFC 
here, other stuff that piled up is more urgent and more important.

But I'll note that we really have to clean up the THP mess before we add 
more stuff on top of it.

For example, I still wonder whether we can just stop pre-allocating page 
tables for THPs and instead let code fail+retry in case we cannot remap 
the page. I wanted to look into the details a long time ago but never 
got to it.

Avoiding that would make the remapping much easier; and we should then 
remap from PUD->PMD->PTEs.

Implementing 1 GiB support for shmem might be a reasonable first step, 
before we start digging into the anonymous memory land with all these 
nasty things involved.

-- 
Cheers,

David
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Lorenzo Stoakes 2 days, 16 hours ago
On Mon, Feb 02, 2026 at 10:50:35AM -0500, Zi Yan wrote:
> On 2 Feb 2026, at 6:30, Lorenzo Stoakes wrote:
>
> > On Sun, Feb 01, 2026 at 09:44:12PM -0500, Rik van Riel wrote:
> >> On Sun, 2026-02-01 at 16:50 -0800, Usama Arif wrote:
> >>>
> >>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages
> >>> at boot
> >>>    or runtime, taking memory away. This requires capacity planning,
> >>>    administrative overhead, and makes workload orchastration much
> >>> much more
> >>>    complex, especially colocating with workloads that don't use
> >>> hugetlbfs.
> >>>
> >> To address the obvious objection "but how could we
> >> possibly allocate 1GB huge pages while the workload
> >> is running?", I am planning to pick up the CMA balancing
> >> patch series (thank you, Frank) and get that in an
> >> upstream ready shape soon.
> >>
> >> https://lkml.org/2025/9/15/1735
> >
> > That link doesn't work?
> >
> > Did a quick search for CMA balancing on lore, couldn't find anything, could you
> > provide a lore link?
>
> https://lwn.net/Articles/1038263/
>
> >
> >>
> >> That patch set looks like another case where no
> >> amount of internal testing will find every single
> >> corner case, and we'll probably just want to
> >> merge it upstream, deploy it experimentally, and
> >> aggressively deal with anything that might pop up.
> >
> > I'm not really in favour of this kind of approach. There's plenty of things that
> > were considered 'temporary' upstream that became rather permanent :)
> >
> > Maybe we can't cover all corner-cases, but we need to make sure whatever we do
> > send upstream is maintainable, conceptually sensible and doesn't paint us into
> > any corners, etc.
> >
> >>
> >> With CMA balancing, it would be possibly to just
> >> have half (or even more) of system memory for
> >> movable allocations only, which would make it possible
> >> to allocate 1GB huge pages dynamically.
> >
> > Could you expand on that?
>
> I also would like to hear David’s opinion on using CMA for 1GB THP.
> He did not like it[1] when I posted my patch back in 2020, but it has
> been more than 5 years. :)

Yes please David :)

I find the idea of using the CMA for this a bit gross. And I fear we're
essentially expanding the hacks for DAX to everyone.

Again I really feel that we should be tackling technical debt here, rather
than adding features on shaky foundations and just making things worse.

We are inundated with series-after-series for THP trying to add features
but really not very many that are tackling this debt, and I think it's time
to get firmer about that.

>
> The other direction I explored is to get 1GB THP from buddy allocator.
> That means we need to:
> 1. bump MAX_PAGE_ORDER to 18 or make it a runtime variable so that only 1GB
>    THP users need to bump it,

Would we need to bump the page block size too to stand more of a chance of
avoiding fragmentation?

Doing that though would result in reserves being way higher and thus more
memory used and we'd be in the territory of the unresolved issues with 64
KB page size kernels :)

> 2. handle cross memory section PFN merge in buddy allocator,

Ugh god...

> 3. improve anti-fragmentation mechanism for 1GB range compaction.

I think we'd really need something like this. Obviously there's the series
Rik refers to.

I mean CMA itself feels like a hack, though efforts are being made to at
least make it more robust (series mentioned, also the guaranteed CMA stuff
from Suren).

>
> 1 is easier-ish[2]. I have not looked into 2 and 3 much yet.
>
> [1] https://lore.kernel.org/all/52bc2d5d-eb8a-83de-1c93-abd329132d58@redhat.com/
> [2] https://lore.kernel.org/all/20210805190253.2795604-1-zi.yan@sent.com/
>
>
> Best Regards,
> Yan, Zi

Cheers, Lorenzo
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by David Hildenbrand (arm) 1 day, 15 hours ago
On 2/4/26 11:56, Lorenzo Stoakes wrote:
> On Mon, Feb 02, 2026 at 10:50:35AM -0500, Zi Yan wrote:
>> On 2 Feb 2026, at 6:30, Lorenzo Stoakes wrote:
>>
>>>
>>> That link doesn't work?
>>>
>>> Did a quick search for CMA balancing on lore, couldn't find anything, could you
>>> provide a lore link?
>>
>> https://lwn.net/Articles/1038263/
>>
>>>
>>>
>>> I'm not really in favour of this kind of approach. There's plenty of things that
>>> were considered 'temporary' upstream that became rather permanent :)
>>>
>>> Maybe we can't cover all corner-cases, but we need to make sure whatever we do
>>> send upstream is maintainable, conceptually sensible and doesn't paint us into
>>> any corners, etc.
>>>
>>>
>>> Could you expand on that?
>>
>> I also would like to hear David’s opinion on using CMA for 1GB THP.
>> He did not like it[1] when I posted my patch back in 2020, but it has
>> been more than 5 years. :)
> 
> Yes please David :)

Heh, read Zi's mail first :)

> 
> I find the idea of using the CMA for this a bit gross. And I fear we're
> essentially expanding the hacks for DAX to everyone.

Jup.

> 
> Again I really feel that we should be tackling technical debt here, rather
> than adding features on shaky foundations and just making things worse.
> 

Jup.

> We are inundated with series-after-series for THP trying to add features
> but really not very many that are tackling this debt, and I think it's time
> to get firmer about that.

Almost nobody wants to do cleanups because there is the belief that only 
features are important; and some companies seem to value features more 
than cleanups when it comes to promotions etc.

And cleanups in that area are hard, because you'll very likely just 
break stuff because it's all so weirdly interconnected.

See max_ptes_none discussion ...

> 
>>
>> The other direction I explored is to get 1GB THP from buddy allocator.
>> That means we need to:
>> 1. bump MAX_PAGE_ORDER to 18 or make it a runtime variable so that only 1GB
>>     THP users need to bump it,
> 
> Would we need to bump the page block size too to stand more of a chance of
> avoiding fragmentation?

We discussed one idea of another level of anti-fragmentation on top (I 
forget what we called it, essentially bigger blocks that group pages in 
the buddy). But implementing that is non-trivial.

But long-term we really need something better than pageblocks and using 
hacky CMA reservations for anything larger.

-- 
Cheers,

David
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Zi Yan 4 days, 11 hours ago
On 1 Feb 2026, at 19:50, Usama Arif wrote:

> This is an RFC series to implement 1GB PUD-level THPs, allowing
> applications to benefit from reduced TLB pressure without requiring
> hugetlbfs. The patches are based on top of
> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).

It is nice to see you are working on 1GB THP.

>
> Motivation: Why 1GB THP over hugetlbfs?
> =======================================
>
> While hugetlbfs provides 1GB huge pages today, it has significant limitations
> that make it unsuitable for many workloads:
>
> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>    or runtime, taking memory away. This requires capacity planning,
>    administrative overhead, and makes workload orchastration much much more
>    complex, especially colocating with workloads that don't use hugetlbfs.

But you are using CMA, the same allocation mechanism as hugetlb_cma. What
is the difference?

>
> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>    rather than falling back to smaller pages. This makes it fragile under
>    memory pressure.

True.

>
> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
>    is needed, leading to memory waste and preventing partial reclaim.

Since you have a PUD THP implementation, have you run any workload on it?
How often do you see a PUD THP split?

Oh, you actually ran 512MB THP on ARM64 (I saw it below); do you have
any split stats to show the necessity of THP splitting?

>
> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
>    be easily shared with regular memory pools.

True.

>
> PUD THP solves these limitations by integrating 1GB pages into the existing
> THP infrastructure.

The main advantage of PUD THP over hugetlb is that it can be split and mapped
at sub-folio level. Do you have any data to support the necessity of these?
I wonder if it would be easier to just support 1GB folios in core-mm first,
and add 1GB THP split and sub-folio mapping later. With that, we
can move hugetlb users to 1GB folios.

BTW, without split support, you can apply HVO to 1GB folio to save memory.
That is a disadvantage of PUD THP. Have you taken that into consideration?
Basically, switching from hugetlb to PUD THP, you will lose memory due
to vmemmap usage.

>
> Performance Results
> ===================
>
> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>
> Test: True Random Memory Access [1] test of 4GB memory region with pointer
> chasing workload (4M random pointer dereferences through memory):
>
> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
> |-------------------|---------------|---------------|--------------|
> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
>
> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
> For long-running workloads this will be a one-off cost, and the 34%
> improvement in access latency provides significant benefit.
>
> ARM with 64K PAGE_SZIE supports 512M PMD THPs. In meta, we have a CPU
> bound workload running on a large number of ARM servers (256G). I enabled
> the 512M THP settings to always for a 100 servers in production (didn't
> really have high expectations :)). The average memory used for the workload
> increased from 217G to 233G. The amount of memory backed by 512M pages was
> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
> by 5.9% (This is a very significant improvment in workload performance).
> A significant number of these THPs were faulted in at application start when
> were present across different VMAs. Ofcourse getting these 512M pages is
> easier on ARM due to bigger PAGE_SIZE and pageblock order.
>
> I am hoping that these patches for 1G THP can be used to provide similar
> benefits for x86. I expect workloads to fault them in at start time when there
> is plenty of free memory available.
>
>
> Previous attempt by Zi Yan
> ==========================
>
> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
> significant changes in kernel since then, including folio conversion, mTHP
> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
> code as reference for making 1G PUD THP work. I am hoping Zi can provide
> guidance on these patches!

I am more than happy to help you. :)

>
> Major Design Decisions
> ======================
>
> 1. No shared 1G zero page: The memory cost would be quite significant!
>
> 2. Page Table Pre-deposit Strategy
>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>    page tables (one for each potential PMD entry after split).
>    We allocate a PMD page table and use its pmd_huge_pte list to store
>    the deposited PTE tables. This ensures split operations don't fail due
>    to page table allocation failures (at the cost of 2M per PUD THP)
>
> 3. Split to Base Pages
>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
>    directly to base pages (262,144 PTEs). The ideal thing would be to split
>    to 2M pages and then to 4K pages if needed. However, this would require
>    significant rmap and mapcount tracking changes.
>
> 4. COW and fork handling via split
>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
>    uses existing PTE-level COW infrastructure. Getting another 1G region is
>    hard and could fail. If only a 4K is written, copying 1G is a waste.
>    Probably this should only be done on CoW and not fork?
>
> 5. Migration via split
>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
>    to find a 1G continguous memory to migrate to. Maybe its better to not
>    allow migration of PUDs at all? I am more tempted to not allow migration,
>    but have kept splitting in this RFC.

Without migration, PUD THP loses its flexibility and transparency. But with
its 1GB size, I also wonder what the purpose of PUD THP migration can be.
It does not create memory fragmentation, since it is the largest folio size
we have and contiguous. NUMA balancing 1GB THP seems too much work.

BTW, I posted many questions, but that does not mean I object to the patchset.
I just want to understand your use case better, reduce unnecessary
code changes, and hopefully get it upstreamed this time. :)

Thank you for the work.

>
>
> Reviewers guide
> ===============
>
> Most of the code is written by adapting from PMD code. For e.g. the PUD page
> fault path is very similar to PMD. The difference is no shared zero page and
> the page table deposit strategy. I think the easiest way to review this series
> is to compare with PMD code.
>
> Test results
> ============
>
>   1..7
>   # Starting 7 tests from 1 test cases.
>   #  RUN           pud_thp.basic_allocation ...
>   # pud_thp_test.c:169:basic_allocation:PUD THP allocated (anon_fault_alloc: 0 -> 1)
>   #            OK  pud_thp.basic_allocation
>   ok 1 pud_thp.basic_allocation
>   #  RUN           pud_thp.read_write_access ...
>   #            OK  pud_thp.read_write_access
>   ok 2 pud_thp.read_write_access
>   #  RUN           pud_thp.fork_cow ...
>   # pud_thp_test.c:236:fork_cow:Fork COW completed (thp_split_pud: 0 -> 1)
>   #            OK  pud_thp.fork_cow
>   ok 3 pud_thp.fork_cow
>   #  RUN           pud_thp.partial_munmap ...
>   # pud_thp_test.c:267:partial_munmap:Partial munmap completed (thp_split_pud: 1 -> 2)
>   #            OK  pud_thp.partial_munmap
>   ok 4 pud_thp.partial_munmap
>   #  RUN           pud_thp.mprotect_split ...
>   # pud_thp_test.c:293:mprotect_split:mprotect split completed (thp_split_pud: 2 -> 3)
>   #            OK  pud_thp.mprotect_split
>   ok 5 pud_thp.mprotect_split
>   #  RUN           pud_thp.reclaim_pageout ...
>   # pud_thp_test.c:322:reclaim_pageout:Reclaim completed (thp_split_pud: 3 -> 4)
>   #            OK  pud_thp.reclaim_pageout
>   ok 6 pud_thp.reclaim_pageout
>   #  RUN           pud_thp.migration_mbind ...
>   # pud_thp_test.c:356:migration_mbind:Migration completed (thp_split_pud: 4 -> 5)
>   #            OK  pud_thp.migration_mbind
>   ok 7 pud_thp.migration_mbind
>   # PASSED: 7 / 7 tests passed.
>   # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> [1] https://gist.github.com/uarif1/bf279b2a01a536cda945ff9f40196a26
> [2] https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> Usama Arif (12):
>   mm: add PUD THP ptdesc and rmap support
>   mm/thp: add mTHP stats infrastructure for PUD THP
>   mm: thp: add PUD THP allocation and fault handling
>   mm: thp: implement PUD THP split to PTE level
>   mm: thp: add reclaim and migration support for PUD THP
>   selftests/mm: add PUD THP basic allocation test
>   selftests/mm: add PUD THP read/write access test
>   selftests/mm: add PUD THP fork COW test
>   selftests/mm: add PUD THP partial munmap test
>   selftests/mm: add PUD THP mprotect split test
>   selftests/mm: add PUD THP reclaim test
>   selftests/mm: add PUD THP migration test
>
>  include/linux/huge_mm.h                   |  60 ++-
>  include/linux/mm.h                        |  19 +
>  include/linux/mm_types.h                  |   5 +-
>  include/linux/pgtable.h                   |   8 +
>  include/linux/rmap.h                      |   7 +-
>  mm/huge_memory.c                          | 535 +++++++++++++++++++++-
>  mm/internal.h                             |   3 +
>  mm/memory.c                               |   8 +-
>  mm/migrate.c                              |  17 +
>  mm/page_vma_mapped.c                      |  35 ++
>  mm/pgtable-generic.c                      |  83 ++++
>  mm/rmap.c                                 |  96 +++-
>  mm/vmscan.c                               |   2 +
>  tools/testing/selftests/mm/Makefile       |   1 +
>  tools/testing/selftests/mm/pud_thp_test.c | 360 +++++++++++++++
>  15 files changed, 1197 insertions(+), 42 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/pud_thp_test.c
>
> -- 
> 2.47.3


Best Regards,
Yan, Zi
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Usama Arif 3 days, 3 hours ago

On 02/02/2026 08:24, Zi Yan wrote:
> On 1 Feb 2026, at 19:50, Usama Arif wrote:
> 
>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>> applications to benefit from reduced TLB pressure without requiring
>> hugetlbfs. The patches are based on top of
>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
> 
> It is nice to see you are working on 1GB THP.
> 
>>
>> Motivation: Why 1GB THP over hugetlbfs?
>> =======================================
>>
>> While hugetlbfs provides 1GB huge pages today, it has significant limitations
>> that make it unsuitable for many workloads:
>>
>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>>    or runtime, taking memory away. This requires capacity planning,
>>    administrative overhead, and makes workload orchastration much much more
>>    complex, especially colocating with workloads that don't use hugetlbfs.
> 
> But you are using CMA, the same allocation mechanism as hugetlb_cma. What
> is the difference?
> 

So we don't really need to use CMA. CMA can of course help a lot, but we don't *need* it.
For example, I can run the very simple case [1] of trying to get 1G pages on an upstream
kernel without CMA on my server and it works. The server has been up for more than a week
(so it's pretty fragmented), is running a bunch of stuff in the background, uses 0 CMA memory,
and I was able to get 20x1G pages on it.
It uses folio_alloc_gigantic(), which is exactly what this series uses:

$ uptime -p
up 1 week, 3 days, 5 hours, 7 minutes
$ cat /proc/meminfo | grep -i cma
CmaTotal:              0 kB
CmaFree:               0 kB
$ echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
20
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
20
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.0Ti       142Gi       292Gi       143Mi       583Gi       868Gi
Swap:          129Gi       3.5Gi       126Gi
$ ./map_1g_hugepages 
Mapping 20 x 1GB huge pages (20 GB total)
Mapped at 0x7f43c0000000
Touched page 0 at 0x7f43c0000000
Touched page 1 at 0x7f4400000000
Touched page 2 at 0x7f4440000000
Touched page 3 at 0x7f4480000000
Touched page 4 at 0x7f44c0000000
Touched page 5 at 0x7f4500000000
Touched page 6 at 0x7f4540000000
Touched page 7 at 0x7f4580000000
Touched page 8 at 0x7f45c0000000
Touched page 9 at 0x7f4600000000
Touched page 10 at 0x7f4640000000
Touched page 11 at 0x7f4680000000
Touched page 12 at 0x7f46c0000000
Touched page 13 at 0x7f4700000000
Touched page 14 at 0x7f4740000000
Touched page 15 at 0x7f4780000000
Touched page 16 at 0x7f47c0000000
Touched page 17 at 0x7f4800000000
Touched page 18 at 0x7f4840000000
Touched page 19 at 0x7f4880000000
Unmapped successfully
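
For reference, such a test boils down to roughly the sketch below (my own
approximation; the actual program behind [1] may differ): map N x 1GB hugetlb
pages with mmap(MAP_HUGETLB | MAP_HUGE_1GB) and touch one byte in each.

/* Rough sketch of a map_1g_hugepages-style test (the actual program behind
 * [1] may differ): map N x 1GB hugetlb pages and touch one byte in each.
 * Requires hugepages-1048576kB/nr_hugepages >= N, as in the transcript above. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* log2(1GB) << MAP_HUGE_SHIFT */
#endif

int main(int argc, char **argv)
{
	size_t npages = argc > 1 ? strtoul(argv[1], NULL, 0) : 20;
	size_t len = npages << 30;
	char *p;

	printf("Mapping %zu x 1GB huge pages (%zu GB total)\n", npages, npages);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		 -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("Mapped at %p\n", p);

	for (size_t i = 0; i < npages; i++) {
		p[i << 30] = 1;		/* fault in one 1GB page */
		printf("Touched page %zu at %p\n", i, p + (i << 30));
	}

	munmap(p, len);
	printf("Unmapped successfully\n");
	return 0;
}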

>>
>> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>>    rather than falling back to smaller pages. This makes it fragile under
>>    memory pressure.
> 
> True.
> 
>>
>> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
>>    is needed, leading to memory waste and preventing partial reclaim.
> 
> Since you have PUD THP implementation, have you run any workload on it?
> How often you see a PUD THP split?
> 

Ah, so running non-upstream kernels in production is a bit more difficult
(and also risky). I was trying to use the 512M experiment on ARM as a comparison,
although I know it's not the same thing given the different PAGE_SIZE and pageblock order.

I can try some other upstream benchmarks if it helps? Although I will need to find
ones that create VMAs > 1G.

> Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
> any split stats to show the necessity of THP split?
> 
>>
>> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
>>    be easily shared with regular memory pools.
> 
> True.
> 
>>
>> PUD THP solves these limitations by integrating 1GB pages into the existing
>> THP infrastructure.
> 
> The main advantage of PUD THP over hugetlb is that it can be split and mapped
> at sub-folio level. Do you have any data to support the necessity of them?
> I wonder if it would be easier to just support 1GB folio in core-mm first
> and we can add 1GB THP split and sub-folio mapping later. With that, we
> can move hugetlb users to 1GB folio.
> 

I would say it's not the main advantage? But it's definitely one of them.
The 2 main areas where split would be helpful are partial-range munmap
and reclaim (MADV_PAGEOUT). For example, jemalloc/tcmalloc can now start
taking advantage of 1G pages. My knowledge of memory allocators is not
that great, but I believe they track how long certain areas have been cold
and can trigger reclaim, as an example. Then split will be useful (see the
sketch below). Having memory allocators use hugetlb is probably going to be a no?
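
As a rough illustration (my own sketch, not code from this series or from any
allocator, and it assumes the usual MADV_HUGEPAGE/THP knobs also cover the 1GB
size): an allocator that knows a 2MB chunk inside a 1GB-THP-backed arena has
gone cold could page out just that chunk, and under this series the
MADV_PAGEOUT would first trigger a PUD split, as exercised by the
reclaim_pageout selftest.

/* Hedged sketch: reclaim one cold 2MB chunk inside a region that is
 * hopefully backed by a 1GB THP. With this series, the madvise() below
 * would split the PUD mapping so that only the cold range is paged out. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define GB	(1UL << 30)
#define MB	(1UL << 20)

int main(void)
{
	/* Over-map by 1GB so we can hand the kernel a 1GB-aligned region. */
	char *raw = mmap(NULL, 2 * GB, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *buf;

	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	buf = (char *)(((uintptr_t)raw + GB - 1) & ~(GB - 1));

	madvise(buf, GB, MADV_HUGEPAGE);	/* assumes this also covers PUD THP */
	memset(buf, 0, GB);			/* fault the region in */

	/* The allocator decides bytes [256MB, 258MB) have been cold for a while. */
	if (madvise(buf + 256 * MB, 2 * MB, MADV_PAGEOUT))
		perror("madvise(MADV_PAGEOUT)");

	munmap(raw, 2 * GB);
	return 0;
}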


> BTW, without split support, you can apply HVO to 1GB folio to save memory.
> That is a disadvantage of PUD THP. Have you taken that into consideration?
> Basically, switching from hugetlb to PUD THP, you will lose memory due
> to vmemmap usage.
> 

Yeah, so HVO saves 16M per 1G, and the page table deposit mechanism adds ~2M per 1G.
We have HVO enabled in the Meta fleet. I think we should not only think of PUD THP
as a replacement for hugetlb, but also as enabling further use cases where hugetlb
would not be feasible.

After the basic infrastructure for 1G is there, we can work on optimizing it; I think
there would be a lot of interesting work we can do. HVO for 1G THP would be one
of them?

>>
>> Performance Results
>> ===================
>>
>> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>>
>> Test: True Random Memory Access [1] test of 4GB memory region with pointer
>> chasing workload (4M random pointer dereferences through memory):
>>
>> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
>> |-------------------|---------------|---------------|--------------|
>> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
>> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
>>
>> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
>> For long-running workloads this will be a one-off cost, and the 34%
>> improvement in access latency provides significant benefit.
>>
>> ARM with 64K PAGE_SZIE supports 512M PMD THPs. In meta, we have a CPU
>> bound workload running on a large number of ARM servers (256G). I enabled
>> the 512M THP settings to always for a 100 servers in production (didn't
>> really have high expectations :)). The average memory used for the workload
>> increased from 217G to 233G. The amount of memory backed by 512M pages was
>> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
>> by 5.9% (This is a very significant improvment in workload performance).
>> A significant number of these THPs were faulted in at application start when
>> were present across different VMAs. Ofcourse getting these 512M pages is
>> easier on ARM due to bigger PAGE_SIZE and pageblock order.
>>
>> I am hoping that these patches for 1G THP can be used to provide similar
>> benefits for x86. I expect workloads to fault them in at start time when there
>> is plenty of free memory available.
>>
>>
>> Previous attempt by Zi Yan
>> ==========================
>>
>> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
>> significant changes in kernel since then, including folio conversion, mTHP
>> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
>> code as reference for making 1G PUD THP work. I am hoping Zi can provide
>> guidance on these patches!
> 
> I am more than happy to help you. :)
> 

Thanks!!!

>>
>> Major Design Decisions
>> ======================
>>
>> 1. No shared 1G zero page: The memory cost would be quite significant!
>>
>> 2. Page Table Pre-deposit Strategy
>>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>>    page tables (one for each potential PMD entry after split).
>>    We allocate a PMD page table and use its pmd_huge_pte list to store
>>    the deposited PTE tables. This ensures split operations don't fail due
>>    to page table allocation failures (at the cost of 2M per PUD THP)
>>
>> 3. Split to Base Pages
>>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
>>    directly to base pages (262,144 PTEs). The ideal thing would be to split
>>    to 2M pages and then to 4K pages if needed. However, this would require
>>    significant rmap and mapcount tracking changes.
>>
>> 4. COW and fork handling via split
>>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
>>    uses existing PTE-level COW infrastructure. Getting another 1G region is
>>    hard and could fail. If only a 4K is written, copying 1G is a waste.
>>    Probably this should only be done on CoW and not fork?
>>
>> 5. Migration via split
>>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
>>    to find a 1G continguous memory to migrate to. Maybe its better to not
>>    allow migration of PUDs at all? I am more tempted to not allow migration,
>>    but have kept splitting in this RFC.
> 
> Without migration, PUD THP loses its flexibility and transparency. But with
> its 1GB size, I also wonder what the purpose of PUD THP migration can be.
> It does not create memory fragmentation, since it is the largest folio size
> we have and contiguous. NUMA balancing 1GB THP seems too much work.

Yeah, this is exactly what I was thinking as well. It is going to be expensive
and difficult to migrate 1G pages, and I am not sure if what we get out of it
is worth it? I kept the splitting code in this RFC as I wanted to show that
it's possible to split and migrate; the code to just reject migration is a lot simpler.

> 
> BTW, I posted many questions, but that does not mean I object the patchset.
> I just want to understand your use case better, reduce unnecessary
> code changes, and hopefully get it upstreamed this time. :)
> 
> Thank you for the work.
> 

Ah no, this is awesome! Thanks for the questions! It's basically the discussion I
wanted to start with the RFC.


[1] https://gist.github.com/uarif1/35dcd63f9d76048b07eb5c16ace85991
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Zi Yan 1 day, 9 hours ago
On 3 Feb 2026, at 18:29, Usama Arif wrote:

> On 02/02/2026 08:24, Zi Yan wrote:
>> On 1 Feb 2026, at 19:50, Usama Arif wrote:
>>
>>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>>> applications to benefit from reduced TLB pressure without requiring
>>> hugetlbfs. The patches are based on top of
>>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
>>
>> It is nice to see you are working on 1GB THP.
>>
>>>
>>> Motivation: Why 1GB THP over hugetlbfs?
>>> =======================================
>>>
>>> While hugetlbfs provides 1GB huge pages today, it has significant limitations
>>> that make it unsuitable for many workloads:
>>>
>>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>>>    or runtime, taking memory away. This requires capacity planning,
>>>    administrative overhead, and makes workload orchastration much much more
>>>    complex, especially colocating with workloads that don't use hugetlbfs.
>>
>> But you are using CMA, the same allocation mechanism as hugetlb_cma. What
>> is the difference?
>>
>
> So we dont really need to use CMA. CMA can help a lot ofcourse, but we dont *need* it.
> For e.g. I can run the very simple case [1] of trying to get 1G pages in the upstream
> kernel without CMA on my server and it works. The server has been up for more than a week
> (so pretty fragmented), is running a bunch of stuff in the background, uses 0 CMA memory,
> and I tried to get 20x1G pages on it and it worked.
> It uses folio_alloc_gigantic, which is exactly what this series uses:
>
> $ uptime -p
> up 1 week, 3 days, 5 hours, 7 minutes
> $ cat /proc/meminfo | grep -i cma
> CmaTotal:              0 kB
> CmaFree:               0 kB
> $ echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 20
> $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 20
> $ free -h
>                total        used        free      shared  buff/cache   available
> Mem:           1.0Ti       142Gi       292Gi       143Mi       583Gi       868Gi
> Swap:          129Gi       3.5Gi       126Gi
> $ ./map_1g_hugepages
> Mapping 20 x 1GB huge pages (20 GB total)
> Mapped at 0x7f43c0000000
> Touched page 0 at 0x7f43c0000000
> Touched page 1 at 0x7f4400000000
> Touched page 2 at 0x7f4440000000
> Touched page 3 at 0x7f4480000000
> Touched page 4 at 0x7f44c0000000
> Touched page 5 at 0x7f4500000000
> Touched page 6 at 0x7f4540000000
> Touched page 7 at 0x7f4580000000
> Touched page 8 at 0x7f45c0000000
> Touched page 9 at 0x7f4600000000
> Touched page 10 at 0x7f4640000000
> Touched page 11 at 0x7f4680000000
> Touched page 12 at 0x7f46c0000000
> Touched page 13 at 0x7f4700000000
> Touched page 14 at 0x7f4740000000
> Touched page 15 at 0x7f4780000000
> Touched page 16 at 0x7f47c0000000
> Touched page 17 at 0x7f4800000000
> Touched page 18 at 0x7f4840000000
> Touched page 19 at 0x7f4880000000
> Unmapped successfully
>

OK, I see the subtle difference among CMA, hugetlb_cma, and alloc_contig_pages(),
although CMA and hugetlb_cma use alloc_contig_pages() behind the scenes:

1. CMA and hugetlb_cma reserve some amount of memory at boot as MIGRATE_CMA,
and only CMA allocations are allowed there. It is a carveout.

2. alloc_contig_pages() without CMA needs to look for a contiguous physical
range without any unmovable pages or pinned movable pages, so that the
allocation can succeed.

Your example is quite optimistic, since the free memory is much bigger than
the requested 1GB pages, 292GB vs 20GB. Unless the worst-case scenario happens,
where every 1GB of free memory contains at least one unmovable page,
alloc_contig_pages() will succeed. But does it represent the production
environment, where free memory is scarce? And in that case, how long does
alloc_contig_pages() take to get 1GB of memory? Is that delay tolerable?
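
One simple way to put a number on that last question from user space (a
minimal sketch of mine, assuming MADV_HUGEPAGE also covers the 1GB size in
this series) is to time the first touch of each 1GB-aligned chunk of a
THP-eligible mapping:

/* Minimal user-space sketch to measure allocation + fault latency: time the
 * first touch of each 1GB-aligned chunk. If the kernel backs the chunk with
 * a 1GB folio, the first touch pays the whole allocation cost. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define GB (1UL << 30)

static double now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
	size_t n = 4;	/* 4GB region, matching the cover letter benchmark */
	char *raw = mmap(NULL, (n + 1) * GB, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *buf;

	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	buf = (char *)(((uintptr_t)raw + GB - 1) & ~(GB - 1));
	madvise(buf, n * GB, MADV_HUGEPAGE);	/* assumed to cover 1GB THP */

	for (size_t i = 0; i < n; i++) {
		double t0 = now_ms();

		buf[i * GB] = 1;	/* first touch of this 1GB chunk */
		printf("chunk %zu faulted in %.1f ms\n", i, now_ms() - t0);
	}
	return 0;
}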

This discussion all comes back to
“should we have a dedicated source for 1GB folios?” Yu Zhao’s TAO [1] was
interesting, since it has a dedicated zone for large folios, and split is
replaced by migrating after-split folios to a different zone. But how to
adjust that dedicated zone size is still not determined. Lots of ideas,
but no conclusion yet.

[1] https://lwn.net/Articles/964097/

>
>
>
>>>
>>> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>>>    rather than falling back to smaller pages. This makes it fragile under
>>>    memory pressure.
>>
>> True.
>>
>>>
>>> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
>>>    is needed, leading to memory waste and preventing partial reclaim.
>>
>> Since you have PUD THP implementation, have you run any workload on it?
>> How often you see a PUD THP split?
>>
>
> Ah so running non upstream kernels in production is a bit more difficult
> (and also risky). I was trying to use the 512M experiment on arm as a comparison,
> although I know its not the same thing with PAGE_SIZE and pageblock order.
>
> I can try some other upstream benchmarks if it helps? Although will need to find
> ones that create VMA > 1G.

I think getting split stats from ARM 512MB PMD THP can give some clues about
1GB THP, since the THP sizes are similar (yeah, the base-page-to-THP size
ratios differ by 32x, but the gap between base page size and THP size is
still much bigger than for 4KB vs 2MB).

>
>> Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
>> any split stats to show the necessity of THP split?
>>
>>>
>>> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
>>>    be easily shared with regular memory pools.
>>
>> True.
>>
>>>
>>> PUD THP solves these limitations by integrating 1GB pages into the existing
>>> THP infrastructure.
>>
>> The main advantage of PUD THP over hugetlb is that it can be split and mapped
>> at sub-folio level. Do you have any data to support the necessity of them?
>> I wonder if it would be easier to just support 1GB folio in core-mm first
>> and we can add 1GB THP split and sub-folio mapping later. With that, we
>> can move hugetlb users to 1GB folio.
>>
>
> I would say its not the main advantage? But its definitely one of them.
> The 2 main areas where split would be helpful is munmap partial
> range and reclaim (MADV_PAGEOUT). For e.g. jemalloc/tcmalloc can now start
> taking advantge of 1G pages. My knowledge is not that great when it comes
> to memory allocators, but I believe they track for how long certain areas
> have been cold and can trigger reclaim as an example. Then split will be useful.
> Having memory allocators use hugetlb is probably going to be a no?

To take advantage of 1GB pages, memory allocators would want to keep that
whole GB mapped by a PUD, otherwise TLB-wise there is no difference from
using 2MB pages, right? I guess memory allocators would want to promote
a set of stable memory objects to 1GB and demote them from 1GB if any of
them are gone (promote by migrating them into a 1GB folio, demote by
migrating them out of a 1GB folio), and this can avoid splits.

>
>
>> BTW, without split support, you can apply HVO to 1GB folio to save memory.
>> That is a disadvantage of PUD THP. Have you taken that into consideration?
>> Basically, switching from hugetlb to PUD THP, you will lose memory due
>> to vmemmap usage.
>>
>
> Yeah so HVO saves 16M per 1G, and the page depost mechanism adds ~2M as per 1G.
> We have HVO enabled in the meta fleet. I think we should not only think of PUD THP
> as a replacement for hugetlb, but to also enable further usescases where hugetlb
> would not be feasible.
>
> Ater the basic infrastructure for 1G is there, we can work on optimizing, I think
> there would be a a lot of interesting work we can do. HVO for 1G THP would be one
> of them?

HVO would prevent folio split, right? Since most of the struct pages are
mapped to the same memory area, you would need to allocate more memory,
16MB, to split 1GB. That further decreases the motivation for splitting 1GB.

>
>>>
>>> Performance Results
>>> ===================
>>>
>>> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>>>
>>> Test: True Random Memory Access [1] test of 4GB memory region with pointer
>>> chasing workload (4M random pointer dereferences through memory):
>>>
>>> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
>>> |-------------------|---------------|---------------|--------------|
>>> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
>>> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
>>>
>>> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
>>> For long-running workloads this will be a one-off cost, and the 34%
>>> improvement in access latency provides significant benefit.
>>>
>>> ARM with 64K PAGE_SZIE supports 512M PMD THPs. In meta, we have a CPU
>>> bound workload running on a large number of ARM servers (256G). I enabled
>>> the 512M THP settings to always for a 100 servers in production (didn't
>>> really have high expectations :)). The average memory used for the workload
>>> increased from 217G to 233G. The amount of memory backed by 512M pages was
>>> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
>>> by 5.9% (This is a very significant improvment in workload performance).
>>> A significant number of these THPs were faulted in at application start when
>>> were present across different VMAs. Ofcourse getting these 512M pages is
>>> easier on ARM due to bigger PAGE_SIZE and pageblock order.
>>>
>>> I am hoping that these patches for 1G THP can be used to provide similar
>>> benefits for x86. I expect workloads to fault them in at start time when there
>>> is plenty of free memory available.
>>>
>>>
>>> Previous attempt by Zi Yan
>>> ==========================
>>>
>>> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
>>> significant changes in kernel since then, including folio conversion, mTHP
>>> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
>>> code as reference for making 1G PUD THP work. I am hoping Zi can provide
>>> guidance on these patches!
>>
>> I am more than happy to help you. :)
>>
>
> Thanks!!!
>
>>>
>>> Major Design Decisions
>>> ======================
>>>
>>> 1. No shared 1G zero page: The memory cost would be quite significant!
>>>
>>> 2. Page Table Pre-deposit Strategy
>>>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>>>    page tables (one for each potential PMD entry after split).
>>>    We allocate a PMD page table and use its pmd_huge_pte list to store
>>>    the deposited PTE tables. This ensures split operations don't fail due
>>>    to page table allocation failures (at the cost of 2M per PUD THP)
>>>
>>> 3. Split to Base Pages
>>>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
>>>    directly to base pages (262,144 PTEs). The ideal thing would be to split
>>>    to 2M pages and then to 4K pages if needed. However, this would require
>>>    significant rmap and mapcount tracking changes.
>>>
>>> 4. COW and fork handling via split
>>>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
>>>    uses existing PTE-level COW infrastructure. Getting another 1G region is
>>>    hard and could fail. If only a 4K is written, copying 1G is a waste.
>>>    Probably this should only be done on CoW and not fork?
>>>
>>> 5. Migration via split
>>>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
>>>    to find a 1G continguous memory to migrate to. Maybe its better to not
>>>    allow migration of PUDs at all? I am more tempted to not allow migration,
>>>    but have kept splitting in this RFC.
>>
>> Without migration, PUD THP loses its flexibility and transparency. But with
>> its 1GB size, I also wonder what the purpose of PUD THP migration can be.
>> It does not create memory fragmentation, since it is the largest folio size
>> we have and contiguous. NUMA balancing 1GB THP seems too much work.
>
> Yeah this is exactly what I was thinking as well. It is going to be expensive
> and difficult to migrate 1G pages, and I am not sure if what we get out of it
> is worth it? I kept the splitting code in this RFC as I wanted to show that
> its possible to split and migrate and the rejecting migration code is a lot easier.

Got it. Maybe reframing this patchset as 1GB folio support without split or
migration is better?

>
>>
>> BTW, I posted many questions, but that does not mean I object the patchset.
>> I just want to understand your use case better, reduce unnecessary
>> code changes, and hopefully get it upstreamed this time. :)
>>
>> Thank you for the work.
>>
>
> Ah no this is awesome! Thanks for the questions! Its basically the discussion I
> wanted to start with the RFC.
>
>
> [1] https://gist.github.com/uarif1/35dcd63f9d76048b07eb5c16ace85991


Best Regards,
Yan, Zi
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Frank van der Linden 3 days, 3 hours ago
On Tue, Feb 3, 2026 at 3:29 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 02/02/2026 08:24, Zi Yan wrote:
> > On 1 Feb 2026, at 19:50, Usama Arif wrote:
> >
> >> This is an RFC series to implement 1GB PUD-level THPs, allowing
> >> applications to benefit from reduced TLB pressure without requiring
> >> hugetlbfs. The patches are based on top of
> >> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
> >
> > It is nice to see you are working on 1GB THP.
> >
> >>
> >> Motivation: Why 1GB THP over hugetlbfs?
> >> =======================================
> >>
> >> While hugetlbfs provides 1GB huge pages today, it has significant limitations
> >> that make it unsuitable for many workloads:
> >>
> >> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
> >>    or runtime, taking memory away. This requires capacity planning,
> >>    administrative overhead, and makes workload orchastration much much more
> >>    complex, especially colocating with workloads that don't use hugetlbfs.
> >
> > But you are using CMA, the same allocation mechanism as hugetlb_cma. What
> > is the difference?
> >
>
> So we dont really need to use CMA. CMA can help a lot ofcourse, but we dont *need* it.
> For e.g. I can run the very simple case [1] of trying to get 1G pages in the upstream
> kernel without CMA on my server and it works. The server has been up for more than a week
> (so pretty fragmented), is running a bunch of stuff in the background, uses 0 CMA memory,
> and I tried to get 20x1G pages on it and it worked.
> It uses folio_alloc_gigantic, which is exactly what this series uses:
>
> $ uptime -p
> up 1 week, 3 days, 5 hours, 7 minutes
> $ cat /proc/meminfo | grep -i cma
> CmaTotal:              0 kB
> CmaFree:               0 kB
> $ echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 20
> $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 20
> $ free -h
>                total        used        free      shared  buff/cache   available
> Mem:           1.0Ti       142Gi       292Gi       143Mi       583Gi       868Gi
> Swap:          129Gi       3.5Gi       126Gi
> $ ./map_1g_hugepages
> Mapping 20 x 1GB huge pages (20 GB total)
> Mapped at 0x7f43c0000000
> Touched page 0 at 0x7f43c0000000
> Touched page 1 at 0x7f4400000000
> Touched page 2 at 0x7f4440000000
> Touched page 3 at 0x7f4480000000
> Touched page 4 at 0x7f44c0000000
> Touched page 5 at 0x7f4500000000
> Touched page 6 at 0x7f4540000000
> Touched page 7 at 0x7f4580000000
> Touched page 8 at 0x7f45c0000000
> Touched page 9 at 0x7f4600000000
> Touched page 10 at 0x7f4640000000
> Touched page 11 at 0x7f4680000000
> Touched page 12 at 0x7f46c0000000
> Touched page 13 at 0x7f4700000000
> Touched page 14 at 0x7f4740000000
> Touched page 15 at 0x7f4780000000
> Touched page 16 at 0x7f47c0000000
> Touched page 17 at 0x7f4800000000
> Touched page 18 at 0x7f4840000000
> Touched page 19 at 0x7f4880000000
> Unmapped successfully
>
>
>
>
> >>
> >> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
> >>    rather than falling back to smaller pages. This makes it fragile under
> >>    memory pressure.
> >
> > True.
> >
> >>
> >> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
> >>    is needed, leading to memory waste and preventing partial reclaim.
> >
> > Since you have PUD THP implementation, have you run any workload on it?
> > How often you see a PUD THP split?
> >
>
> Ah so running non upstream kernels in production is a bit more difficult
> (and also risky). I was trying to use the 512M experiment on arm as a comparison,
> although I know its not the same thing with PAGE_SIZE and pageblock order.
>
> I can try some other upstream benchmarks if it helps? Although will need to find
> ones that create VMA > 1G.
>
> > Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
> > any split stats to show the necessity of THP split?
> >
> >>
> >> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
> >>    be easily shared with regular memory pools.
> >
> > True.
> >
> >>
> >> PUD THP solves these limitations by integrating 1GB pages into the existing
> >> THP infrastructure.
> >
> > The main advantage of PUD THP over hugetlb is that it can be split and mapped
> > at sub-folio level. Do you have any data to support the necessity of them?
> > I wonder if it would be easier to just support 1GB folio in core-mm first
> > and we can add 1GB THP split and sub-folio mapping later. With that, we
> > can move hugetlb users to 1GB folio.
> >
>
> I would say its not the main advantage? But its definitely one of them.
> The 2 main areas where split would be helpful is munmap partial
> range and reclaim (MADV_PAGEOUT). For e.g. jemalloc/tcmalloc can now start
> taking advantge of 1G pages. My knowledge is not that great when it comes
> to memory allocators, but I believe they track for how long certain areas
> have been cold and can trigger reclaim as an example. Then split will be useful.
> Having memory allocators use hugetlb is probably going to be a no?
>
>
> > BTW, without split support, you can apply HVO to 1GB folio to save memory.
> > That is a disadvantage of PUD THP. Have you taken that into consideration?
> > Basically, switching from hugetlb to PUD THP, you will lose memory due
> > to vmemmap usage.
> >
>
> Yeah so HVO saves 16M per 1G, and the page depost mechanism adds ~2M as per 1G.
> We have HVO enabled in the meta fleet. I think we should not only think of PUD THP
> as a replacement for hugetlb, but to also enable further usescases where hugetlb
> would not be feasible.
>
> Ater the basic infrastructure for 1G is there, we can work on optimizing, I think
> there would be a a lot of interesting work we can do. HVO for 1G THP would be one
> of them?
>
> >>
> >> Performance Results
> >> ===================
> >>
> >> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
> >>
> >> Test: True Random Memory Access [1] test of 4GB memory region with pointer
> >> chasing workload (4M random pointer dereferences through memory):
> >>
> >> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
> >> |-------------------|---------------|---------------|--------------|
> >> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
> >> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
> >>
> >> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
> >> For long-running workloads this will be a one-off cost, and the 34%
> >> improvement in access latency provides significant benefit.
> >>
> >> ARM with 64K PAGE_SZIE supports 512M PMD THPs. In meta, we have a CPU
> >> bound workload running on a large number of ARM servers (256G). I enabled
> >> the 512M THP settings to always for a 100 servers in production (didn't
> >> really have high expectations :)). The average memory used for the workload
> >> increased from 217G to 233G. The amount of memory backed by 512M pages was
> >> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
> >> by 5.9% (This is a very significant improvment in workload performance).
> >> A significant number of these THPs were faulted in at application start when
> >> were present across different VMAs. Ofcourse getting these 512M pages is
> >> easier on ARM due to bigger PAGE_SIZE and pageblock order.
> >>
> >> I am hoping that these patches for 1G THP can be used to provide similar
> >> benefits for x86. I expect workloads to fault them in at start time when there
> >> is plenty of free memory available.
> >>
> >>
> >> Previous attempt by Zi Yan
> >> ==========================
> >>
> >> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
> >> significant changes in kernel since then, including folio conversion, mTHP
> >> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
> >> code as reference for making 1G PUD THP work. I am hoping Zi can provide
> >> guidance on these patches!
> >
> > I am more than happy to help you. :)
> >
>
> Thanks!!!
>
> >>
> >> Major Design Decisions
> >> ======================
> >>
> >> 1. No shared 1G zero page: The memory cost would be quite significant!
> >>
> >> 2. Page Table Pre-deposit Strategy
> >>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
> >>    page tables (one for each potential PMD entry after split).
> >>    We allocate a PMD page table and use its pmd_huge_pte list to store
> >>    the deposited PTE tables. This ensures split operations don't fail due
> >>    to page table allocation failures (at the cost of 2M per PUD THP)
> >>
> >> 3. Split to Base Pages
> >>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
> >>    directly to base pages (262,144 PTEs). The ideal thing would be to split
> >>    to 2M pages and then to 4K pages if needed. However, this would require
> >>    significant rmap and mapcount tracking changes.
> >>
> >> 4. COW and fork handling via split
> >>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
> >>    uses existing PTE-level COW infrastructure. Getting another 1G region is
> >>    hard and could fail. If only a 4K is written, copying 1G is a waste.
> >>    Probably this should only be done on CoW and not fork?
> >>
> >> 5. Migration via split
> >>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
> >>    to find a 1G continguous memory to migrate to. Maybe its better to not
> >>    allow migration of PUDs at all? I am more tempted to not allow migration,
> >>    but have kept splitting in this RFC.
> >
> > Without migration, PUD THP loses its flexibility and transparency. But with
> > its 1GB size, I also wonder what the purpose of PUD THP migration can be.
> > It does not create memory fragmentation, since it is the largest folio size
> > we have and contiguous. NUMA balancing 1GB THP seems too much work.
>
> Yeah this is exactly what I was thinking as well. It is going to be expensive
> and difficult to migrate 1G pages, and I am not sure if what we get out of it
> is worth it? I kept the splitting code in this RFC as I wanted to show that
> its possible to split and migrate and the rejecting migration code is a lot easier.
>
> >
> > BTW, I posted many questions, but that does not mean I object the patchset.
> > I just want to understand your use case better, reduce unnecessary
> > code changes, and hopefully get it upstreamed this time. :)
> >
> > Thank you for the work.
> >
>
> Ah no this is awesome! Thanks for the questions! Its basically the discussion I
> wanted to start with the RFC.
>
>
> [1] https://gist.github.com/uarif1/35dcd63f9d76048b07eb5c16ace85991
>
>

It looks like the scenario you're going for is an application that
allocates a sizeable chunk of memory upfront, and would like it to be
1G pages as much as possible, right?

You can do that with 1G THPs, the advantage being that any failures to
get 1G pages are not explicit, so you're not left with having to grow
the number of hugetlb pages yourself, and see how many you can use.

1G THPs seem useful for that. I don't recall all of the discussion
here, but I assume that hooking 1G THP support into khugepaged is
quite something else - the potential churn to get a 1G page could
well cause more system interference than you'd like.

The CMA scenario Rik was talking about is similar: you set
hugetlb_cma=NG, and then, when you need 1G pages, you grow the hugetlb
pool and use them. Disadvantage: you have to do it explicitly.

However, hugetlb_cma does give you a much larger chance of getting
those 1G pages. The example you give, 20 1G pages on a 1T system where
there is 292G free, isn't much of a problem in my experience. You
should have no problem getting that amount of 1G pages. Things get
more difficult when most of your memory is taken - hugetlb_cma really
helps there. E.g. we have systems that have 90% hugetlb_cma, and there
is a pretty good success rate converting back and forth between
hugetlb and normal page allocator pages with hugetlb_cma, while
operating close to that 90% hugetlb coverage. Without CMA, the success
rate drops quite a bit at that level.

CMA balancing is a related issue, for hugetlb. It fixes a problem that
has been known for years: the more memory you set aside for movable
only allocations (e.g. hugetlb_cma), the less breathing room you have
for unmovable allocations. So you risk the 'false OOM' scenario, where
the kernel can't make an unmovable allocation, even though there is
enough memory available, even outside of CMA. It's just that those
MOVABLE pageblocks were used for movable allocations. So ideally, you
would migrate those movable allocations to CMA under those
circumstances. Which is what CMA balancing does. It's worked out very
well for us in the scenario I list above (most memory being
hugetlb_cma).

Anyway, I'm rambling on a bit. Let's see if I got this right:

1G THP
  - advantage: transparent interface
  - disadvantages: no HVO, lower success rate under higher memory pressure
    than hugetlb_cma

hugetlb_cma
  - disadvantages: explicit interface, for higher values needs 'false OOM'
    avoidance
  - advantage: better success rate under pressure.

I think 1G THPs are a good solution for "nice to have" scenarios, but
there will still be use cases where a higher success rate matters and
HugeTLB is preferred.

Lastly, there's also the ZONE_MOVABLE story. I think 1G THPs and
ZONE_MOVABLE could work well together, improving the success rate. But
then the issue of pinning raises its head again, and whether that
should be allowed or configurable per zone...

- Frank
Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Posted by Usama Arif 1 day, 21 hours ago

On 03/02/2026 16:08, Frank van der Linden wrote:
> On Tue, Feb 3, 2026 at 3:29 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 02/02/2026 08:24, Zi Yan wrote:
>>> On 1 Feb 2026, at 19:50, Usama Arif wrote:
>>>
>>>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>>>> applications to benefit from reduced TLB pressure without requiring
>>>> hugetlbfs. The patches are based on top of
>>>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
>>>
>>> It is nice to see you are working on 1GB THP.
>>>
>>>>
>>>> Motivation: Why 1GB THP over hugetlbfs?
>>>> =======================================
>>>>
>>>> While hugetlbfs provides 1GB huge pages today, it has significant limitations
>>>> that make it unsuitable for many workloads:
>>>>
>>>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>>>>    or runtime, taking memory away. This requires capacity planning,
>>>>    administrative overhead, and makes workload orchastration much much more
>>>>    complex, especially colocating with workloads that don't use hugetlbfs.
>>>
>>> But you are using CMA, the same allocation mechanism as hugetlb_cma. What
>>> is the difference?
>>>
>>
>> So we dont really need to use CMA. CMA can help a lot ofcourse, but we dont *need* it.
>> For e.g. I can run the very simple case [1] of trying to get 1G pages in the upstream
>> kernel without CMA on my server and it works. The server has been up for more than a week
>> (so pretty fragmented), is running a bunch of stuff in the background, uses 0 CMA memory,
>> and I tried to get 20x1G pages on it and it worked.
>> It uses folio_alloc_gigantic, which is exactly what this series uses:
>>
>> $ uptime -p
>> up 1 week, 3 days, 5 hours, 7 minutes
>> $ cat /proc/meminfo | grep -i cma
>> CmaTotal:              0 kB
>> CmaFree:               0 kB
>> $ echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> 20
>> $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> 20
>> $ free -h
>>                total        used        free      shared  buff/cache   available
>> Mem:           1.0Ti       142Gi       292Gi       143Mi       583Gi       868Gi
>> Swap:          129Gi       3.5Gi       126Gi
>> $ ./map_1g_hugepages
>> Mapping 20 x 1GB huge pages (20 GB total)
>> Mapped at 0x7f43c0000000
>> Touched page 0 at 0x7f43c0000000
>> Touched page 1 at 0x7f4400000000
>> Touched page 2 at 0x7f4440000000
>> Touched page 3 at 0x7f4480000000
>> Touched page 4 at 0x7f44c0000000
>> Touched page 5 at 0x7f4500000000
>> Touched page 6 at 0x7f4540000000
>> Touched page 7 at 0x7f4580000000
>> Touched page 8 at 0x7f45c0000000
>> Touched page 9 at 0x7f4600000000
>> Touched page 10 at 0x7f4640000000
>> Touched page 11 at 0x7f4680000000
>> Touched page 12 at 0x7f46c0000000
>> Touched page 13 at 0x7f4700000000
>> Touched page 14 at 0x7f4740000000
>> Touched page 15 at 0x7f4780000000
>> Touched page 16 at 0x7f47c0000000
>> Touched page 17 at 0x7f4800000000
>> Touched page 18 at 0x7f4840000000
>> Touched page 19 at 0x7f4880000000
>> Unmapped successfully
>>
>>
>>
>>
>>>>
>>>> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>>>>    rather than falling back to smaller pages. This makes it fragile under
>>>>    memory pressure.
>>>
>>> True.
>>>
>>>>
>>>> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
>>>>    is needed, leading to memory waste and preventing partial reclaim.
>>>
>>> Since you have PUD THP implementation, have you run any workload on it?
>>> How often you see a PUD THP split?
>>>
>>
>> Ah so running non upstream kernels in production is a bit more difficult
>> (and also risky). I was trying to use the 512M experiment on arm as a comparison,
>> although I know its not the same thing with PAGE_SIZE and pageblock order.
>>
>> I can try some other upstream benchmarks if it helps? Although will need to find
>> ones that create VMA > 1G.
>>
>>> Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
>>> any split stats to show the necessity of THP split?
>>>
>>>>
>>>> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
>>>>    be easily shared with regular memory pools.
>>>
>>> True.
>>>
>>>>
>>>> PUD THP solves these limitations by integrating 1GB pages into the existing
>>>> THP infrastructure.
>>>
>>> The main advantage of PUD THP over hugetlb is that it can be split and mapped
>>> at sub-folio level. Do you have any data to support the necessity of them?
>>> I wonder if it would be easier to just support 1GB folio in core-mm first
>>> and we can add 1GB THP split and sub-folio mapping later. With that, we
>>> can move hugetlb users to 1GB folio.
>>>
>>
>> I would say its not the main advantage? But its definitely one of them.
>> The 2 main areas where split would be helpful is munmap partial
>> range and reclaim (MADV_PAGEOUT). For e.g. jemalloc/tcmalloc can now start
>> taking advantge of 1G pages. My knowledge is not that great when it comes
>> to memory allocators, but I believe they track for how long certain areas
>> have been cold and can trigger reclaim as an example. Then split will be useful.
>> Having memory allocators use hugetlb is probably going to be a no?
>>
>>
>>> BTW, without split support, you can apply HVO to 1GB folio to save memory.
>>> That is a disadvantage of PUD THP. Have you taken that into consideration?
>>> Basically, switching from hugetlb to PUD THP, you will lose memory due
>>> to vmemmap usage.
>>>
>>
>> Yeah so HVO saves 16M per 1G, and the page depost mechanism adds ~2M as per 1G.
>> We have HVO enabled in the meta fleet. I think we should not only think of PUD THP
>> as a replacement for hugetlb, but to also enable further usescases where hugetlb
>> would not be feasible.
>>
>> Ater the basic infrastructure for 1G is there, we can work on optimizing, I think
>> there would be a a lot of interesting work we can do. HVO for 1G THP would be one
>> of them?
>>
>>>>
>>>> Performance Results
>>>> ===================
>>>>
>>>> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>>>>
>>>> Test: True Random Memory Access [1] test of 4GB memory region with pointer
>>>> chasing workload (4M random pointer dereferences through memory):
>>>>
>>>> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
>>>> |-------------------|---------------|---------------|--------------|
>>>> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
>>>> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
>>>>
>>>> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
>>>> For long-running workloads this will be a one-off cost, and the 34%
>>>> improvement in access latency provides significant benefit.
>>>>
>>>> ARM with 64K PAGE_SZIE supports 512M PMD THPs. In meta, we have a CPU
>>>> bound workload running on a large number of ARM servers (256G). I enabled
>>>> the 512M THP settings to always for a 100 servers in production (didn't
>>>> really have high expectations :)). The average memory used for the workload
>>>> increased from 217G to 233G. The amount of memory backed by 512M pages was
>>>> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
>>>> by 5.9% (This is a very significant improvment in workload performance).
>>>> A significant number of these THPs were faulted in at application start when
>>>> were present across different VMAs. Ofcourse getting these 512M pages is
>>>> easier on ARM due to bigger PAGE_SIZE and pageblock order.
>>>>
>>>> I am hoping that these patches for 1G THP can be used to provide similar
>>>> benefits for x86. I expect workloads to fault them in at start time when there
>>>> is plenty of free memory available.
>>>>
>>>>
>>>> Previous attempt by Zi Yan
>>>> ==========================
>>>>
>>>> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
>>>> significant changes in kernel since then, including folio conversion, mTHP
>>>> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
>>>> code as reference for making 1G PUD THP work. I am hoping Zi can provide
>>>> guidance on these patches!
>>>
>>> I am more than happy to help you. :)
>>>
>>
>> Thanks!!!
>>
>>>>
>>>> Major Design Decisions
>>>> ======================
>>>>
>>>> 1. No shared 1G zero page: The memory cost would be quite significant!
>>>>
>>>> 2. Page Table Pre-deposit Strategy
>>>>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>>>>    page tables (one for each potential PMD entry after split).
>>>>    We allocate a PMD page table and use its pmd_huge_pte list to store
>>>>    the deposited PTE tables. This ensures split operations don't fail due
>>>>    to page table allocation failures (at the cost of 2M per PUD THP)
>>>>
>>>> 3. Split to Base Pages
>>>>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
>>>>    directly to base pages (262,144 PTEs). The ideal thing would be to split
>>>>    to 2M pages and then to 4K pages if needed. However, this would require
>>>>    significant rmap and mapcount tracking changes.
>>>>
>>>> 4. COW and fork handling via split
>>>>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
>>>>    uses existing PTE-level COW infrastructure. Getting another 1G region is
>>>>    hard and could fail. If only a 4K is written, copying 1G is a waste.
>>>>    Probably this should only be done on CoW and not fork?
>>>>
>>>> 5. Migration via split
>>>>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
>>>>    to find a 1G continguous memory to migrate to. Maybe its better to not
>>>>    allow migration of PUDs at all? I am more tempted to not allow migration,
>>>>    but have kept splitting in this RFC.
>>>
>>> Without migration, PUD THP loses its flexibility and transparency. But with
>>> its 1GB size, I also wonder what the purpose of PUD THP migration can be.
>>> It does not create memory fragmentation, since it is the largest folio size
>>> we have and contiguous. NUMA balancing 1GB THP seems too much work.
>>
>> Yeah this is exactly what I was thinking as well. It is going to be expensive
>> and difficult to migrate 1G pages, and I am not sure if what we get out of it
>> is worth it? I kept the splitting code in this RFC as I wanted to show that
>> its possible to split and migrate and the rejecting migration code is a lot easier.
>>
>>>
>>> BTW, I posted many questions, but that does not mean I object the patchset.
>>> I just want to understand your use case better, reduce unnecessary
>>> code changes, and hopefully get it upstreamed this time. :)
>>>
>>> Thank you for the work.
>>>
>>
>> Ah no this is awesome! Thanks for the questions! Its basically the discussion I
>> wanted to start with the RFC.
>>
>>
>> [1] https://gist.github.com/uarif1/35dcd63f9d76048b07eb5c16ace85991
>>
>>
> 
> It looks like the scenario you're going for is an application that
> allocates a sizeable chunk of memory upfront, and would like it to be
> 1G pages as much as possible, right?
> 

Hello!

Yes. But it also doesn't need to be a single chunk (VMA).

> You can do that with 1G THPs, the advantage being that any failures to
> get 1G pages are not explicit, so you're not left with having to grow
> the number of hugetlb pages yourself, and see how many you can use.
> 
> 1G THPs seem useful for that. I don't recall all of the discussion
> here, but I assume that hooking 1G THP support in to khugepaged is
> quite something else - the potential churn to get an 1G page could
> well cause more system interference than you'd like.
> 

Yes completely agree.

> The CMA scenario Rik was talking about is similar: you set
> hugetlb_cma=NG, and then, when you need 1G pages, you grow the hugetlb
> pool and use them. Disadvantage: you have to do it explicitly.
> 
> However, hugetlb_cma does give you a much larger chance of getting
> those 1G pages. The example you give, 20 1G pages on a 1T system where
> there is 292G free, isn't much of a problem in my experience. You
> should have no problem getting that amount of 1G pages. Things get
> more difficult when most of your memory is taken - hugetlb_cma really
> helps there. E.g. we have systems that have 90% hugetlb_cma, and there
> is a pretty good success rate converting back and forth between
> hugetlb and normal page allocator pages with hugetlb_cma, while
> operating close to that 90% hugetlb coverage. Without CMA, the success
> rate drops quite a bit at that level.

Yes agreed.
> 
> CMA balancing is a related issue, for hugetlb. It fixes a problem that
> has been known for years: the more memory you set aside for movable
> only allocations (e.g. hugetlb_cma), the less breathing room you have
> for unmovable allocations. So you risk the 'false OOM' scenario, where
> the kernel can't make an unmovable allocation, even though there is
> enough memory available, even outside of CMA. It's just that those
> MOVABLE pageblocks were used for movable allocations. So ideally, you
> would migrate those movable allocations to CMA under those
> circumstances. Which is what CMA balancing does. It's worked out very
> well for us in the scenario I list above (most memory being
> hugetlb_cma).
> 
> Anyway, I'm rambling on a bit. Let's see if I got this right:
> 
> 1G THP
>   - advantages: transparent interface
>   - disadvantage: no HVO, lower success rate under higher memory
> pressure than hugetlb_cma
> 

Yes! But also, I think the problem of having no HVO for THPs can be worked
on once the basic support is there. The lower success rate is a much more
difficult problem to solve.

> hugetlb_cma
>    - disadvantage: explicit interface, for higher values needs 'false
> OOM' avoidance
>    - advange: better success rate under pressure.
> 
> I think 1G THPs are a good solution for "nice to have" scenarios, but
> there will still be use cases where a higher success rate is preferred
> and HugeTLB is preferred.
> 

Agreed. I don't think 1G THPs can completely replace hugetlb. Maybe after
several years of work to optimize it there might be a path to that, but
not at the very start.


> Lastly, there's also the ZONE_MOVABLE story. I think 1G THPs and
> ZONE_MOVABLE could work well together, improving the success rate. But
> then the issue of pinning raise its head again, and whether that
> should be allowed or configurable per zone..
> 

Ack

> - Frank