Hi,
The motivation for this patch series is guest_memfd, which would like
to use HugeTLB as a generic source of huge pages but not adopt
HugeTLB's reservation at mmap() time.
By refactoring alloc_hugetlb_folio() and some dependent functions,
there is now an option to allocate HugeTLB folios without providing a
VMA. Specifically, HugeTLB allocation used to depend on the VMA to:
1. Look up reservations in the resv_map
2. Get the mempolicy (mpol), stored at vma->vm_policy
This refactoring provides hugetlb_alloc_folio(), which focuses on just
the allocation itself, and associated memory and HugeTLB charging
(cgroups). alloc_hugetlb_folio() still handles reservations in the
resv_map and subpools.
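As a rough sketch of the intended split, the flow might look something like the following. This is a userspace model only: the structs and helpers below are made-up stand-ins for illustration, not the kernel's actual types or the real function bodies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for kernel types; not the real definitions. */
struct folio { int charged; };
struct vma { bool has_reservation; };

static struct folio pool_folio;

/*
 * Model of the proposed hugetlb_alloc_folio(): just the allocation
 * plus memcg/hugetlb-cgroup charging -- no VMA, no resv_map.
 */
static struct folio *hugetlb_alloc_folio_model(bool charge_ok)
{
	if (!charge_ok)
		return NULL;          /* charge failed: nothing to undo */
	pool_folio.charged = 1;       /* charge is bound to the folio */
	return &pool_folio;
}

/*
 * Model of alloc_hugetlb_folio(): keeps the VMA-specific work
 * (resv_map lookup, subpool accounting) and delegates the actual
 * allocation and charging.
 */
static struct folio *alloc_hugetlb_folio_model(struct vma *vma,
					       bool charge_ok)
{
	/* Stand-in for the resv_map lookup done via the VMA. */
	bool consumed_reservation = vma->has_reservation;
	struct folio *folio = hugetlb_alloc_folio_model(charge_ok);

	if (folio && consumed_reservation)
		vma->has_reservation = false; /* resv/subpool bookkeeping */
	return folio;
}
```

The point of the model is only the shape of the call graph: a VMA-free inner function that callers like guest_memfd could use directly, wrapped by a VMA-aware outer function for the existing paths.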
Regarding naming, I'm definitely open to alternative names :) I chose
hugetlb_alloc_folio() because I see this function as a general
allocation function provided by the HugeTLB subsystem (hence the
hugetlb_ prefix). I intend for alloc_hugetlb_folio() to later be
refactored into a static function for use just by HugeTLB, and
HugeTLBfs should probably use hugetlb_alloc_folio() directly.
I would like to get feedback on:
1. Opening up HugeTLB's allocation for more generic use
2. Reverting and re-adopting the try-commit-cancel protocol for memory
charging
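For reviewers less familiar with the protocol, the try-commit-cancel flow can be modeled in userspace roughly as follows. This is an illustrative toy, not the memcg API; the counter, limit, and all function names here are made up.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy counter standing in for a memcg's page counter. */
static long charged_pages;
static const long limit = 4;

/* "Try": provisionally charge; fail with no side effects if over limit. */
static bool memcg_try_charge(long nr_pages)
{
	if (charged_pages + nr_pages > limit)
		return false;
	charged_pages += nr_pages;
	return true;
}

/* "Cancel": undo a provisional charge when allocation fails. */
static void memcg_cancel_charge(long nr_pages)
{
	charged_pages -= nr_pages;
}

/*
 * Model of the try-commit-cancel flow: the charge is taken before the
 * (possibly expensive) allocation, so hitting the memcg limit avoids
 * the allocation entirely.
 */
static bool alloc_with_try_commit_cancel(long nr_pages, bool alloc_succeeds)
{
	if (!memcg_try_charge(nr_pages))
		return false;           /* at limit: skip allocation */
	if (!alloc_succeeds) {
		memcg_cancel_charge(nr_pages);
		return false;           /* allocation failed: roll back */
	}
	/* "Commit": in the kernel this binds the charge to the folio. */
	return true;
}
```

The contrast with the optimistic scheme is that there, the folio is grabbed first and returned if the charge fails; try-commit-cancel pays the bookkeeping up front instead.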
To see how hugetlb_alloc_folio() is used by guest_memfd, the most
recent patch series that uses this more generic HugeTLB allocation
routine is at [1], and a newer revision of that patch series is at
[2].
Independently of guest_memfd, I believe this change is useful in
simplifying alloc_hugetlb_folio(). alloc_hugetlb_folio() was so
coupled to a VMA that even HugeTLBfs allocates HugeTLB folios using a
pseudo-VMA.
[1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
[2] https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
Ackerley Tng (7):
mm: hugetlb: Consolidate interpretation of gbl_chg within
alloc_hugetlb_folio()
mm: hugetlb: Move mpol interpretation out of
alloc_buddy_hugetlb_folio_with_mpol()
mm: hugetlb: Move mpol interpretation out of
dequeue_hugetlb_folio_vma()
Revert "memcg/hugetlb: remove memcg hugetlb try-commit-cancel
protocol"
mm: hugetlb: Adopt memcg try-commit-cancel protocol
mm: memcontrol: Remove now-unused function mem_cgroup_charge_hugetlb
mm: hugetlb: Refactor out hugetlb_alloc_folio()
include/linux/hugetlb.h | 11 ++
include/linux/memcontrol.h | 21 +++-
mm/hugetlb.c | 228 +++++++++++++++++++++----------------
mm/memcontrol.c | 77 ++++++++-----
4 files changed, 212 insertions(+), 125 deletions(-)
base-commit: db9571a66156bfbc0273e66e5c77923869bda547
--
2.53.0.310.g728cabbaf7-goog
On Wed, 11 Feb 2026 16:37:11 -0800 Ackerley Tng <ackerleytng@google.com> wrote:

Hi Ackerley, I hope you're doing well!

[...snip...]

> I would like to get feedback on:
>
> 1. Opening up HugeTLB's allocation for more generic use

I'm not entirely familiar with guest_memfd, so please excuse my
ignorance if I'm missing anything obvious. But I'm wondering what
HugeTLB offers that other hugepage solutions cannot offer for
guest_memfd, if the goal of this series is to decouple it from
HugeTLBfs.

> 2. Reverting and re-adopting the try-commit-cancel protocol for memory
>    charging

On the second point, I am wondering if reintroducing the
try-commit-cancel protocol is tied to factoring out
hugetlb_alloc_folio(). When I removed the protocol a while back, the
justification was that, for the most part, grabbing a hugetlb folio
was a relatively cheap & fast operation, since hugetlb mostly operates
out of a preallocated pool. So the cost of being wrong, going above
the limit, and having to return the hugetlb folio was also relatively
low.

It seems like this patch series introduces some new paths for hugetlb
pages to be consumed (specifically, without a reservation or VMA). I
imagine that these new paths make the slowpath for hugetlb more
frequent, which makes the cost of assuming that the memcg limit is OK
higher? I think explicitly spelling this out in the justification for
reintroducing the charging protocol could be helpful.

Thank you for the series, again. I hope you have a great day!
Joshua

> To see how hugetlb_alloc_folio() is used by guest_memfd, the most
> recent patch series that uses this more generic HugeTLB allocation
> routine is at [1], and a newer revision of that patch series is at
> [2].
>
> Independently of guest_memfd, I believe this change is useful in
> simplifying alloc_hugetlb_folio(). alloc_hugetlb_folio() was so
> coupled to a VMA that even HugeTLBfs allocates HugeTLB folios using a
> pseudo-VMA.
[...snip...]
Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Wed, 11 Feb 2026 16:37:11 -0800 Ackerley Tng <ackerleytng@google.com> wrote:
>
> Hi Ackerley, I hope you're doing well!
>
> [...snip...]
>
>> I would like to get feedback on:
>>
>> 1. Opening up HugeTLB's allocation for more generic use
>
> I'm not entirely familiar with guest_memfd, so please excuse my
> ignorance if I'm missing anything obvious.

Happy to take questions! Thank you for your thoughts and reviews!

> But I'm wondering what HugeTLB offers that other hugepage solutions
> cannot offer for guest_memfd, if the goal of this series is to
> decouple it from HugeTLBfs.

The one other huge page source that we've explored is THP pages from
the buddy allocator. Compared to HugeTLB, huge pages from the buddy
allocator:

+ Have a maximum size of 2M.
+ Are not guaranteed the way HugeTLB pages are: HugeTLB pages are
  allocated at boot, and guest_memfd can reserve pages at guest_memfd
  creation time.
+ Are slower to allocate: allocation of HugeTLB pages is really fast,
  just dequeuing from a preallocated pool.

The last reason to use HugeTLB is not any inherent advantage of
HugeTLB over other sources of huge pages, but
administrative/scheduling purposes:

Given that existing non-guest_memfd workloads are already using
HugeTLB, for optimal scheduling, machine memory is already carved up
into HugeTLB pages for these workloads. Workloads that require using
guest_memfd (like Confidential VMs) must also use HugeTLB to
participate in optimal workload scheduling across machines.

>> 2. Reverting and re-adopting the try-commit-cancel protocol for memory
>>    charging
>
> On the second point, I am wondering if reintroducing the
> try-commit-cancel protocol is tied to factoring out
> hugetlb_alloc_folio(). When I removed the protocol a while back, the
> justification was that, for the most part, grabbing a hugetlb folio
> was a relatively cheap & fast operation, since hugetlb mostly
> operates out of a preallocated pool.
>
> So the cost of being wrong, going above the limit, and having to
> return the hugetlb folio was also relatively low.

Thanks for this! I saw your patch to just optimistically grab a
HugeTLB page :) For that patch, the primary reason was to simplify the
logic, and the simplification was justifiable because grabbing a folio
is cheap, right? (And so grabbing a folio being cheap wasn't a reason
in itself?)

> It seems like this patch series introduces some new paths for hugetlb
> pages to be consumed (specifically, without a reservation or VMA). I
> imagine that these new paths make the slowpath for hugetlb more
> frequent, which makes the cost of assuming that the memcg limit is OK
> higher? I think explicitly spelling this out in the justification for
> reintroducing the charging protocol could be helpful.

Yes, I should have done that. Will copy the following to the next
revision.

The main reason is that reintroducing the charging protocol is the
clearest way (for me) to cleanly refactor out hugetlb_alloc_folio()
without worrying about the edge cases around HugeTLB reservations and
charging.

If I didn't reintroduce the charging protocol, I would have to depend
on freeing the new hugetlb folio on memcg charging failure. The
freeing in turn depends on the subpool correctly being set in the
folio, and the presence of the subpool influences (in
free_huge_folio()) whether the reservation is returned to the global
hstate. Aaannnd... there's also a hugetlb_restore_reserve flag that
controls whether to return the folio to the subpool (and the hstate).
I find folio_clear_hugetlb_restore_reserve() on certain code paths
kind of magical/unexplained too.

I would rather iron out those charging and reservation details
separately from this series (with more testing support).

On the other hand, reintroducing the charging protocol has the benefit
of avoiding allocations (not just dequeuing, if surplus HugeTLB pages
are required) if the memcg limit is hit. Also, if the original reason
for removing the protocol was to simplify the code, refactoring out
hugetlb_alloc_folio() also simplifies the code, and I think it's
actually nice that memcg charging is done the same way as the other
two (h_cg and h_cg_rsvd charging). After hugetlb_alloc_folio() is
refactored out, the gotos make all three charging systems consistent
and symmetric, which I think is nice to have :)

I hope the consistent/symmetric charging among all 3 systems is
welcome, what do you think?

> Thank you for the series, again. I hope you have a great day!
> Joshua

>> To see how hugetlb_alloc_folio() is used by guest_memfd, the most
>> recent patch series that uses this more generic HugeTLB allocation
>> routine is at [1], and a newer revision of that patch series is at
>> [2].
>>
>> Independently of guest_memfd, I believe this change is useful in
>> simplifying alloc_hugetlb_folio(). alloc_hugetlb_folio() was so
>> coupled to a VMA that even HugeTLBfs allocates HugeTLB folios using a
>> pseudo-VMA.
[...snip...]
On Wed, 25 Feb 2026 19:37:04 -0800 Ackerley Tng <ackerleytng@google.com> wrote:

> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>
> > On Wed, 11 Feb 2026 16:37:11 -0800 Ackerley Tng <ackerleytng@google.com> wrote:
> >
> > Hi Ackerley, I hope you're doing well!
> >
> > [...snip...]
> >
> >> I would like to get feedback on:
> >>
> >> 1. Opening up HugeTLB's allocation for more generic use
> >
> > I'm not entirely familiar with guest_memfd, so please excuse my
> > ignorance if I'm missing anything obvious.
>
> Happy to take questions! Thank you for your thoughts and reviews!

Of course, thank you for your work, Ackerley!

> > But I'm wondering what HugeTLB offers that other hugepage solutions
> > cannot offer for guest_memfd, if the goal of this series is to
> > decouple it from HugeTLBfs.
>
> The one other huge page source that we've explored is THP pages from
> the buddy allocator. Compared to HugeTLB, huge pages from the buddy
> allocator:
>
> + Have a maximum size of 2M.
> + Are not guaranteed the way HugeTLB pages are: HugeTLB pages are
>   allocated at boot, and guest_memfd can reserve pages at guest_memfd
>   creation time.
> + Are slower to allocate: allocation of HugeTLB pages is really fast,
>   just dequeuing from a preallocated pool.

All of these make sense. Just wanted to know if guest_memfd had any
unique usecases for HugeTLB that normal hugetlbfs didn't have.

> The last reason to use HugeTLB is not any inherent advantage of
> HugeTLB over other sources of huge pages, but
> administrative/scheduling purposes:
>
> Given that existing non-guest_memfd workloads are already using
> HugeTLB, for optimal scheduling, machine memory is already carved up
> into HugeTLB pages for these workloads. Workloads that require using
> guest_memfd (like Confidential VMs) must also use HugeTLB to
> participate in optimal workload scheduling across machines.
>
> >> 2. Reverting and re-adopting the try-commit-cancel protocol for memory
> >>    charging
> >
> > On the second point, I am wondering if reintroducing the
> > try-commit-cancel protocol is tied to factoring out
> > hugetlb_alloc_folio(). When I removed the protocol a while back, the
> > justification was that, for the most part, grabbing a hugetlb folio
> > was a relatively cheap & fast operation, since hugetlb mostly
> > operates out of a preallocated pool.
> >
> > So the cost of being wrong, going above the limit, and having to
> > return the hugetlb folio was also relatively low.
>
> Thanks for this! I saw your patch to just optimistically grab a
> HugeTLB page :) For that patch, the primary reason was to simplify
> the logic, and the simplification was justifiable because grabbing a
> folio is cheap, right? (And so grabbing a folio being cheap wasn't a
> reason in itself?)

Yes, exactly!

> > It seems like this patch series introduces some new paths for
> > hugetlb pages to be consumed (specifically, without a reservation
> > or VMA). I imagine that these new paths make the slowpath for
> > hugetlb more frequent, which makes the cost of assuming that the
> > memcg limit is OK higher? I think explicitly spelling this out in
> > the justification for reintroducing the charging protocol could be
> > helpful.
>
> Yes, I should have done that. Will copy the following to the next
> revision.

Thank you for considering!

> The main reason is that reintroducing the charging protocol is the
> clearest way (for me) to cleanly refactor out hugetlb_alloc_folio()
> without worrying about the edge cases around HugeTLB reservations and
> charging.
>
> If I didn't reintroduce the charging protocol, I would have to depend
> on freeing the new hugetlb folio on memcg charging failure. The
> freeing in turn depends on the subpool correctly being set in the
> folio, and the presence of the subpool influences (in
> free_huge_folio()) whether the reservation is returned to the global
> hstate. Aaannnd... there's also a hugetlb_restore_reserve flag that
> controls whether to return the folio to the subpool (and the hstate).
> I find folio_clear_hugetlb_restore_reserve() on certain code paths
> kind of magical/unexplained too.

I see, if it makes the code simpler to introduce the protocol again,
I see no reason why we shouldn't revert the patch :-)

> I would rather iron out those charging and reservation details
> separately from this series (with more testing support).
>
> On the other hand, reintroducing the charging protocol has the
> benefit of avoiding allocations (not just dequeuing, if surplus
> HugeTLB pages are required) if the memcg limit is hit. Also, if the
> original reason for removing the protocol was to simplify the code,
> refactoring out hugetlb_alloc_folio() also simplifies the code, and I
> think it's actually nice that memcg charging is done the same way as
> the other two (h_cg and h_cg_rsvd charging). After
> hugetlb_alloc_folio() is refactored out, the gotos make all three
> charging systems consistent and symmetric, which I think is nice to
> have :)
>
> I hope the consistent/symmetric charging among all 3 systems is
> welcome, what do you think?

For the hugetlbfs case, the path to allocate a HugeTLB page on demand
makes sense, so I definitely see the argument for avoiding
allocations. Does guest_memfd also have a path to allocate a HugeTLB
page outside of the boottime reservations? In that case I think it
would be nice to clarify that the allocation failure case optimization
is also for guest_memfd, not only for hugetlbfs.

Symmetric charging is definitely welcome :-) All of your reasons make
sense to me, I just wanted to ask and make sure.

Thanks for your thoughts! I hope you have a great day!!
Joshua
Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Wed, 25 Feb 2026 19:37:04 -0800 Ackerley Tng <ackerleytng@google.com> wrote:
>
>> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>>
>> > On Wed, 11 Feb 2026 16:37:11 -0800 Ackerley Tng <ackerleytng@google.com> wrote:
>> >
>> > Hi Ackerley, I hope you're doing well!
>> >
>> > [...snip...]
>> >
>> >> I would like to get feedback on:
>> >>
>> >> 1. Opening up HugeTLB's allocation for more generic use
>> >
>> > I'm not entirely familiar with guest_memfd, so please excuse my
>> > ignorance if I'm missing anything obvious.
>>
>> Happy to take questions! Thank you for your thoughts and reviews!
>
> Of course, thank you for your work, Ackerley!
>
>> > But I'm wondering what HugeTLB offers that other hugepage
>> > solutions cannot offer for guest_memfd, if the goal of this series
>> > is to decouple it from HugeTLBfs.
>>
>> The one other huge page source that we've explored is THP pages from
>> the buddy allocator. Compared to HugeTLB, huge pages from the buddy
>> allocator:
>>
>> + Have a maximum size of 2M.
>> + Are not guaranteed the way HugeTLB pages are: HugeTLB pages are
>>   allocated at boot, and guest_memfd can reserve pages at
>>   guest_memfd creation time.
>> + Are slower to allocate: allocation of HugeTLB pages is really
>>   fast, just dequeuing from a preallocated pool.
>
> All of these make sense. Just wanted to know if guest_memfd had any
> unique usecases for HugeTLB that normal hugetlbfs didn't have.

IIUC HugeTLB was meant to make huge pages available to userspace for
performance reasons. guest_memfd wants HugeTLB for the same reason,
just for virtualization use cases. So nope, I don't think there are
any specifically unique usecases.

These are the differences I can think of between guest_memfd's and
HugeTLBfs's usage of HugeTLB:

+ guest_memfd may split HugeTLB pages to individual struct pages
  during guest_memfd's ownership of the HugeTLB page. (The pages will
  be merged before returning them to HugeTLB.)
+ guest_memfd will provide an option to remove memory in guest_memfd
  ownership from the kernel direct map. I think HugeTLB pages are
  always in the direct map (?)
+ guest_memfd doesn't want to use HugeTLB surplus pages, for now.
+ guest_memfd will reserve pages at fd creation time instead of at
  mmap time. Reservation is done by creating a subpool, so guest_memfd
  doesn't use resv_map.

>> The last reason to use HugeTLB is not any inherent advantage of
>> HugeTLB over other sources of huge pages, but
>> administrative/scheduling purposes:
>>
>> Given that existing non-guest_memfd workloads are already using
>> HugeTLB, for optimal scheduling, machine memory is already carved up
>> into HugeTLB pages for these workloads. Workloads that require using
>> guest_memfd (like Confidential VMs) must also use HugeTLB to
>> participate in optimal workload scheduling across machines.
>>
>> [...snip...]
>>
>> On the other hand, reintroducing the charging protocol has the
>> benefit of avoiding allocations (not just dequeuing, if surplus
>> HugeTLB pages are required) if the memcg limit is hit. Also, if the
>> original reason for removing the protocol was to simplify the code,
>> refactoring out hugetlb_alloc_folio() also simplifies the code, and
>> I think it's actually nice that memcg charging is done the same way
>> as the other two (h_cg and h_cg_rsvd charging). After
>> hugetlb_alloc_folio() is refactored out, the gotos make all three
>> charging systems consistent and symmetric, which I think is nice to
>> have :)
>>
>> I hope the consistent/symmetric charging among all 3 systems is
>> welcome, what do you think?
>
> For the hugetlbfs case, the path to allocate a HugeTLB page on demand
> makes sense, so I definitely see the argument for avoiding
> allocations. Does guest_memfd also have a path to allocate a HugeTLB
> page outside of the boottime reservations? In that case I think it
> would be nice to clarify that the allocation failure case
> optimization is also for guest_memfd, not only for hugetlbfs.

For now, guest_memfd actually doesn't want to use surplus pages, so
guest_memfd won't be allocating pages outside of boottime
reservations.

> Symmetric charging is definitely welcome :-) All of your reasons make
> sense to me, I just wanted to ask and make sure.

This change is mostly for (an alternate form of) simplicity :)

> Thanks for your thoughts! I hope you have a great day!!
> Joshua