Hi,
The motivation for this patch series is guest_memfd, which would like
to use HugeTLB as a generic source of huge pages but not adopt
HugeTLB's reservation at mmap() time.
By refactoring alloc_hugetlb_folio() and some dependent functions,
there is now an option to allocate HugeTLB folios without providing a
VMA. Specifically, HugeTLB allocation used to depend on the VMA to:
1. Look up reservations in the resv_map
2. Get the mempolicy (mpol), stored at vma->vm_policy
This refactoring provides hugetlb_alloc_folio(), which focuses on just
the allocation itself, and associated memory and HugeTLB charging
(cgroups). alloc_hugetlb_folio() still handles reservations in the
resv_map and subpools.
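As a rough sketch of the intended split, the flow might look something like the following. This is a userspace model only: the structs and helpers below are made-up stand-ins for illustration, not the kernel's actual types or the real function bodies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for kernel types; not the real definitions. */
struct folio { int charged; };
struct vma { bool has_reservation; };

static struct folio pool_folio;

/*
 * Model of the proposed hugetlb_alloc_folio(): just the allocation
 * plus memcg/hugetlb-cgroup charging -- no VMA, no resv_map.
 */
static struct folio *hugetlb_alloc_folio_model(bool charge_ok)
{
	if (!charge_ok)
		return NULL;          /* charge failed: nothing to undo */
	pool_folio.charged = 1;       /* charge is bound to the folio */
	return &pool_folio;
}

/*
 * Model of alloc_hugetlb_folio(): keeps the VMA-specific work
 * (resv_map lookup, subpool accounting) and delegates the actual
 * allocation and charging.
 */
static struct folio *alloc_hugetlb_folio_model(struct vma *vma,
					       bool charge_ok)
{
	/* Stand-in for the resv_map lookup done via the VMA. */
	bool consumed_reservation = vma->has_reservation;
	struct folio *folio = hugetlb_alloc_folio_model(charge_ok);

	if (folio && consumed_reservation)
		vma->has_reservation = false; /* resv/subpool bookkeeping */
	return folio;
}
```

The point of the model is only the shape of the call graph: a VMA-free inner function that callers like guest_memfd could use directly, wrapped by a VMA-aware outer function for the existing paths.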
Regarding naming, I'm definitely open to alternative names :) I chose
hugetlb_alloc_folio() because I see this function as a general
allocation function provided by the HugeTLB subsystem (hence the
hugetlb_ prefix). I intend for alloc_hugetlb_folio() to later be
refactored into a static function for use just by HugeTLB, and
HugeTLBfs should probably use hugetlb_alloc_folio() directly.
I would like to get feedback on:
1. Opening up HugeTLB's allocation for more generic use
2. Reverting and re-adopting the try-commit-cancel protocol for memory
charging
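For reviewers less familiar with the protocol, the try-commit-cancel flow can be modeled in userspace roughly as follows. This is an illustrative toy, not the memcg API; the counter, limit, and all function names here are made up.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy counter standing in for a memcg's page counter. */
static long charged_pages;
static const long limit = 4;

/* "Try": provisionally charge; fail with no side effects if over limit. */
static bool memcg_try_charge(long nr_pages)
{
	if (charged_pages + nr_pages > limit)
		return false;
	charged_pages += nr_pages;
	return true;
}

/* "Cancel": undo a provisional charge when allocation fails. */
static void memcg_cancel_charge(long nr_pages)
{
	charged_pages -= nr_pages;
}

/*
 * Model of the try-commit-cancel flow: the charge is taken before the
 * (possibly expensive) allocation, so hitting the memcg limit avoids
 * the allocation entirely.
 */
static bool alloc_with_try_commit_cancel(long nr_pages, bool alloc_succeeds)
{
	if (!memcg_try_charge(nr_pages))
		return false;           /* at limit: skip allocation */
	if (!alloc_succeeds) {
		memcg_cancel_charge(nr_pages);
		return false;           /* allocation failed: roll back */
	}
	/* "Commit": in the kernel this binds the charge to the folio. */
	return true;
}
```

The contrast with the optimistic scheme is that there, the folio is grabbed first and returned if the charge fails; try-commit-cancel pays the bookkeeping up front instead.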
To see how hugetlb_alloc_folio() is used by guest_memfd, the most
recent patch series that uses this more generic HugeTLB allocation
routine is at [1], and a newer revision of that patch series is at
[2].
Independently of guest_memfd, I believe this change is useful in
simplifying alloc_hugetlb_folio(). alloc_hugetlb_folio() was so
coupled to a VMA that even HugeTLBfs allocates HugeTLB folios using a
pseudo-VMA.
[1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
[2] https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
Ackerley Tng (7):
mm: hugetlb: Consolidate interpretation of gbl_chg within
alloc_hugetlb_folio()
mm: hugetlb: Move mpol interpretation out of
alloc_buddy_hugetlb_folio_with_mpol()
mm: hugetlb: Move mpol interpretation out of
dequeue_hugetlb_folio_vma()
Revert "memcg/hugetlb: remove memcg hugetlb try-commit-cancel
protocol"
mm: hugetlb: Adopt memcg try-commit-cancel protocol
mm: memcontrol: Remove now-unused function mem_cgroup_charge_hugetlb
mm: hugetlb: Refactor out hugetlb_alloc_folio()
include/linux/hugetlb.h | 11 ++
include/linux/memcontrol.h | 21 +++-
mm/hugetlb.c | 228 +++++++++++++++++++++----------------
mm/memcontrol.c | 77 ++++++++-----
4 files changed, 212 insertions(+), 125 deletions(-)
base-commit: db9571a66156bfbc0273e66e5c77923869bda547
--
2.53.0.310.g728cabbaf7-goog
On Wed, 11 Feb 2026 16:37:11 -0800 Ackerley Tng <ackerleytng@google.com> wrote:

Hi Ackerley, I hope you're doing well!

[...snip...]

> I would like to get feedback on:
>
> 1. Opening up HugeTLB's allocation for more generic use

I'm not entirely familiar with guest_memfd, so please excuse my
ignorance if I'm missing anything obvious. But I'm wondering what
HugeTLB offers that other hugepage solutions cannot offer for
guest_memfd, if the goal of this series is to decouple it from
HugeTLBfs.

> 2. Reverting and re-adopting the try-commit-cancel protocol for memory
>    charging

On the second point, I am wondering if reintroducing the
try-commit-cancel protocol is tied to factoring out
hugetlb_alloc_folio(). When I removed the protocol a while back, the
justification was that, for the most part, grabbing a hugetlb folio
was a relatively cheap & fast operation, since hugetlb mostly operates
out of a preallocated pool. So the cost of being wrong, going above
the limit, and having to return the hugetlb folio was also relatively
low.

It seems like this patch series introduces some new paths for hugetlb
pages to be consumed (specifically, without a reservation or VMA). I
imagine that these new paths make the slowpath for hugetlb more
frequent, which makes the cost of assuming that the memcg limit is OK
higher? I think explicitly spelling this out in the justification for
reintroducing the charging protocol could be helpful.

Thank you for the series, again. I hope you have a great day!
Joshua

> To see how hugetlb_alloc_folio() is used by guest_memfd, the most
> recent patch series that uses this more generic HugeTLB allocation
> routine is at [1], and a newer revision of that patch series is at
> [2].
>
> Independently of guest_memfd, I believe this change is useful in
> simplifying alloc_hugetlb_folio(). alloc_hugetlb_folio() was so
> coupled to a VMA that even HugeTLBfs allocates HugeTLB folios using a
> pseudo-VMA.
[...snip...]
Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Wed, 11 Feb 2026 16:37:11 -0800 Ackerley Tng <ackerleytng@google.com> wrote:
>
> Hi Ackerley, I hope you're doing well!
>
> [...snip...]
>
>> I would like to get feedback on:
>>
>> 1. Opening up HugeTLB's allocation for more generic use
>
> I'm not entirely familiar with guest_memfd, so please excuse my
> ignorance if I'm missing anything obvious.

Happy to take questions! Thank you for your thoughts and reviews!

> But I'm wondering what HugeTLB offers that other hugepage solutions
> cannot offer for guest_memfd, if the goal of this series is to
> decouple it from HugeTLBfs.

The one other huge page source that we've explored is THP pages from
the buddy allocator. Compared to HugeTLB, huge pages from the buddy
allocator:

+ Have a maximum size of 2M.
+ Are not guaranteed the way HugeTLB pages are: HugeTLB pages are
  allocated at boot, and guest_memfd can reserve pages at guest_memfd
  creation time.
+ Are slower to allocate: allocation of HugeTLB pages is really fast,
  just dequeuing from a preallocated pool.

The last reason to use HugeTLB is not any inherent advantage of
HugeTLB over other sources of huge pages, but
administrative/scheduling purposes:

Given that existing non-guest_memfd workloads are already using
HugeTLB, for optimal scheduling, machine memory is already carved up
into HugeTLB pages for these workloads. Workloads that require using
guest_memfd (like Confidential VMs) must also use HugeTLB to
participate in optimal workload scheduling across machines.

>> 2. Reverting and re-adopting the try-commit-cancel protocol for memory
>>    charging
>
> On the second point, I am wondering if reintroducing the
> try-commit-cancel protocol is tied to factoring out
> hugetlb_alloc_folio(). When I removed the protocol a while back, the
> justification was that, for the most part, grabbing a hugetlb folio
> was a relatively cheap & fast operation, since hugetlb mostly
> operates out of a preallocated pool.
>
> So the cost of being wrong, going above the limit, and having to
> return the hugetlb folio was also relatively low.

Thanks for this! I saw your patch to just optimistically grab a
HugeTLB page :) For that patch, the primary reason was to simplify the
logic, and the simplification was justifiable because grabbing a folio
is cheap, right? (And so grabbing a folio being cheap wasn't a reason
in itself?)

> It seems like this patch series introduces some new paths for hugetlb
> pages to be consumed (specifically, without a reservation or VMA). I
> imagine that these new paths make the slowpath for hugetlb more
> frequent, which makes the cost of assuming that the memcg limit is OK
> higher? I think explicitly spelling this out in the justification for
> reintroducing the charging protocol could be helpful.

Yes, I should have done that. Will copy the following to the next
revision.

The main reason is that reintroducing the charging protocol is the
clearest way (for me) to cleanly refactor out hugetlb_alloc_folio()
without worrying about the edge cases around HugeTLB reservations and
charging.

If I didn't reintroduce the charging protocol, I would have to depend
on freeing the new hugetlb folio on memcg charging failure. The
freeing in turn depends on the subpool correctly being set in the
folio, and the presence of the subpool influences (in
free_huge_folio()) whether the reservation is returned to the global
hstate. Aaannnd... there's also a hugetlb_restore_reserve flag that
controls whether to return the folio to the subpool (and the hstate).
I find folio_clear_hugetlb_restore_reserve() on certain code paths
kind of magical/unexplained too.

I would rather iron out those charging and reservation details
separately from this series (with more testing support).

On the other hand, reintroducing the charging protocol has the benefit
of avoiding allocations (not just dequeuing, if surplus HugeTLB pages
are required) if the memcg limit is hit. Also, if the original reason
for removing the protocol was to simplify the code, refactoring out
hugetlb_alloc_folio() also simplifies the code, and I think it's
actually nice that memcg charging is done the same way as the other
two (h_cg and h_cg_rsvd charging). After hugetlb_alloc_folio() is
refactored out, the gotos make all three charging systems consistent
and symmetric, which I think is nice to have :)

I hope the consistent/symmetric charging among all 3 systems is
welcome, what do you think?

> Thank you for the series, again. I hope you have a great day!
> Joshua

>> To see how hugetlb_alloc_folio() is used by guest_memfd, the most
>> recent patch series that uses this more generic HugeTLB allocation
>> routine is at [1], and a newer revision of that patch series is at
>> [2].
>>
>> Independently of guest_memfd, I believe this change is useful in
>> simplifying alloc_hugetlb_folio(). alloc_hugetlb_folio() was so
>> coupled to a VMA that even HugeTLBfs allocates HugeTLB folios using a
>> pseudo-VMA.
[...snip...]
On Wed, 25 Feb 2026 19:37:04 -0800 Ackerley Tng <ackerleytng@google.com> wrote:

> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>
> > On Wed, 11 Feb 2026 16:37:11 -0800 Ackerley Tng <ackerleytng@google.com> wrote:
> >
> > Hi Ackerley, I hope you're doing well!
> >
> > [...snip...]
> >
> >> I would like to get feedback on:
> >>
> >> 1. Opening up HugeTLB's allocation for more generic use
> >
> > I'm not entirely familiar with guest_memfd, so please excuse my
> > ignorance if I'm missing anything obvious.
>
> Happy to take questions! Thank you for your thoughts and reviews!

Of course, thank you for your work, Ackerley!

> > But I'm wondering what HugeTLB offers that other hugepage solutions
> > cannot offer for guest_memfd, if the goal of this series is to
> > decouple it from HugeTLBfs.
>
> The one other huge page source that we've explored is THP pages from
> the buddy allocator. Compared to HugeTLB, huge pages from the buddy
> allocator:
>
> + Have a maximum size of 2M.
> + Are not guaranteed the way HugeTLB pages are: HugeTLB pages are
>   allocated at boot, and guest_memfd can reserve pages at guest_memfd
>   creation time.
> + Are slower to allocate: allocation of HugeTLB pages is really fast,
>   just dequeuing from a preallocated pool.

All of these make sense. Just wanted to know if guest_memfd had any
unique usecases for HugeTLB that normal hugetlbfs didn't have.

> The last reason to use HugeTLB is not any inherent advantage of
> HugeTLB over other sources of huge pages, but
> administrative/scheduling purposes:
>
> Given that existing non-guest_memfd workloads are already using
> HugeTLB, for optimal scheduling, machine memory is already carved up
> into HugeTLB pages for these workloads. Workloads that require using
> guest_memfd (like Confidential VMs) must also use HugeTLB to
> participate in optimal workload scheduling across machines.
>
> >> 2. Reverting and re-adopting the try-commit-cancel protocol for memory
> >>    charging
> >
> > On the second point, I am wondering if reintroducing the
> > try-commit-cancel protocol is tied to factoring out
> > hugetlb_alloc_folio(). When I removed the protocol a while back, the
> > justification was that, for the most part, grabbing a hugetlb folio
> > was a relatively cheap & fast operation, since hugetlb mostly
> > operates out of a preallocated pool.
> >
> > So the cost of being wrong, going above the limit, and having to
> > return the hugetlb folio was also relatively low.
>
> Thanks for this! I saw your patch to just optimistically grab a
> HugeTLB page :) For that patch, the primary reason was to simplify
> the logic, and the simplification was justifiable because grabbing a
> folio is cheap, right? (And so grabbing a folio being cheap wasn't a
> reason in itself?)

Yes, exactly!

> > It seems like this patch series introduces some new paths for
> > hugetlb pages to be consumed (specifically, without a reservation
> > or VMA). I imagine that these new paths make the slowpath for
> > hugetlb more frequent, which makes the cost of assuming that the
> > memcg limit is OK higher? I think explicitly spelling this out in
> > the justification for reintroducing the charging protocol could be
> > helpful.
>
> Yes, I should have done that. Will copy the following to the next
> revision.

Thank you for considering!

> The main reason is that reintroducing the charging protocol is the
> clearest way (for me) to cleanly refactor out hugetlb_alloc_folio()
> without worrying about the edge cases around HugeTLB reservations and
> charging.
>
> If I didn't reintroduce the charging protocol, I would have to depend
> on freeing the new hugetlb folio on memcg charging failure. The
> freeing in turn depends on the subpool correctly being set in the
> folio, and the presence of the subpool influences (in
> free_huge_folio()) whether the reservation is returned to the global
> hstate. Aaannnd... there's also a hugetlb_restore_reserve flag that
> controls whether to return the folio to the subpool (and the hstate).
> I find folio_clear_hugetlb_restore_reserve() on certain code paths
> kind of magical/unexplained too.

I see, if it makes the code simpler to introduce the protocol again,
I see no reason why we shouldn't revert the patch :-)

> I would rather iron out those charging and reservation details
> separately from this series (with more testing support).
>
> On the other hand, reintroducing the charging protocol has the
> benefit of avoiding allocations (not just dequeuing, if surplus
> HugeTLB pages are required) if the memcg limit is hit. Also, if the
> original reason for removing the protocol was to simplify the code,
> refactoring out hugetlb_alloc_folio() also simplifies the code, and I
> think it's actually nice that memcg charging is done the same way as
> the other two (h_cg and h_cg_rsvd charging). After
> hugetlb_alloc_folio() is refactored out, the gotos make all three
> charging systems consistent and symmetric, which I think is nice to
> have :)
>
> I hope the consistent/symmetric charging among all 3 systems is
> welcome, what do you think?

For the hugetlbfs case, the path to allocate a HugeTLB page on demand
makes sense, so I definitely see the argument for avoiding
allocations. Does guest_memfd also have a path to allocate a HugeTLB
page outside of the boottime reservations? In that case I think it
would be nice to clarify that the allocation failure case optimization
is also for guest_memfd, not only for hugetlbfs.

Symmetric charging is definitely welcome :-) All of your reasons make
sense to me, I just wanted to ask and make sure.

Thanks for your thoughts! I hope you have a great day!!
Joshua
Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Wed, 25 Feb 2026 19:37:04 -0800 Ackerley Tng <ackerleytng@google.com> wrote:
>
>> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>>
>> > On Wed, 11 Feb 2026 16:37:11 -0800 Ackerley Tng <ackerleytng@google.com> wrote:
>> >
>> > Hi Ackerley, I hope you're doing well!
>> >
>> > [...snip...]
>> >
>> >> I would like to get feedback on:
>> >>
>> >> 1. Opening up HugeTLB's allocation for more generic use
>> >
>> > I'm not entirely familiar with guest_memfd, so please excuse my
>> > ignorance if I'm missing anything obvious.
>>
>> Happy to take questions! Thank you for your thoughts and reviews!
>
> Of course, thank you for your work, Ackerley!
>
>> > But I'm wondering what HugeTLB offers that other hugepage
>> > solutions cannot offer for guest_memfd, if the goal of this series
>> > is to decouple it from HugeTLBfs.
>>
>> The one other huge page source that we've explored is THP pages from
>> the buddy allocator. Compared to HugeTLB, huge pages from the buddy
>> allocator:
>>
>> + Have a maximum size of 2M.
>> + Are not guaranteed the way HugeTLB pages are: HugeTLB pages are
>>   allocated at boot, and guest_memfd can reserve pages at
>>   guest_memfd creation time.
>> + Are slower to allocate: allocation of HugeTLB pages is really
>>   fast, just dequeuing from a preallocated pool.
>
> All of these make sense. Just wanted to know if guest_memfd had any
> unique usecases for HugeTLB that normal hugetlbfs didn't have.

IIUC HugeTLB was meant to make huge pages available to userspace for
performance reasons. guest_memfd wants HugeTLB for the same reason,
just for virtualization use cases. So nope, I don't think there are
any specifically unique usecases.

These are the differences I can think of between guest_memfd's and
HugeTLBfs's usage of HugeTLB:

+ guest_memfd may split HugeTLB pages to individual struct pages
  during guest_memfd's ownership of the HugeTLB page. (The pages will
  be merged before returning them to HugeTLB.)
+ guest_memfd will provide an option to remove memory in guest_memfd
  ownership from the kernel direct map. I think HugeTLB pages are
  always in the direct map (?)
+ guest_memfd doesn't want to use HugeTLB surplus pages, for now.
+ guest_memfd will reserve pages at fd creation time instead of at
  mmap time. Reservation is done by creating a subpool, so guest_memfd
  doesn't use resv_map.

>> The last reason to use HugeTLB is not any inherent advantage of
>> HugeTLB over other sources of huge pages, but
>> administrative/scheduling purposes:
>>
>> Given that existing non-guest_memfd workloads are already using
>> HugeTLB, for optimal scheduling, machine memory is already carved up
>> into HugeTLB pages for these workloads. Workloads that require using
>> guest_memfd (like Confidential VMs) must also use HugeTLB to
>> participate in optimal workload scheduling across machines.
>>
>> [...snip...]
>>
>> On the other hand, reintroducing the charging protocol has the
>> benefit of avoiding allocations (not just dequeuing, if surplus
>> HugeTLB pages are required) if the memcg limit is hit. Also, if the
>> original reason for removing the protocol was to simplify the code,
>> refactoring out hugetlb_alloc_folio() also simplifies the code, and
>> I think it's actually nice that memcg charging is done the same way
>> as the other two (h_cg and h_cg_rsvd charging). After
>> hugetlb_alloc_folio() is refactored out, the gotos make all three
>> charging systems consistent and symmetric, which I think is nice to
>> have :)
>>
>> I hope the consistent/symmetric charging among all 3 systems is
>> welcome, what do you think?
>
> For the hugetlbfs case, the path to allocate a HugeTLB page on demand
> makes sense, so I definitely see the argument for avoiding
> allocations. Does guest_memfd also have a path to allocate a HugeTLB
> page outside of the boottime reservations? In that case I think it
> would be nice to clarify that the allocation failure case
> optimization is also for guest_memfd, not only for hugetlbfs.

For now, guest_memfd actually doesn't want to use surplus pages, so
guest_memfd won't be allocating pages outside of boottime
reservations.

> Symmetric charging is definitely welcome :-) All of your reasons make
> sense to me, I just wanted to ask and make sure.

This change is mostly for (an alternate form of) simplicity :)

> Thanks for your thoughts! I hope you have a great day!!
> Joshua