mm/hugetlb.c | 21 ++++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-)
alloc_hugetlb_folio() calls hugepage_subpool_get_pages() when map_chg
is set. For subpools with max_hpages, that increments used_hpages.
If the later hugetlb cgroup charge fails, the unwind must undo that
charge even when gbl_chg > 0.
hugepage_subpool_put_pages() can also restore rsv_hpages if concurrent
frees move used_hpages below min_hpages between the get and put. When
that happens on the gbl_chg > 0 path, restore the matching global
reservation as well.
Skip the gbl_chg > 0 put when max_hpages is unset. For a min_size-only
subpool, get_pages() did not change subpool state and put_pages() would
create a false reservation.
Signed-off-by: Zhao Li <enderaoelyther@gmail.com>
---
Changes in v2:
- Handle rsv_hpages restoration when racing frees cross min_hpages.
- Skip gbl_chg > 0 put_pages() when max_hpages is unset.
mm/hugetlb.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f24bf49be047e..4065d66fdcb5c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3026,12 +3026,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
h_cg);
out_subpool_put:
/*
- * put page to subpool iff the quota of subpool's rsv_hpages is used
- * during hugepage_subpool_get_pages.
+ * map_chg means hugepage_subpool_get_pages() succeeded above.
+ * If max_hpages accounting was touched, undo it. If racing frees
+ * moved the subpool below min_hpages, the put path may restore a
+ * subpool reservation. Restore the matching global reservation too.
*/
- if (map_chg && !gbl_chg) {
- gbl_reserve = hugepage_subpool_put_pages(spool, 1);
- hugetlb_acct_memory(h, -gbl_reserve);
+ if (map_chg) {
+ if (!gbl_chg) {
+ gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+ hugetlb_acct_memory(h, -gbl_reserve);
+ } else if (spool && spool->max_hpages != -1) {
+ gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+ if (!gbl_reserve) {
+ spin_lock_irq(&hugetlb_lock);
+ h->resv_huge_pages++;
+ spin_unlock_irq(&hugetlb_lock);
+ }
+ }
}
--
2.50.1 (Apple Git-155)
On Tue, Apr 28, 2026 at 11:07:13AM +0800, Zhao Li wrote: > alloc_hugetlb_folio() calls hugepage_subpool_get_pages() when map_chg > is set. For subpools with max_hpages, that increments used_hpages. > If the later hugetlb cgroup charge fails, the unwind must undo that > charge even when gbl_chg > 0. I found that last sentence misleading, because we do not really care about hugetlb cgroup charge/uncharge (besides that being of the reasons we end up on error path) but rather the fact that we fiddle with subpool->used_hpages and we need to undo that when we rollback. > hugepage_subpool_put_pages() can also restore rsv_hpages if concurrent > frees move used_hpages below min_hpages between the get and put. When > that happens on the gbl_chg > 0 path, restore the matching global > reservation as well. Well, that does not quite explain the problem I think, at least not clear enough? So the problem at hand (IIUC) is that 1) if we took a global reservation 2) and concurrent hugepage_subpool_put_pages operations made used_hpages be below min_hpages, and so subpool's reservation was incremented, but we need to increment the global one as well otherwise the next time we pull a page from the spool, global resv_huge_pages will be inbalanced > Skip the gbl_chg > 0 put when max_hpages is unset. For a min_size-only > subpool, get_pages() did not change subpool state and put_pages() would > create a false reservation. > > Signed-off-by: Zhao Li <enderaoelyther@gmail.com> > --- > Changes in v2: > - Handle rsv_hpages restoration when racing frees cross min_hpages. > - Skip gbl_chg > 0 put_pages() when max_hpages is unset. > > mm/hugetlb.c | 21 ++++++++++++++++----- > 1 file changed, 16 insertions(+), 5 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index f24bf49be047e..4065d66fdcb5c 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3026,12 +3026,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > h_cg); > out_subpool_put: > /* > - * put page to subpool iff the quota of subpool's rsv_hpages is used > - * during hugepage_subpool_get_pages. > + * map_chg means hugepage_subpool_get_pages() succeeded above. > + * If max_hpages accounting was touched, undo it. If racing frees > + * moved the subpool below min_hpages, the put path may restore a > + * subpool reservation. Restore the matching global reservation too. I would split the comment in two parts and place them within the block they belong, otherwise it sounds confusing. And maybe elaborate a little bit more. Subpools, reservations and hugetlb make a very head-spinning situation, so let us make our life easier. -- Oscar Salvador SUSE Labs
On Tue, Apr 28, 2026 at 11:08:04AM +0200, Oscar Salvador wrote:
>On Tue, Apr 28, 2026 at 11:07:13AM +0800, Zhao Li wrote:
>> alloc_hugetlb_folio() calls hugepage_subpool_get_pages() when map_chg
>> is set. For subpools with max_hpages, that increments used_hpages.
>> If the later hugetlb cgroup charge fails, the unwind must undo that
>> charge even when gbl_chg > 0.
>
>I found that last sentence misleading, because we do not really care
>about hugetlb cgroup charge/uncharge (besides that being of the reasons
>we end up on error path) but rather the fact that we fiddle with
>subpool->used_hpages and we need to undo that when we rollback.
>
>> hugepage_subpool_put_pages() can also restore rsv_hpages if concurrent
>> frees move used_hpages below min_hpages between the get and put. When
>> that happens on the gbl_chg > 0 path, restore the matching global
>> reservation as well.
>
>Well, that does not quite explain the problem I think, at least not clear enough?
>So the problem at hand (IIUC) is that
>
>1) if we took a global reservation
>2) and concurrent hugepage_subpool_put_pages operations made
> used_hpages be below min_hpages, and so subpool's reservation
> was incremented, but we need to increment the global one as well
> otherwise the next time we pull a page from the spool,
> global resv_huge_pages will be inbalanced
>
>
>> Skip the gbl_chg > 0 put when max_hpages is unset. For a min_size-only
>> subpool, get_pages() did not change subpool state and put_pages() would
>> create a false reservation.
>>
>> Signed-off-by: Zhao Li <enderaoelyther@gmail.com>
>> ---
>> Changes in v2:
>> - Handle rsv_hpages restoration when racing frees cross min_hpages.
>> - Skip gbl_chg > 0 put_pages() when max_hpages is unset.
>>
>> mm/hugetlb.c | 21 ++++++++++++++++-----
>> 1 file changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index f24bf49be047e..4065d66fdcb5c 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3026,12 +3026,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>> h_cg);
>> out_subpool_put:
>> /*
>> - * put page to subpool iff the quota of subpool's rsv_hpages is used
>> - * during hugepage_subpool_get_pages.
>> + * map_chg means hugepage_subpool_get_pages() succeeded above.
>> + * If max_hpages accounting was touched, undo it. If racing frees
>> + * moved the subpool below min_hpages, the put path may restore a
>> + * subpool reservation. Restore the matching global reservation too.
>
>I would split the comment in two parts and place them within the block
>they belong, otherwise it sounds confusing.
>And maybe elaborate a little bit more.
>
>Subpools, reservations and hugetlb make a very head-spinning situation, so let
>us make our life easier.
Yep, the comment is confusing to me as well ...
+ if (map_chg) {
+ if (!gbl_chg) {
+ gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+ hugetlb_acct_memory(h, -gbl_reserve);
+ } else if (spool && spool->max_hpages != -1) {
+ gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+ if (!gbl_reserve) {
+ spin_lock_irq(&hugetlb_lock);
+ h->resv_huge_pages++;
+ spin_unlock_irq(&hugetlb_lock);
+ }
+ }
}
IIUC, there are three cases:
1) !gbl_chg
hugepage_subpool_get_pages() consumed one reservation from
spool->rsv_hpages. So the error path needs to call
hugepage_subpool_put_pages() and then drop the matching global
reservation with hugetlb_acct_memory().
2) gbl_chg > 0 && spool->max_hpages != -1
hugepage_subpool_get_pages() did not consume spool->rsv_hpages, but it
did increment spool->used_hpages after the max_hpages check passed. So
the error path still needs to call hugepage_subpool_put_pages() to undo
that.
If hugepage_subpool_put_pages() returns 0 here, it restored one
reservation in spool->rsv_hpages, so we also need to increment
h->resv_huge_pages.
3) gbl_chg > 0 && spool->max_hpages == -1
hugepage_subpool_get_pages() did not change spool->rsv_hpages or
spool->used_hpages, so the error path should not call
hugepage_subpool_put_pages(), otherwise it could create a false
reservation in spool->rsv_hpages.
Hopefully I didn't miss anything :)
Lance
On Tue, Apr 28, 2026 at 07:30:59PM +0800, Lance Yang wrote: > IIUC, there are three cases: > [...] > 2) gbl_chg > 0 && spool->max_hpages != -1 > [...] > If hugepage_subpool_put_pages() returns 0 here, it restored one > reservation in spool->rsv_hpages, so we also need to increment > h->resv_huge_pages. Thanks for working through the three cases - that's a clean breakdown and matches what v2 was trying to do. We ran into trouble on case 2 specifically: the h->resv_huge_pages++ on the put-returned-0 path looked right in isolation but turns out to be unsafe once you put it next to concurrent hugetlb_unreserve_pages() or free_huge_folio(). Two reachable orderings break it: * free_huge_folio() with HPageRestoreReserve set already does h->resv_huge_pages++ on its own. v2's bump on the gbl_chg > 0 cleanup double-counts against that. * hugetlb_unreserve_pages() does hugetlb_acct_memory(h, -X) which subtracts from h->resv_huge_pages and may also return surplus backing. v2's bump then leaves rsv_hpages backed by no h->resv_huge_pages - a phantom reservation that the next subpool_get_pages() consumes without real backing. v3 ended up going a different way: the gbl_chg > 0 cleanup is now restricted to (max_hpages != -1, min_hpages == -1). In that configuration hugepage_subpool_put_pages()'s min-restoration branch is dead, so a direct used_hpages-- under spool->lock is the exact inverse of the speculative bump - no put_pages(), no h->resv_huge_pages++, no concurrent-races to reason about. Your case 3 (max_hpages == -1) is unchanged: cleanup is a no-op, because get_pages() didn't touch any subpool field. Mounts with min_hpages != -1 are left at v1 behaviour for now. That quadrant has an inherited race that also exists at hugetlb_reserve_pages()'s out_put_pages cleanup; a coordinated fix belongs in a separate RFC rather than this stable backport. v3: https://lore.kernel.org/linux-mm/20260428113037.88766-2-enderaoelyther@gmail.com/ Thanks for the review. -- Zhao Li
On Tue, Apr 28, 2026 at 11:08:04AM +0200, Oscar Salvador wrote: > I found that last sentence misleading, because we do not really > care about hugetlb cgroup charge/uncharge (besides that being of > the reasons we end up on error path) but rather the fact that we > fiddle with subpool->used_hpages and we need to undo that when we > rollback. Agreed - reframed in v3. The commit body now states the bug as the unwind missing the used_hpages rollback, without pinning it to the cgroup-charge case, and the subject is narrowed to "fix max-only subpool accounting on alloc_hugetlb_folio failure". > Well, that does not quite explain the problem I think, at least > not clear enough? [...] Fair - that explanation got tangled because v2's design itself was trying to compensate for racing min crossings. v3 sidesteps it entirely: the gbl_chg > 0 cleanup is now restricted to (max_hpages != -1, min_hpages == -1). In that configuration hugepage_subpool_put_pages()'s min-restoration branch is dead, so a direct used_hpages-- under spool->lock is the exact inverse of the speculative bump - no h->resv_huge_pages++ needed, no rsv_hpages publication, no racing-put reasoning. Mounts with min_hpages != -1 are left at v1 behaviour for now. That quadrant has an inherited race that also exists at hugetlb_reserve_pages()'s out_put_pages cleanup, so a coordinated fix belongs in a separate RFC rather than this stable backport. > I would split the comment in two parts and place them within the > block they belong, otherwise it sounds confusing. > > Subpools, reservations and hugetlb make a very head-spinning > situation, so let us make our life easier. Done - one short comment per branch placed inside the relevant code block in v3. Hopefully easier to follow now. v3: https://lore.kernel.org/linux-mm/20260428113037.88766-2-enderaoelyther@gmail.com/ Thanks for the review. -- Zhao Li
alloc_hugetlb_folio() calls hugepage_subpool_get_pages() when map_chg
is set. For a subpool with max_hpages != -1, that bumps used_hpages
regardless of whether it returns gbl_chg = 0 (rsv slot consumed) or
gbl_chg > 0 (used_hpages slot only). If the allocation later fails
before a folio is returned, the unwind must undo the used_hpages
bump. The old cleanup only ran for !gbl_chg, leaking used_hpages on
the gbl_chg > 0 path.
For gbl_chg > 0 on max-only subpools (max_hpages != -1, min_hpages
== -1), hugepage_subpool_get_pages() took only a speculative
used_hpages slot. Drop that slot directly under spool->lock. In
that configuration hugepage_subpool_put_pages() cannot restore
rsv_hpages, so the direct decrement is the exact inverse and is
race-free against concurrent puts. This matches the used_hpages-only
part of hugetlb_reserve_pages()'s out_put_pages cleanup, but
restricts it to the max-only case where no rsv_hpages restoration is
possible.
Mounts with min_hpages != -1 are left unchanged for now. v2's
approach (hugepage_subpool_put_pages() + h->resv_huge_pages++ to
back a restored rsv_hpages slot) double-counts global backing under
concurrent free_huge_folio() and creates phantom reservations under
concurrent hugetlb_unreserve_pages(). Safe cleanup of that quadrant
needs a coordinated fix across multiple call sites.
Reproduced on size=20M hugetlbfs with the faulting task in a hugetlb
cgroup whose limit is exceeded. Vanilla leaks 6/8 hugepages of
subpool quota; this patch leaks 0/8. Verified under QEMU.
Fixes: a833a693a490 ("mm: hugetlb: fix incorrect fallback for subpool")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Zhao Li <enderaoelyther@gmail.com>
---
Changes in v3:
- Replace v2's hugepage_subpool_put_pages() + h->resv_huge_pages++ on
the gbl_chg > 0 branch with a direct used_hpages-- under spool->lock.
- Restrict the cleanup to (max_hpages != -1, min_hpages == -1) where
the direct decrement is the exact inverse of the speculative bump.
Changes in v2:
- Skip the gbl_chg > 0 cleanup when max_hpages is unset.
- Add hugepage_subpool_put_pages() + h->resv_huge_pages++ on the
gbl_chg > 0 branch.
mm/hugetlb.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f24bf49be047e..cfdeaf6394c5b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3025,13 +3025,24 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
h_cg);
out_subpool_put:
- /*
- * put page to subpool iff the quota of subpool's rsv_hpages is used
- * during hugepage_subpool_get_pages.
- */
- if (map_chg && !gbl_chg) {
- gbl_reserve = hugepage_subpool_put_pages(spool, 1);
- hugetlb_acct_memory(h, -gbl_reserve);
+ if (map_chg) {
+ if (!gbl_chg) {
+ /* Full inverse when subpool_get_pages() consumed rsv_hpages. */
+ gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+ hugetlb_acct_memory(h, -gbl_reserve);
+ } else if (gbl_chg > 0 && spool && spool->min_hpages == -1 &&
+ spool->max_hpages != -1) {
+ unsigned long flags;
+
+ /*
+ * For max-only subpools, subpool_get_pages() took only a
+ * speculative used_hpages slot. Drop that slot directly.
+ */
+ spin_lock_irqsave(&spool->lock, flags);
+ if (spool->used_hpages > 0)
+ spool->used_hpages--;
+ unlock_or_release_subpool(spool, flags);
+ }
}
--
2.50.1 (Apple Git-155)
© 2016 - 2026 Red Hat, Inc.