[v2] mm/hugetlb: fix subpool accounting after cgroup charge failure

[PATCH v2] mm/hugetlb: fix subpool accounting after cgroup charge failure

Posted by Zhao Li 1 month, 2 weeks ago

alloc_hugetlb_folio() calls hugepage_subpool_get_pages() when map_chg
is set.  For subpools with max_hpages, that increments used_hpages.
If the later hugetlb cgroup charge fails, the unwind must undo that
charge even when gbl_chg > 0.

hugepage_subpool_put_pages() can also restore rsv_hpages if concurrent
frees move used_hpages below min_hpages between the get and put.  When
that happens on the gbl_chg > 0 path, restore the matching global
reservation as well.

Skip the gbl_chg > 0 put when max_hpages is unset.  For a min_size-only
subpool, get_pages() did not change subpool state and put_pages() would
create a false reservation.

Signed-off-by: Zhao Li <enderaoelyther@gmail.com>
---
Changes in v2:
- Handle rsv_hpages restoration when racing frees cross min_hpages.
- Skip gbl_chg > 0 put_pages() when max_hpages is unset.

 mm/hugetlb.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f24bf49be047e..4065d66fdcb5c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3026,12 +3026,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 						    h_cg);
 out_subpool_put:
 	/*
-	 * put page to subpool iff the quota of subpool's rsv_hpages is used
-	 * during hugepage_subpool_get_pages.
+	 * map_chg means hugepage_subpool_get_pages() succeeded above.
+	 * If max_hpages accounting was touched, undo it.  If racing frees
+	 * moved the subpool below min_hpages, the put path may restore a
+	 * subpool reservation.  Restore the matching global reservation too.
 	 */
-	if (map_chg && !gbl_chg) {
-		gbl_reserve = hugepage_subpool_put_pages(spool, 1);
-		hugetlb_acct_memory(h, -gbl_reserve);
+	if (map_chg) {
+		if (!gbl_chg) {
+			gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+			hugetlb_acct_memory(h, -gbl_reserve);
+		} else if (spool && spool->max_hpages != -1) {
+			gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+			if (!gbl_reserve) {
+				spin_lock_irq(&hugetlb_lock);
+				h->resv_huge_pages++;
+				spin_unlock_irq(&hugetlb_lock);
+			}
+		}
 	}


--
2.50.1 (Apple Git-155)

Re: [PATCH v2] mm/hugetlb: fix subpool accounting after cgroup charge failure

Posted by Oscar Salvador 1 month, 2 weeks ago

On Tue, Apr 28, 2026 at 11:07:13AM +0800, Zhao Li wrote:
> alloc_hugetlb_folio() calls hugepage_subpool_get_pages() when map_chg
> is set.  For subpools with max_hpages, that increments used_hpages.
> If the later hugetlb cgroup charge fails, the unwind must undo that
> charge even when gbl_chg > 0.

I found that last sentence misleading, because we do not really care
about hugetlb cgroup charge/uncharge (besides that being of the reasons
we end up on error path) but rather the fact that we fiddle with
subpool->used_hpages and we need to undo that when we rollback.

> hugepage_subpool_put_pages() can also restore rsv_hpages if concurrent
> frees move used_hpages below min_hpages between the get and put.  When
> that happens on the gbl_chg > 0 path, restore the matching global
> reservation as well.

Well, that does not quite explain the problem I think, at least not clear enough?
So the problem at hand (IIUC) is that

1) if we took a global reservation
2) and concurrent hugepage_subpool_put_pages operations made
   used_hpages be below min_hpages, and so subpool's reservation
   was incremented, but we need to increment the global one as well
   otherwise the next time we pull a page from the spool,
   global resv_huge_pages will be inbalanced


> Skip the gbl_chg > 0 put when max_hpages is unset.  For a min_size-only
> subpool, get_pages() did not change subpool state and put_pages() would
> create a false reservation.
> 
> Signed-off-by: Zhao Li <enderaoelyther@gmail.com>
> ---
> Changes in v2:
> - Handle rsv_hpages restoration when racing frees cross min_hpages.
> - Skip gbl_chg > 0 put_pages() when max_hpages is unset.
> 
>  mm/hugetlb.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f24bf49be047e..4065d66fdcb5c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3026,12 +3026,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  						    h_cg);
>  out_subpool_put:
>  	/*
> -	 * put page to subpool iff the quota of subpool's rsv_hpages is used
> -	 * during hugepage_subpool_get_pages.
> +	 * map_chg means hugepage_subpool_get_pages() succeeded above.
> +	 * If max_hpages accounting was touched, undo it.  If racing frees
> +	 * moved the subpool below min_hpages, the put path may restore a
> +	 * subpool reservation.  Restore the matching global reservation too.

I would split the comment in two parts and place them within the block
they belong, otherwise it sounds confusing.
And maybe elaborate a little bit more.

Subpools, reservations and hugetlb make a very head-spinning situation, so let
us make our life easier.


-- 
Oscar Salvador
SUSE Labs

Re: [PATCH v2] mm/hugetlb: fix subpool accounting after cgroup charge failure

Posted by Lance Yang 1 month, 2 weeks ago

On Tue, Apr 28, 2026 at 11:08:04AM +0200, Oscar Salvador wrote:
>On Tue, Apr 28, 2026 at 11:07:13AM +0800, Zhao Li wrote:
>> alloc_hugetlb_folio() calls hugepage_subpool_get_pages() when map_chg
>> is set.  For subpools with max_hpages, that increments used_hpages.
>> If the later hugetlb cgroup charge fails, the unwind must undo that
>> charge even when gbl_chg > 0.
>
>I found that last sentence misleading, because we do not really care
>about hugetlb cgroup charge/uncharge (besides that being of the reasons
>we end up on error path) but rather the fact that we fiddle with
>subpool->used_hpages and we need to undo that when we rollback.
>
>> hugepage_subpool_put_pages() can also restore rsv_hpages if concurrent
>> frees move used_hpages below min_hpages between the get and put.  When
>> that happens on the gbl_chg > 0 path, restore the matching global
>> reservation as well.
>
>Well, that does not quite explain the problem I think, at least not clear enough?
>So the problem at hand (IIUC) is that
>
>1) if we took a global reservation
>2) and concurrent hugepage_subpool_put_pages operations made
>   used_hpages be below min_hpages, and so subpool's reservation
>   was incremented, but we need to increment the global one as well
>   otherwise the next time we pull a page from the spool,
>   global resv_huge_pages will be inbalanced
>
>
>> Skip the gbl_chg > 0 put when max_hpages is unset.  For a min_size-only
>> subpool, get_pages() did not change subpool state and put_pages() would
>> create a false reservation.
>> 
>> Signed-off-by: Zhao Li <enderaoelyther@gmail.com>
>> ---
>> Changes in v2:
>> - Handle rsv_hpages restoration when racing frees cross min_hpages.
>> - Skip gbl_chg > 0 put_pages() when max_hpages is unset.
>> 
>>  mm/hugetlb.c | 21 ++++++++++++++++-----
>>  1 file changed, 16 insertions(+), 5 deletions(-)
>> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index f24bf49be047e..4065d66fdcb5c 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3026,12 +3026,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>>  						    h_cg);
>>  out_subpool_put:
>>  	/*
>> -	 * put page to subpool iff the quota of subpool's rsv_hpages is used
>> -	 * during hugepage_subpool_get_pages.
>> +	 * map_chg means hugepage_subpool_get_pages() succeeded above.
>> +	 * If max_hpages accounting was touched, undo it.  If racing frees
>> +	 * moved the subpool below min_hpages, the put path may restore a
>> +	 * subpool reservation.  Restore the matching global reservation too.
>
>I would split the comment in two parts and place them within the block
>they belong, otherwise it sounds confusing.
>And maybe elaborate a little bit more.
>
>Subpools, reservations and hugetlb make a very head-spinning situation, so let
>us make our life easier.

Yep, the comment is confusing to me as well ...

+	if (map_chg) {
+		if (!gbl_chg) {
+			gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+			hugetlb_acct_memory(h, -gbl_reserve);
+		} else if (spool && spool->max_hpages != -1) {
+			gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+			if (!gbl_reserve) {
+				spin_lock_irq(&hugetlb_lock);
+				h->resv_huge_pages++;
+				spin_unlock_irq(&hugetlb_lock);
+			}
+		}
 	}

IIUC, there are three cases:

1) !gbl_chg

hugepage_subpool_get_pages() consumed one reservation from
spool->rsv_hpages.  So the error path needs to call
hugepage_subpool_put_pages() and then drop the matching global
reservation with hugetlb_acct_memory().

2) gbl_chg > 0 && spool->max_hpages != -1

hugepage_subpool_get_pages() did not consume spool->rsv_hpages, but it
did increment spool->used_hpages after the max_hpages check passed.  So
the error path still needs to call hugepage_subpool_put_pages() to undo
that.

If hugepage_subpool_put_pages() returns 0 here, it restored one
reservation in spool->rsv_hpages, so we also need to increment
h->resv_huge_pages.

3) gbl_chg > 0 && spool->max_hpages == -1

hugepage_subpool_get_pages() did not change spool->rsv_hpages or
spool->used_hpages, so the error path should not call
hugepage_subpool_put_pages(), otherwise it could create a false
reservation in spool->rsv_hpages.

Hopefully I didn't miss anything :)
Lance

Re: [PATCH v2] mm/hugetlb: fix subpool accounting after cgroup charge failure

Posted by Zhao Li 1 month, 2 weeks ago

On Tue, Apr 28, 2026 at 07:30:59PM +0800, Lance Yang wrote:
> IIUC, there are three cases:
> [...]
> 2) gbl_chg > 0 && spool->max_hpages != -1
> [...]
> If hugepage_subpool_put_pages() returns 0 here, it restored one
> reservation in spool->rsv_hpages, so we also need to increment
> h->resv_huge_pages.

Thanks for working through the three cases - that's a clean
breakdown and matches what v2 was trying to do.  We ran into trouble
on case 2 specifically: the h->resv_huge_pages++ on the
put-returned-0 path looked right in isolation but turns out to be
unsafe once you put it next to concurrent hugetlb_unreserve_pages()
or free_huge_folio().  Two reachable orderings break it:

 * free_huge_folio() with HPageRestoreReserve set already does
   h->resv_huge_pages++ on its own.  v2's bump on the gbl_chg > 0
   cleanup double-counts against that.

 * hugetlb_unreserve_pages() does hugetlb_acct_memory(h, -X) which
   subtracts from h->resv_huge_pages and may also return surplus
   backing.  v2's bump then leaves rsv_hpages backed by no
   h->resv_huge_pages - a phantom reservation that the next
   subpool_get_pages() consumes without real backing.

v3 ended up going a different way: the gbl_chg > 0 cleanup is now
restricted to (max_hpages != -1, min_hpages == -1).  In that
configuration hugepage_subpool_put_pages()'s min-restoration branch
is dead, so a direct used_hpages-- under spool->lock is the exact
inverse of the speculative bump - no put_pages(), no
h->resv_huge_pages++, no concurrent-races to reason about.

Your case 3 (max_hpages == -1) is unchanged: cleanup is a no-op,
because get_pages() didn't touch any subpool field.

Mounts with min_hpages != -1 are left at v1 behaviour for now.  That
quadrant has an inherited race that also exists at
hugetlb_reserve_pages()'s out_put_pages cleanup; a coordinated fix
belongs in a separate RFC rather than this stable backport.

v3:
  https://lore.kernel.org/linux-mm/20260428113037.88766-2-enderaoelyther@gmail.com/

Thanks for the review.

--
Zhao Li

Re: [PATCH v2] mm/hugetlb: fix subpool accounting after cgroup charge failure

Posted by Zhao Li 1 month, 2 weeks ago

On Tue, Apr 28, 2026 at 11:08:04AM +0200, Oscar Salvador wrote:
> I found that last sentence misleading, because we do not really
> care about hugetlb cgroup charge/uncharge (besides that being of
> the reasons we end up on error path) but rather the fact that we
> fiddle with subpool->used_hpages and we need to undo that when we
> rollback.

Agreed - reframed in v3.  The commit body now states the bug as
the unwind missing the used_hpages rollback, without pinning it to
the cgroup-charge case, and the subject is narrowed to "fix
max-only subpool accounting on alloc_hugetlb_folio failure".

> Well, that does not quite explain the problem I think, at least
> not clear enough?  [...]

Fair - that explanation got tangled because v2's design itself was
trying to compensate for racing min crossings.  v3 sidesteps it
entirely: the gbl_chg > 0 cleanup is now restricted to
(max_hpages != -1, min_hpages == -1).  In that configuration
hugepage_subpool_put_pages()'s min-restoration branch is dead, so a
direct used_hpages-- under spool->lock is the exact inverse of the
speculative bump - no h->resv_huge_pages++ needed, no rsv_hpages
publication, no racing-put reasoning.

Mounts with min_hpages != -1 are left at v1 behaviour for now.
That quadrant has an inherited race that also exists at
hugetlb_reserve_pages()'s out_put_pages cleanup, so a coordinated
fix belongs in a separate RFC rather than this stable backport.

> I would split the comment in two parts and place them within the
> block they belong, otherwise it sounds confusing.
>
> Subpools, reservations and hugetlb make a very head-spinning
> situation, so let us make our life easier.

Done - one short comment per branch placed inside the relevant
code block in v3.  Hopefully easier to follow now.

v3:
  https://lore.kernel.org/linux-mm/20260428113037.88766-2-enderaoelyther@gmail.com/

Thanks for the review.

--
Zhao Li

[PATCH v3] mm/hugetlb: fix max-only subpool accounting on alloc_hugetlb_folio failure

Posted by Zhao Li 1 month, 2 weeks ago

alloc_hugetlb_folio() calls hugepage_subpool_get_pages() when map_chg
is set.  For a subpool with max_hpages != -1, that bumps used_hpages
regardless of whether it returns gbl_chg = 0 (rsv slot consumed) or
gbl_chg > 0 (used_hpages slot only).  If the allocation later fails
before a folio is returned, the unwind must undo the used_hpages
bump.  The old cleanup only ran for !gbl_chg, leaking used_hpages on
the gbl_chg > 0 path.

For gbl_chg > 0 on max-only subpools (max_hpages != -1, min_hpages
== -1), hugepage_subpool_get_pages() took only a speculative
used_hpages slot.  Drop that slot directly under spool->lock.  In
that configuration hugepage_subpool_put_pages() cannot restore
rsv_hpages, so the direct decrement is the exact inverse and is
race-free against concurrent puts.  This matches the used_hpages-only
part of hugetlb_reserve_pages()'s out_put_pages cleanup, but
restricts it to the max-only case where no rsv_hpages restoration is
possible.

Mounts with min_hpages != -1 are left unchanged for now.  v2's
approach (hugepage_subpool_put_pages() + h->resv_huge_pages++ to
back a restored rsv_hpages slot) double-counts global backing under
concurrent free_huge_folio() and creates phantom reservations under
concurrent hugetlb_unreserve_pages().  Safe cleanup of that quadrant
needs a coordinated fix across multiple call sites.

Reproduced on size=20M hugetlbfs with the faulting task in a hugetlb
cgroup whose limit is exceeded.  Vanilla leaks 6/8 hugepages of
subpool quota; this patch leaks 0/8.  Verified under QEMU.

Fixes: a833a693a490 ("mm: hugetlb: fix incorrect fallback for subpool")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Zhao Li <enderaoelyther@gmail.com>
---
Changes in v3:
- Replace v2's hugepage_subpool_put_pages() + h->resv_huge_pages++ on
  the gbl_chg > 0 branch with a direct used_hpages-- under spool->lock.
- Restrict the cleanup to (max_hpages != -1, min_hpages == -1) where
  the direct decrement is the exact inverse of the speculative bump.

Changes in v2:
- Skip the gbl_chg > 0 cleanup when max_hpages is unset.
- Add hugepage_subpool_put_pages() + h->resv_huge_pages++ on the
  gbl_chg > 0 branch.

 mm/hugetlb.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f24bf49be047e..cfdeaf6394c5b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3025,13 +3025,24 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
 						    h_cg);
 out_subpool_put:
-	/*
-	 * put page to subpool iff the quota of subpool's rsv_hpages is used
-	 * during hugepage_subpool_get_pages.
-	 */
-	if (map_chg && !gbl_chg) {
-		gbl_reserve = hugepage_subpool_put_pages(spool, 1);
-		hugetlb_acct_memory(h, -gbl_reserve);
+	if (map_chg) {
+		if (!gbl_chg) {
+			/* Full inverse when subpool_get_pages() consumed rsv_hpages. */
+			gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+			hugetlb_acct_memory(h, -gbl_reserve);
+		} else if (gbl_chg > 0 && spool && spool->min_hpages == -1 &&
+			   spool->max_hpages != -1) {
+			unsigned long flags;
+
+			/*
+			 * For max-only subpools, subpool_get_pages() took only a
+			 * speculative used_hpages slot. Drop that slot directly.
+			 */
+			spin_lock_irqsave(&spool->lock, flags);
+			if (spool->used_hpages > 0)
+				spool->used_hpages--;
+			unlock_or_release_subpool(spool, flags);
+		}
 	}


--
2.50.1 (Apple Git-155)