Respect mempolicy when calculating surplus huge pages.

[PATCH] Respect mempolicy when calculating surplus huge pages.

Posted by Charles Haithcock 1 week, 4 days ago

Presently, when calculating how many huge pages are needed when
reserving surplus huge pages, the global count of free huge pages
is used. When reserving with a mempolicy, this global count is used
even if some or all of those free huge pages are on NUMA nodes
outside of the mempolicy. Fix it so that only free huge pages on nodes
within the mempolicy are considered.

Signed-off-by: Charles Haithcock <chaithco@redhat.com>
---
 mm/hugetlb.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f24bf49be0..02752ce735 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2255,6 +2255,23 @@ static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
 	return NULL;
 }
 
+static unsigned int allowed_mems_nr(struct hstate *h)
+{
+	int node;
+	unsigned int nr = 0;
+	nodemask_t *mbind_nodemask;
+	unsigned int *array = h->free_huge_pages_node;
+	gfp_t gfp_mask = htlb_alloc_mask(h);
+
+	mbind_nodemask = policy_mbind_nodemask(gfp_mask);
+	for_each_node_mask(node, cpuset_current_mems_allowed) {
+		if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
+			nr += array[node];
+	}
+
+	return nr;
+}
+
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -2277,7 +2294,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 		alloc_nodemask = cpuset_current_mems_allowed;
 
 	lockdep_assert_held(&hugetlb_lock);
-	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
+	needed = (h->resv_huge_pages + delta) - allowed_mems_nr(h);
 	if (needed <= 0) {
 		h->resv_huge_pages += delta;
 		return 0;
@@ -2312,7 +2329,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 */
 	spin_lock_irq(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) -
-			(h->free_huge_pages + allocated);
+			(allowed_mems_nr(h) + allocated);
 	if (needed > 0) {
 		if (alloc_ok)
 			goto retry;
@@ -4513,23 +4530,6 @@ static int __init hugepage_alloc_threads_setup(char *s)
 }
 __setup("hugepage_alloc_threads=", hugepage_alloc_threads_setup);
 
-static unsigned int allowed_mems_nr(struct hstate *h)
-{
-	int node;
-	unsigned int nr = 0;
-	nodemask_t *mbind_nodemask;
-	unsigned int *array = h->free_huge_pages_node;
-	gfp_t gfp_mask = htlb_alloc_mask(h);
-
-	mbind_nodemask = policy_mbind_nodemask(gfp_mask);
-	for_each_node_mask(node, cpuset_current_mems_allowed) {
-		if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
-			nr += array[node];
-	}
-
-	return nr;
-}
-
 void hugetlb_report_meminfo(struct seq_file *m)
 {
 	struct hstate *h;
-- 
2.54.0

Re: [PATCH] Respect mempolicy when calculating surplus huge pages.

Posted by Joshua Hahn 6 days ago

On Wed, 27 May 2026 16:48:46 -0600 Charles Haithcock <chaithco@redhat.com> wrote:

> Presently, when calculating how many huge pages are needed when
> reserving surplus huge pages, the global count of free huge pages
> is used. When reserving with a mempolicy, this global count is used
> even if some or all of those free huge pages are on NUMA nodes
> outside of the mempolicy. Fix it so that only free huge pages on nodes
> within the mempolicy are considered.

Hello Charles, thank you for the patch!

I just wanted to add that this seems to be a known issue.
hugetlb_acct_memory (the only caller of gather_surplus_pages) carries
the following comment block:

        /*
         * When cpuset is configured, it breaks the strict hugetlb page
         * reservation as the accounting is done on a global variable. Such
         * reservation is completely rubbish in the presence of cpuset because
         * the reservation is not checked against page availability for the
         * current cpuset. Application can still potentially OOM'ed by kernel
         * with lack of free htlb page in cpuset that the task is in.
         * Attempt to enforce strict accounting with cpuset is almost
         * impossible (or too ugly) because cpuset is too fluid that
         * task or memory node can be dynamically moved between cpusets.
         *
         * The change of semantics for shared hugetlb mapping with cpuset is
         * undesirable. However, in order to preserve some of the semantics,
         * we fall back to check against current free page availability as
         * a best attempt and hopefully to minimize the impact of changing
         * semantics that cpuset has.
         *
         * Apart from cpuset, we also have memory policy mechanism that
         * also determines from which node the kernel will allocate memory
         * in a NUMA system. So similar to cpuset, we also should consider
         * the memory policy of the current task. Similar to the description
         * above.

So it would appear that getting an exact number of pages to allocate,
while ensuring that neither the reservations nor the nodes those
reservations actually land on change underneath us, is a lot more
difficult. But I think we can do a bit better.
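
To make the mismatch concrete, here is a minimal user-space sketch (not
kernel code) of the two ways of counting free huge pages. The names
MAX_NODES, free_pages_in_mask and free_pages_global are hypothetical; the
array stands in for h->free_huge_pages_node, and a plain bitmask stands in
for the intersection of the cpuset and mempolicy nodemasks.

```c
#define MAX_NODES 4	/* hypothetical small NUMA machine */

/*
 * Sum the free huge pages on the nodes selected by 'allowed_mask'.
 * Bit n set means "node n is usable by the current task"; this models
 * the per-node walk, not the kernel's nodemask_t API.
 */
static unsigned int free_pages_in_mask(const unsigned int free_per_node[MAX_NODES],
				       unsigned int allowed_mask)
{
	unsigned int nr = 0;

	for (int node = 0; node < MAX_NODES; node++) {
		if (allowed_mask & (1u << node))
			nr += free_per_node[node];
	}
	return nr;
}

/* Global view: every node is counted, whatever the policy says. */
static unsigned int free_pages_global(const unsigned int free_per_node[MAX_NODES])
{
	return free_pages_in_mask(free_per_node, (1u << MAX_NODES) - 1);
}
```

With eight free pages that all sit on node 1, a task bound to node 0 sees
a global count of 8 but a policy-visible count of 0, which is the case the
patch description is about.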

FWIW, I think over-allocating is actually not fatal (although over-allocating
by a lot is obviously not desirable), since we free all the unused hugetlb
pages at the end of gather_surplus_pages. I wonder if an approach like this
could work:

@@ -2260,7 +2277,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
                alloc_nodemask = cpuset_current_mems_allowed;

        lockdep_assert_held(&hugetlb_lock);
-       needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
+       needed = max(delta - allowed_mems_nr(h),
+                    (h->resv_huge_pages + delta) - h->free_huge_pages);
        if (needed <= 0) {
                h->resv_huge_pages += delta;
                return 0;
@@ -2294,8 +2312,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
         * because either resv_huge_pages or free_huge_pages may have changed.
         */
        spin_lock_irq(&hugetlb_lock);
-       needed = (h->resv_huge_pages + delta) -
-                       (h->free_huge_pages + allocated);
+       needed = max((h->resv_huge_pages + delta) - h->free_huge_pages,
+                    delta - allowed_mems_nr(h)) - allocated;
        if (needed > 0) {
                if (alloc_ok)
                        goto retry;

So we compute "needed" from the mempolicy's perspective, compare it to the
global "needed", and take whichever is larger. Since we are taking a max, it
should only ever make it more likely that mempolicy-bound hugetlb page usage
actually succeeds, even though we still can't make guarantees, since
a free page on our node may be taken by a different reservation later.
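
As a rough illustration, the three ways of computing the initial "needed"
discussed in this thread can be modelled in user space as below. This is a
sketch only: struct pool_state and the needed_* helpers are hypothetical
names, and allowed_free is assumed to hold whatever allowed_mems_nr(h)
would return at that moment.

```c
/* Hypothetical snapshot of the counters read by gather_surplus_pages(). */
struct pool_state {
	long resv_huge_pages;	/* pages already reserved, all nodes */
	long free_huge_pages;	/* free pages, all nodes */
	long allowed_free;	/* free pages on nodes the policy allows */
};

static long max_long(long a, long b)
{
	return a > b ? a : b;
}

/* Existing code: purely global accounting. */
static long needed_global(const struct pool_state *s, long delta)
{
	return (s->resv_huge_pages + delta) - s->free_huge_pages;
}

/* Posted patch: the policy-visible free count replaces the global one. */
static long needed_patch(const struct pool_state *s, long delta)
{
	return (s->resv_huge_pages + delta) - s->allowed_free;
}

/* Suggested variant: larger of the policy-local and global shortfalls. */
static long needed_max(const struct pool_state *s, long delta)
{
	return max_long(delta - s->allowed_free, needed_global(s, delta));
}
```

For resv_huge_pages = 0, free_huge_pages = 8, allowed_free = 0 and
delta = 4, needed_global gives -4 (nothing is allocated) while the other
two give 4. For resv_huge_pages = 6, free_huge_pages = 8, allowed_free = 2
and delta = 4, needed_patch gives 8 because the global reservation count is
charged against the local free count, whereas needed_max gives 2.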

Let me know what you think. Thanks again!
Joshua

Re: [PATCH] Respect mempolicy when calculating surplus huge pages.

Posted by Andrew Morton 1 week, 2 days ago

On Wed, 27 May 2026 16:48:46 -0600 Charles Haithcock <chaithco@redhat.com> wrote:

> Presently, when calculating how many huge pages are needed when
> reserving surplus huge pages, the global count of free huge pages
> is used. When reserving with a mempolicy, this global count is used
> even if some or all of those free huge pages are on NUMA nodes
> outside of the mempolicy. Fix it so that only free huge pages on nodes
> within the mempolicy are considered.
> 

Thanks. The AI review raised a question:
	https://sashiko.dev/#/patchset/20260527224848.1753560-1-chaithco@redhat.com