Presently, when calculating how many huge pages are needed when
reserving surplus huge pages, the global count of free huge pages
is used. When reserving with a mempolicy, the global count of free huge
pages is used even if some or all of those free huge pages are on NUMA
nodes outside of the mempolicy. Fix it so that only free huge pages on
nodes within the mempolicy are considered.
Signed-off-by: Charles Haithcock <chaithco@redhat.com>
---
mm/hugetlb.c | 38 +++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f24bf49be0..02752ce735 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2255,6 +2255,23 @@ static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
 	return NULL;
 }
 
+static unsigned int allowed_mems_nr(struct hstate *h)
+{
+	int node;
+	unsigned int nr = 0;
+	nodemask_t *mbind_nodemask;
+	unsigned int *array = h->free_huge_pages_node;
+	gfp_t gfp_mask = htlb_alloc_mask(h);
+
+	mbind_nodemask = policy_mbind_nodemask(gfp_mask);
+	for_each_node_mask(node, cpuset_current_mems_allowed) {
+		if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
+			nr += array[node];
+	}
+
+	return nr;
+}
+
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -2277,7 +2294,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 		alloc_nodemask = cpuset_current_mems_allowed;
 
 	lockdep_assert_held(&hugetlb_lock);
-	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
+	needed = (h->resv_huge_pages + delta) - allowed_mems_nr(h);
 	if (needed <= 0) {
 		h->resv_huge_pages += delta;
 		return 0;
@@ -2312,7 +2329,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 */
 	spin_lock_irq(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) -
-			(h->free_huge_pages + allocated);
+			(allowed_mems_nr(h) + allocated);
 	if (needed > 0) {
 		if (alloc_ok)
 			goto retry;
@@ -4513,23 +4530,6 @@ static int __init hugepage_alloc_threads_setup(char *s)
 }
 __setup("hugepage_alloc_threads=", hugepage_alloc_threads_setup);
 
-static unsigned int allowed_mems_nr(struct hstate *h)
-{
-	int node;
-	unsigned int nr = 0;
-	nodemask_t *mbind_nodemask;
-	unsigned int *array = h->free_huge_pages_node;
-	gfp_t gfp_mask = htlb_alloc_mask(h);
-
-	mbind_nodemask = policy_mbind_nodemask(gfp_mask);
-	for_each_node_mask(node, cpuset_current_mems_allowed) {
-		if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
-			nr += array[node];
-	}
-
-	return nr;
-}
-
 void hugetlb_report_meminfo(struct seq_file *m)
 {
 	struct hstate *h;
--
2.54.0
On Wed, 27 May 2026 16:48:46 -0600 Charles Haithcock <chaithco@redhat.com> wrote:
> Presently, when calculating how many huge pages are needed when
> reserving surplus huge pages, the global count of free huge pages
> is used. When reserving with a mempolicy, the global count of free huge
> pages is used even if some or all of those free huge pages are on NUMA
> nodes outside of the mempolicy. Fix it so that only free huge pages on
> nodes within the mempolicy are considered.
Hello Charles, thank you for the patch!
I just wanted to add that this seems to be a known issue. The comment
block in hugetlb_acct_memory (the only caller of gather_surplus_pages)
reads:
/*
* When cpuset is configured, it breaks the strict hugetlb page
* reservation as the accounting is done on a global variable. Such
* reservation is completely rubbish in the presence of cpuset because
* the reservation is not checked against page availability for the
* current cpuset. Application can still potentially OOM'ed by kernel
* with lack of free htlb page in cpuset that the task is in.
* Attempt to enforce strict accounting with cpuset is almost
* impossible (or too ugly) because cpuset is too fluid that
* task or memory node can be dynamically moved between cpusets.
*
* The change of semantics for shared hugetlb mapping with cpuset is
* undesirable. However, in order to preserve some of the semantics,
* we fall back to check against current free page availability as
* a best attempt and hopefully to minimize the impact of changing
* semantics that cpuset has.
*
* Apart from cpuset, we also have memory policy mechanism that
* also determines from which node the kernel will allocate memory
* in a NUMA system. So similar to cpuset, we also should consider
* the memory policy of the current task. Similar to the description
* above.
*/
So it would appear that getting an exact number of pages to allocate,
while ensuring that nothing changes in the reservation or in which nodes
those reservations actually go to, is a lot more difficult. But I think
we can do a bit better.
FWIW, I think over-allocating is actually not fatal (although over-allocating
by a lot is obviously not desirable) since we free all the unused hugetlb
pages at the end of gather_surplus_pages. I wonder if an approach like this
could work:
@@ -2260,7 +2277,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 		alloc_nodemask = cpuset_current_mems_allowed;
 
 	lockdep_assert_held(&hugetlb_lock);
-	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
+	needed = max(delta - allowed_mems_nr(h),
+		     (h->resv_huge_pages + delta) - h->free_huge_pages);
 	if (needed <= 0) {
 		h->resv_huge_pages += delta;
 		return 0;
@@ -2294,8 +2312,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 * because either resv_huge_pages or free_huge_pages may have changed.
 	 */
 	spin_lock_irq(&hugetlb_lock);
-	needed = (h->resv_huge_pages + delta) -
-			(h->free_huge_pages + allocated);
+	needed = max((h->resv_huge_pages + delta) - h->free_huge_pages,
+		     delta - allowed_mems_nr(h)) - allocated;
 	if (needed > 0) {
 		if (alloc_ok)
 			goto retry;
So we compute "needed" from the mempolicy's perspective, compare it to the
global "needed", and take whichever is larger. Since we are taking a max, it
can only make it more likely that the mempolicy-bound hugetlb page usage
actually succeeds, even though we still can't make guarantees, since
a free page on our node may be taken by a different reservation later.
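To make the arithmetic concrete, here is a rough userspace model of the two
computations. It is only a sketch: struct toy_hstate, NR_NODES, the bitmask
standing in for the nodemask, and every helper name below are made up for
illustration, and none of this is kernel code. It ignores locking, the retry
loop, and the "allocated" term.

```c
#define NR_NODES 4	/* hypothetical node count for this model */

/* Toy stand-in for struct hstate; only the fields used here. */
struct toy_hstate {
	long resv_huge_pages;
	long free_huge_pages;			/* global free count */
	long free_huge_pages_node[NR_NODES];	/* per-node free counts */
};

/*
 * Sum the free pages on the nodes set in 'allowed' (bit n == node n).
 * Models what allowed_mems_nr() computes; the bitmask is an assumption
 * standing in for the cpuset/mempolicy nodemask intersection.
 */
static long toy_allowed_mems_nr(const struct toy_hstate *h, unsigned int allowed)
{
	long nr = 0;
	int node;

	for (node = 0; node < NR_NODES; node++) {
		if (allowed & (1u << node))
			nr += h->free_huge_pages_node[node];
	}

	return nr;
}

/* Global-only view, as in the existing first computation of 'needed'. */
static long needed_global(const struct toy_hstate *h, long delta)
{
	return (h->resv_huge_pages + delta) - h->free_huge_pages;
}

/* Suggested view: the larger of the mempolicy-local and global results. */
static long needed_proposed(const struct toy_hstate *h, long delta,
			    unsigned int allowed)
{
	long local = delta - toy_allowed_mems_nr(h, allowed);
	long global = needed_global(h, delta);

	return local > global ? local : global;
}
```

With resv_huge_pages = 0, eight free pages that all sit on node 1, a task
bound to node 0 only, and delta = 4, needed_global() gives -4 (nothing to
allocate) while needed_proposed() gives 4. If the task is bound to node 1
instead, both give -4.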
Let me know what you think. Thanks again!
Joshua
On Wed, 27 May 2026 16:48:46 -0600 Charles Haithcock <chaithco@redhat.com> wrote:
> Presently, when calculating how many huge pages are needed when
> reserving surplus huge pages, the global count of free huge pages
> is used. When reserving with a mempolicy, the global count of free huge
> pages is used even if some or all of those free huge pages are on NUMA
> nodes outside of the mempolicy. Fix it so that only free huge pages on
> nodes within the mempolicy are considered.
>
Thanks. AI review asked a thing:
https://sashiko.dev/#/patchset/20260527224848.1753560-1-chaithco@redhat.com