[RFC PATCH 10/12] mm/hugetlb: do explicit CMA balancing

Posted by Frank van der Linden 2 weeks, 2 days ago
CMA areas are normally not very large, but HugeTLB CMA is an
exception. hugetlb_cma, used for 'gigantic' pages (usually
1G), can take up many gigabytes of memory.

As such, it is potentially the largest source of 'false OOM'
conditions: situations where the kernel runs out of space
for unmovable allocations because it cannot allocate from
CMA pageblocks, while non-CMA memory has been tied up by
other movable allocations.

The normal use case for hugetlb_cma is a system where 1G
hugetlb pages are sometimes, but not always, needed, so
they are created and freed dynamically. As such, the best
time to address CMA memory imbalances is when CMA hugetlb
pages are freed, making multiples of 1G available again as
buddy-managed CMA pageblocks. That is a good time to check
whether movable allocations should be moved from non-CMA
pageblocks to CMA pageblocks, to give the kernel more
breathing space.

Do this by calling balance_node_cma() for the hugetlb CMA
area on the node that just had its number of hugetlb pages
reduced, or for all hugetlb CMA areas if the reduction was
not node-specific.

To have the CMA balancing code act on the hugetlb CMA areas,
set the CMA_BALANCE flag when creating them.
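
The hunk that sets the flag is not part of the diff below. As a
rough sketch only (hugetlb_cma_mark_balance() and cma_set_flag()
are names assumed here, not taken from this series), marking the
per-node areas could look something like:

	/*
	 * Sketch only: opt the per-node hugetlb CMA areas in to
	 * balancing. cma_set_flag() is an assumed helper and this
	 * function name is made up; the real change is not shown
	 * in the diff below.
	 */
	static void __init hugetlb_cma_mark_balance(void)
	{
		int nid;

		for_each_online_node(nid) {
			if (hugetlb_cma[nid])
				cma_set_flag(hugetlb_cma[nid], CMA_BALANCE);
		}
	}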

Signed-off-by: Frank van der Linden <fvdl@google.com>
---
 mm/hugetlb.c     | 14 ++++++++------
 mm/hugetlb_cma.c | 16 ++++++++++++++++
 mm/hugetlb_cma.h |  5 +++++
 3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eed59cfb5d21..611655876f60 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3971,12 +3971,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 
 		list_add(&folio->lru, &page_list);
 	}
-	/* free the pages after dropping lock */
-	spin_unlock_irq(&hugetlb_lock);
-	update_and_free_pages_bulk(h, &page_list);
-	flush_free_hpage_work(h);
-	spin_lock_irq(&hugetlb_lock);
-
+	if (!list_empty(&page_list)) {
+		/* free the pages after dropping lock */
+		spin_unlock_irq(&hugetlb_lock);
+		update_and_free_pages_bulk(h, &page_list);
+		flush_free_hpage_work(h);
+		hugetlb_cma_balance(nid);
+		spin_lock_irq(&hugetlb_lock);
+	}
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, 1))
 			break;
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index 71d0e9a048d4..c0396d35b5bf 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -276,3 +276,19 @@ bool __init hugetlb_early_cma(struct hstate *h)
 
 	return hstate_is_gigantic(h) && hugetlb_cma_only;
 }
+
+void hugetlb_cma_balance(int nid)
+{
+	int node;
+
+	if (nid != NUMA_NO_NODE) {
+		if (hugetlb_cma[nid])
+			balance_node_cma(nid, hugetlb_cma[nid]);
+	} else {
+		for_each_online_node(node) {
+			if (hugetlb_cma[node])
+				balance_node_cma(node,
+						 hugetlb_cma[node]);
+		}
+	}
+}
diff --git a/mm/hugetlb_cma.h b/mm/hugetlb_cma.h
index f7d7fb9880a2..2f2a35b56d8a 100644
--- a/mm/hugetlb_cma.h
+++ b/mm/hugetlb_cma.h
@@ -13,6 +13,7 @@ bool hugetlb_cma_exclusive_alloc(void);
 unsigned long hugetlb_cma_total_size(void);
 void hugetlb_cma_validate_params(void);
 bool hugetlb_early_cma(struct hstate *h);
+void hugetlb_cma_balance(int nid);
 #else
 static inline void hugetlb_cma_free_folio(struct folio *folio)
 {
@@ -53,5 +54,9 @@ static inline bool hugetlb_early_cma(struct hstate *h)
 {
 	return false;
 }
+
+static inline void hugetlb_cma_balance(int nid)
+{
+}
 #endif
 #endif
-- 
2.51.0.384.g4c02a37b29-goog
Re: [RFC PATCH 10/12] mm/hugetlb: do explicit CMA balancing
Posted by Rik van Riel 2 weeks ago
On Mon, 2025-09-15 at 19:51 +0000, Frank van der Linden wrote:
> CMA areas are normally not very large, but HugeTLB CMA is an
> exception. hugetlb_cma, used for 'gigantic' pages (usually
> 1G), can take up many gigabytes of memory.
> 
> As such, it is potentially the largest source of 'false OOM'
> conditions,

The false OOM kills also seem to happen when a system
does not use hugetlbfs at all, but a cgroup simply has
most/all of its reclaimable memory in a CMA region,
and then tries to do a kernel allocation.

Would it make sense to call hugetlb_cma_balance() from
the pageout code instead, when the pageout code tried to
free non-CMA memory but ended up freeing mostly/only CMA
memory?
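
For illustration only (not code from this thread), such a hook
might take roughly this shape, assuming reclaim were taught to
track how much of what it freed was CMA memory:

	/*
	 * Rough sketch: nr_reclaimed_cma is an assumed counter;
	 * vmscan does not track reclaimed memory per migratetype
	 * today. hugetlb_cma_balance() is the helper added by the
	 * patch above.
	 */
	static void maybe_balance_cma(int nid, unsigned long nr_reclaimed,
				      unsigned long nr_reclaimed_cma)
	{
		/* Mostly CMA came back although non-CMA memory was wanted. */
		if (nr_reclaimed && nr_reclaimed_cma * 2 >= nr_reclaimed)
			hugetlb_cma_balance(nid);
	}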

-- 
All Rights Reversed.