[v6] page_alloc: allow migration of smaller hugepages during contig_alloc

[PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc

Posted by Gregory Price 1 month, 3 weeks ago

We presently skip regions with hugepages entirely when trying to do
contiguous page allocation.  This causes otherwise-movable 2MB HugeTLB
pages to be treated as unmovable, and makes 1GB gigantic page
allocation less reliable on systems that use both sizes.

Commit 4d73ba5fa710 ("mm: page_alloc: skip regions with hugetlbfs pages
when allocating 1G pages") skipped all regions containing HugeTLB pages
because migrating them can cause significant delays in 1G allocation
(HugeTLB migrations may fail for a number of reasons).

Instead, if hugepage migration is enabled, consider regions with
hugepages smaller than the target contiguous allocation request
as valid targets for allocation.

We optimize for the existing behavior by searching for non-hugetlb
regions in a first pass, then retrying the search to include hugetlb
only on failure.  This allows the existing fast-path to remain the
default case with a slow-path fallback to increase reliability.

We only fall back to the slow path if a hugetlb region was detected,
and we do a full re-scan because the zones/blocks may have changed
during the first pass (and it's not worth further complexity).

isolate_migrate_pages_block() has similar hugetlb filter logic, and
the hugetlb code does a migratable check in folio_isolate_hugetlb()
during isolation.  The code servicing the allocation and migration
already supports this exact use case.

To test, allocate a bunch of 2MB HugeTLB pages (in this case 48GB)
and then attempt to allocate some 1G HugeTLB pages (in this case 4GB).
Scale these numbers to your machine's memory capacity.

echo 24576 > .../hugepages-2048kB/nr_hugepages
echo 4 > .../hugepages-1048576kB/nr_hugepages

Prior to this patch, the 1GB page reservation can fail if no contiguous
1GB pages remain.  After this patch, the kernel will try to move 2MB
pages and successfully allocate the 1GB pages (assuming overall
sufficient memory is available).  I also tested this while a program
had the 2MB reservations mapped, and the 1GB reservation still succeeded.

folio_alloc_gigantic() is the primary user of alloc_contig_pages();
the other users are debug or init-time allocations and largely unaffected:
- ppc/memtrace is a debugfs interface
- x86/tdx memory allocation occurs once on module-init
- kfence/core happens once on module (late) init
- THP uses it in debug_vm_pgtable_alloc_huge_page at __init time

Suggested-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/linux-mm/6fe3562d-49b2-4975-aa86-e139c535ad00@redhat.com/
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 mm/page_alloc.c | 52 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a964..adf579a0df3e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7083,7 +7083,8 @@ static int __alloc_contig_pages(unsigned long start_pfn,
 }
 
 static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
-				   unsigned long nr_pages)
+				   unsigned long nr_pages, bool skip_hugetlb,
+				   bool *skipped_hugetlb)
 {
 	unsigned long i, end_pfn = start_pfn + nr_pages;
 	struct page *page;
@@ -7099,8 +7100,35 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
 		if (PageReserved(page))
 			return false;
 
-		if (PageHuge(page))
-			return false;
+		/*
+		 * Only consider ranges containing hugepages if those pages are
+		 * smaller than the requested contiguous region.  e.g.:
+		 *     Move 2MB pages to free up a 1GB range.
+		 *     Don't move 1GB pages to free up a 2MB range.
+		 *
+		 * This makes contiguous allocation more reliable if multiple
+		 * hugepage sizes are used without causing needless movement.
+		 */
+		if (PageHuge(page)) {
+			unsigned int order;
+
+			if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
+				return false;
+
+			if (skip_hugetlb) {
+				*skipped_hugetlb = true;
+				return false;
+			}
+
+			page = compound_head(page);
+			order = compound_order(page);
+			if ((order >= MAX_FOLIO_ORDER) ||
+			    (nr_pages <= (1 << order)))
+				return false;
+
+			/* No need to check the pfns for this page */
+			i += (1 << order) - 1;
+		}
 	}
 	return true;
 }
@@ -7143,7 +7171,10 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 	struct zonelist *zonelist;
 	struct zone *zone;
 	struct zoneref *z;
+	bool skip_hugetlb = true;
+	bool skipped_hugetlb = false;
 
+retry:
 	zonelist = node_zonelist(nid, gfp_mask);
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(gfp_mask), nodemask) {
@@ -7151,7 +7182,9 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 
 		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
 		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
-			if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
+			if (pfn_range_valid_contig(zone, pfn, nr_pages,
+						   skip_hugetlb,
+						   &skipped_hugetlb)) {
 				/*
 				 * We release the zone lock here because
 				 * alloc_contig_range() will also lock the zone
@@ -7170,6 +7203,17 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 		}
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
+	/*
+	 * If we failed, retry the search, but treat regions with HugeTLB pages
+	 * as valid targets.  This retains fast-allocations on first pass
+	 * without trying to migrate HugeTLB pages (which may fail). On the
+	 * second pass, we will try moving HugeTLB pages when those pages are
+	 * smaller than the requested contiguous region size.
+	 */
+	if (skip_hugetlb && skipped_hugetlb) {
+		skip_hugetlb = false;
+		goto retry;
+	}
 	return NULL;
 }
 #endif /* CONFIG_CONTIG_ALLOC */
-- 
2.52.0

Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc

Posted by Zi Yan 1 month, 2 weeks ago

On 18 Dec 2025, at 18:38, Gregory Price wrote:

> We presently skip regions with hugepages entirely when trying to do
> contiguous page allocation.  This will cause otherwise-movable
> 2MB HugeTLB pages to be considered unmovable, and makes 1GB gigantic
> page allocation less reliable on systems utilizing both.
>
> Commit 4d73ba5fa710 ("mm: page_alloc: skip regions with hugetlbfs pages
> when allocating 1G pages") skipped all HugePage containing regions
> because it can cause significant delays in 1G allocation (as HugeTLB
> migrations may fail for a number of reasons).
>
> Instead, if hugepage migration is enabled, consider regions with
> hugepages smaller than the target contiguous allocation request
> as valid targets for allocation.
>
> We optimize for the existing behavior by searching for non-hugetlb
> regions in a first pass, then retrying the search to include hugetlb
> only on failure.  This allows the existing fast-path to remain the
> default case with a slow-path fallback to increase reliability.
>
> We only fallback to the slow path if a hugetlb region was detected,
> and we do a full re-scan because the zones/blocks may have changed
> during the first pass (and it's not worth further complexity).
>
> isolate_migrate_pages_block() has similar hugetlb filter logic, and
> the hugetlb code does a migratable check in folio_isolate_hugetlb()
> during isolation.  The code servicing the allocation and migration
> already supports this exact use case.
>
> To test, allocate a bunch of 2MB HugeTLB pages (in this case 48GB)
> and then attempt to allocate some 1G HugeTLB pages (in this case 4GB)
> (Scale to your machine's memory capacity).
>
> echo 24576 > .../hugepages-2048kB/nr_hugepages
> echo 4 > .../hugepages-1048576kB/nr_hugepages
>
> Prior to this patch, the 1GB page reservation can fail if no contiguous
> 1GB pages remain.  After this patch, the kernel will try to move 2MB
> pages and successfully allocate the 1GB pages (assuming overall
> sufficient memory is available). Also tested this while a program had
> the 2MB reservations mapped, and the 1GB reservation still succeeds.
>
> folio_alloc_gigantic() is the primary user of alloc_contig_pages(),
> other users are debug or init-time allocations and largely unaffected.
> - ppc/memtrace is a debugfs interface
> - x86/tdx memory allocation occurs once on module-init
> - kfence/core happens once on module (late) init
> - THP uses it in debug_vm_pgtable_alloc_huge_page at __init time
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Link: https://lore.kernel.org/linux-mm/6fe3562d-49b2-4975-aa86-e139c535ad00@redhat.com/
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  mm/page_alloc.c | 52 +++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 822e05f1a964..adf579a0df3e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7083,7 +7083,8 @@ static int __alloc_contig_pages(unsigned long start_pfn,
>  }
>
>  static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
> -				   unsigned long nr_pages)
> +				   unsigned long nr_pages, bool skip_hugetlb,
> +				   bool *skipped_hugetlb)
>  {
>  	unsigned long i, end_pfn = start_pfn + nr_pages;
>  	struct page *page;
> @@ -7099,8 +7100,35 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
>  		if (PageReserved(page))
>  			return false;
>
> -		if (PageHuge(page))
> -			return false;
> +		/*
> +		 * Only consider ranges containing hugepages if those pages are
> +		 * smaller than the requested contiguous region.  e.g.:
> +		 *     Move 2MB pages to free up a 1GB range.
> +		 *     Don't move 1GB pages to free up a 2MB range.
> +		 *
> +		 * This makes contiguous allocation more reliable if multiple
> +		 * hugepage sizes are used without causing needless movement.
> +		 */
> +		if (PageHuge(page)) {
> +			unsigned int order;
> +
> +			if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
> +				return false;
> +
> +			if (skip_hugetlb) {
> +				*skipped_hugetlb = true;

I do not know whether we should check if skipped_hugetlb is NULL or not,
since pfn_range_valid_contig() is only called by alloc_contig_pages_noprof().
I have no strong opinion on an additional skipped_hugetlb check.

> +				return false;
> +			}
> +
> +			page = compound_head(page);
> +			order = compound_order(page);
> +			if ((order >= MAX_FOLIO_ORDER) ||
> +			    (nr_pages <= (1 << order)))
> +				return false;
> +
> +			/* No need to check the pfns for this page */
> +			i += (1 << order) - 1;
> +		}
>  	}
>  	return true;
>  }
> @@ -7143,7 +7171,10 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>  	struct zonelist *zonelist;
>  	struct zone *zone;
>  	struct zoneref *z;
> +	bool skip_hugetlb = true;
> +	bool skipped_hugetlb = false;
>
> +retry:
>  	zonelist = node_zonelist(nid, gfp_mask);
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>  					gfp_zone(gfp_mask), nodemask) {
> @@ -7151,7 +7182,9 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>
>  		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
>  		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
> -			if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
> +			if (pfn_range_valid_contig(zone, pfn, nr_pages,
> +						   skip_hugetlb,
> +						   &skipped_hugetlb)) {
>  				/*
>  				 * We release the zone lock here because
>  				 * alloc_contig_range() will also lock the zone
> @@ -7170,6 +7203,17 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>  		}
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  	}
> +	/*
> +	 * If we failed, retry the search, but treat regions with HugeTLB pages
> +	 * as valid targets.  This retains fast-allocations on first pass
> +	 * without trying to migrate HugeTLB pages (which may fail). On the
> +	 * second pass, we will try moving HugeTLB pages when those pages are
> +	 * smaller than the requested contiguous region size.
> +	 */
> +	if (skip_hugetlb && skipped_hugetlb) {
> +		skip_hugetlb = false;
> +		goto retry;
> +	}
>  	return NULL;
>  }
>  #endif /* CONFIG_CONTIG_ALLOC */
> -- 
> 2.52.0
Otherwise, LGTM.

Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc

Posted by Gregory Price 1 month, 2 weeks ago

On Fri, Dec 19, 2025 at 03:48:45PM -0500, Zi Yan wrote:
> On 18 Dec 2025, at 18:38, Gregory Price wrote:
> 
> > +		if (PageHuge(page)) {
> > +			unsigned int order;
> > +
> > +			if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
> > +				return false;
> > +
> > +			if (skip_hugetlb) {
> > +				*skipped_hugetlb = true;
> 
> I do not know whether we should check if skipped_hugetlb is NULL or not,
> since pfn_range_valid_contig() is only called by alloc_contig_pages_noprof().
> I have no strong opinion on an additional skipped_hugetlb check.
>

I'm fine either way if folks have a preference.  The compiler might even
optimize it out anyway after things get inlined.
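
For reference, the defensive variant under discussion could look like the
sketch below. This is only an illustration: note_skipped_hugetlb() is a
hypothetical helper name, not something in the patch, and the patch as
posted dereferences the pointer directly because the only caller always
passes a valid address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: record that a hugetlb
 * region was skipped, tolerating callers that pass NULL because they
 * do not care about the result.
 */
static inline void note_skipped_hugetlb(bool *skipped_hugetlb)
{
	if (skipped_hugetlb)
		*skipped_hugetlb = true;
}
```

With a single caller that always passes &skipped_hugetlb, the NULL test is
a constant after inlining, which is why either form should cost the same.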

> Otherwise, LGTM.
> 
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> 

Thanks again!
~Gregory

Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc

Posted by Wei Yang 1 month, 3 weeks ago

On Thu, Dec 18, 2025 at 06:38:04PM -0500, Gregory Price wrote:
>We presently skip regions with hugepages entirely when trying to do
>contiguous page allocation.  This will cause otherwise-movable
>2MB HugeTLB pages to be considered unmovable, and makes 1GB gigantic
>page allocation less reliable on systems utilizing both.
>
>Commit 4d73ba5fa710 ("mm: page_alloc: skip regions with hugetlbfs pages
>when allocating 1G pages") skipped all HugePage containing regions
>because it can cause significant delays in 1G allocation (as HugeTLB
>migrations may fail for a number of reasons).
>
>Instead, if hugepage migration is enabled, consider regions with
>hugepages smaller than the target contiguous allocation request
>as valid targets for allocation.
>
>We optimize for the existing behavior by searching for non-hugetlb
>regions in a first pass, then retrying the search to include hugetlb
>only on failure.  This allows the existing fast-path to remain the
>default case with a slow-path fallback to increase reliability.
>
>We only fallback to the slow path if a hugetlb region was detected,
>and we do a full re-scan because the zones/blocks may have changed
>during the first pass (and it's not worth further complexity).
>
>isolate_migrate_pages_block() has similar hugetlb filter logic, and
>the hugetlb code does a migratable check in folio_isolate_hugetlb()
>during isolation.  The code servicing the allocation and migration
>already supports this exact use case.
>
>To test, allocate a bunch of 2MB HugeTLB pages (in this case 48GB)
>and then attempt to allocate some 1G HugeTLB pages (in this case 4GB)
>(Scale to your machine's memory capacity).
>
>echo 24576 > .../hugepages-2048kB/nr_hugepages
>echo 4 > .../hugepages-1048576kB/nr_hugepages
>
>Prior to this patch, the 1GB page reservation can fail if no contiguous
>1GB pages remain.  After this patch, the kernel will try to move 2MB
>pages and successfully allocate the 1GB pages (assuming overall
>sufficient memory is available). Also tested this while a program had
>the 2MB reservations mapped, and the 1GB reservation still succeeds.
>
>folio_alloc_gigantic() is the primary user of alloc_contig_pages(),
>other users are debug or init-time allocations and largely unaffected.
>- ppc/memtrace is a debugfs interface
>- x86/tdx memory allocation occurs once on module-init
>- kfence/core happens once on module (late) init
>- THP uses it in debug_vm_pgtable_alloc_huge_page at __init time
>
>Suggested-by: David Hildenbrand <david@redhat.com>
>Link: https://lore.kernel.org/linux-mm/6fe3562d-49b2-4975-aa86-e139c535ad00@redhat.com/
>Signed-off-by: Gregory Price <gourry@gourry.net>
>---
> mm/page_alloc.c | 52 +++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 48 insertions(+), 4 deletions(-)
>
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 822e05f1a964..adf579a0df3e 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -7083,7 +7083,8 @@ static int __alloc_contig_pages(unsigned long start_pfn,
> }
> 
> static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
>-				   unsigned long nr_pages)
>+				   unsigned long nr_pages, bool skip_hugetlb,
>+				   bool *skipped_hugetlb)
> {
> 	unsigned long i, end_pfn = start_pfn + nr_pages;
> 	struct page *page;
>@@ -7099,8 +7100,35 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
> 		if (PageReserved(page))
> 			return false;
> 
>-		if (PageHuge(page))
>-			return false;
>+		/*
>+		 * Only consider ranges containing hugepages if those pages are
>+		 * smaller than the requested contiguous region.  e.g.:
>+		 *     Move 2MB pages to free up a 1GB range.
>+		 *     Don't move 1GB pages to free up a 2MB range.
>+		 *
>+		 * This makes contiguous allocation more reliable if multiple
>+		 * hugepage sizes are used without causing needless movement.
>+		 */
>+		if (PageHuge(page)) {
>+			unsigned int order;
>+
>+			if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
>+				return false;
>+
>+			if (skip_hugetlb) {
>+				*skipped_hugetlb = true;
>+				return false;
>+			}
>+
>+			page = compound_head(page);
>+			order = compound_order(page);

The order is taken from the head page.

>+			if ((order >= MAX_FOLIO_ORDER) ||
>+			    (nr_pages <= (1 << order)))
>+				return false;
>+
>+			/* No need to check the pfns for this page */
>+			i += (1 << order) - 1;

So this advance should be based on the head page instead of the original page, right?

>+		}
> 	}
> 	return true;
> }
>@@ -7143,7 +7171,10 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> 	struct zonelist *zonelist;
> 	struct zone *zone;
> 	struct zoneref *z;
>+	bool skip_hugetlb = true;
>+	bool skipped_hugetlb = false;
> 
>+retry:
> 	zonelist = node_zonelist(nid, gfp_mask);
> 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> 					gfp_zone(gfp_mask), nodemask) {
>@@ -7151,7 +7182,9 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> 
> 		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
> 		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
>-			if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
>+			if (pfn_range_valid_contig(zone, pfn, nr_pages,
>+						   skip_hugetlb,
>+						   &skipped_hugetlb)) {
> 				/*
> 				 * We release the zone lock here because
> 				 * alloc_contig_range() will also lock the zone
>@@ -7170,6 +7203,17 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> 		}
> 		spin_unlock_irqrestore(&zone->lock, flags);
> 	}
>+	/*
>+	 * If we failed, retry the search, but treat regions with HugeTLB pages
>+	 * as valid targets.  This retains fast-allocations on first pass
>+	 * without trying to migrate HugeTLB pages (which may fail). On the
>+	 * second pass, we will try moving HugeTLB pages when those pages are
>+	 * smaller than the requested contiguous region size.
>+	 */
>+	if (skip_hugetlb && skipped_hugetlb) {
>+		skip_hugetlb = false;
>+		goto retry;
>+	}
> 	return NULL;
> }
> #endif /* CONFIG_CONTIG_ALLOC */
>-- 
>2.52.0

-- 
Wei Yang
Help you, Help me

Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc

Posted by Gregory Price 1 month, 2 weeks ago

On Fri, Dec 19, 2025 at 12:08:00AM +0000, Wei Yang wrote:
> >+
> >+			page = compound_head(page);
> >+			order = compound_order(page);
> 
> The order is get from head page.
> 
> >+			if ((order >= MAX_FOLIO_ORDER) ||
> >+			    (nr_pages <= (1 << order)))
> >+				return false;
> >+
> >+			/* No need to check the pfns for this page */
> >+			i += (1 << order) - 1;
> 
> So this advance should based on "head page" instead of original page, right?
>

hm, I think the thought here was that since we're moving forward from
start of an aligned chunk, we'd never hit a non-head page - but this
may not be true.  

Will think about this for a bit.

~Gregory

Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc

Posted by Zi Yan 1 month, 2 weeks ago

On 19 Dec 2025, at 9:26, Gregory Price wrote:

> On Fri, Dec 19, 2025 at 12:08:00AM +0000, Wei Yang wrote:
>>> +
>>> +			page = compound_head(page);
>>> +			order = compound_order(page);
>>
>> The order is get from head page.
>>
>>> +			if ((order >= MAX_FOLIO_ORDER) ||
>>> +			    (nr_pages <= (1 << order)))
>>> +				return false;
>>> +
>>> +			/* No need to check the pfns for this page */
>>> +			i += (1 << order) - 1;
>>
>> So this advance should based on "head page" instead of original page, right?
>>
>
> hm, I think the thought here was that since we're moving forward from
> start of an aligned chunk, we'd never hit a non-head page - but this
> may not be true.
>
> Will think about this for a bit.

The sole caller of pfn_range_valid_contig(), alloc_contig_pages_noprof(),
scans from the beginning of a zone to the end, so pfn_range_valid_contig()
should see head pages all the time. The exception is when it scans the
middle of a 1GB hugetlb page while alloc_contig_pages_noprof() is asking
for a smaller nr_pages, like 2MB. But in that case, the if statement
above i += (1 << order) - 1 returns false without reaching it. Basically,
to get to i += ..., pfn_range_valid_contig() must be searching for an
nr_pages larger than the PageHuge page, and nr_pages is always a power of
two based on the alloc_contig_pages_noprof() requirement. That means
pfn_range_valid_contig() always sees such PageHuge pages as a whole
within the nr_pages range, and thus cannot see a tail PageHuge page at
the point of i += ....
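
The alignment argument above can be checked with a small userspace model.
This is only a sketch under stated assumptions: struct model_page,
place_hugepage() and scan_window() are hypothetical stand-ins for the real
struct page walk, hugepages are assumed to be naturally aligned to their
size, and the scan window is assumed to start on an nr_pages-aligned pfn
with nr_pages a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PFNS 1024

/* Userspace model of the few page properties the scan looks at. */
struct model_page {
	bool huge;		/* part of a modelled HugeTLB page */
	bool head;		/* first page of that hugepage */
	unsigned int order;	/* order of the hugepage, valid if huge */
};

/* Zero-initialized: every pfn starts out as a plain base page. */
static struct model_page mem[MODEL_PFNS];

/* Place a naturally aligned hugepage of the given order at pfn. */
static void place_hugepage(unsigned long pfn, unsigned int order)
{
	unsigned long i, n = 1UL << order;

	assert((pfn & (n - 1)) == 0 && pfn + n <= MODEL_PFNS);
	for (i = 0; i < n; i++)
		mem[pfn + i] = (struct model_page){ true, i == 0, order };
}

/*
 * Mirror the loop shape from the patch.  Returns how many times the
 * skip-ahead step was reached on a tail page, or -1 if the window was
 * rejected because a hugepage is not smaller than the request.
 */
static int scan_window(unsigned long start_pfn, unsigned long nr_pages)
{
	unsigned long i, end_pfn = start_pfn + nr_pages;
	int tail_hits = 0;

	for (i = start_pfn; i < end_pfn; i++) {
		const struct model_page *p = &mem[i];

		if (!p->huge)
			continue;
		if (nr_pages <= (1UL << p->order))
			return -1;
		if (!p->head)
			tail_hits++;
		/* No need to check the remaining pfns of this hugepage. */
		i += (1UL << p->order) - 1;
	}
	return tail_hits;
}
```

In this model a window that starts in the middle of a larger hugepage is
rejected by the size check, and a window that contains only smaller
hugepages reaches the skip-ahead step on head pages only.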

Best Regards,
Yan, Zi

Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc

Posted by Wei Yang 1 month, 2 weeks ago

On Fri, Dec 19, 2025 at 03:46:25PM -0500, Zi Yan wrote:
>On 19 Dec 2025, at 9:26, Gregory Price wrote:
>
>> On Fri, Dec 19, 2025 at 12:08:00AM +0000, Wei Yang wrote:
>>>> +
>>>> +			page = compound_head(page);
>>>> +			order = compound_order(page);
>>>
>>> The order is get from head page.
>>>
>>>> +			if ((order >= MAX_FOLIO_ORDER) ||
>>>> +			    (nr_pages <= (1 << order)))
>>>> +				return false;
>>>> +
>>>> +			/* No need to check the pfns for this page */
>>>> +			i += (1 << order) - 1;
>>>
>>> So this advance should based on "head page" instead of original page, right?
>>>
>>
>> hm, I think the thought here was that since we're moving forward from
>> start of an aligned chunk, we'd never hit a non-head page - but this
>> may not be true.
>>
>> Will think about this for a bit.
>
>The sole caller of pfn_range_valid_contig(), alloc_contig_pages_noprof(),
>scans from the beginning of a zone to the end. pfn_range_valid_contig()
>should see head pages all the time, except it scans in the middle of
>a 1GB hugetlb when alloc_contig_pages_noprof() is asking for a smaller
>nr_pages, like 2MB. But in that case, the if above i += (1 << order) - 1
>would return false without reaching it. Basically, to get to
>i += ..., pfn_range_valid_contig() needs to search for nr_pages larger
>than PageHuge(page) and nr_pages is always power of two based on
>alloc_contig_pages_noprof() requirement, but that means
>pfn_range_valid_contig() always sees such PageHuge pages as a whole
>within nr_pages range, thus cannot see a tail PageHuge page at the
>point of i += ....
>

Thanks, I think you are right. For the current use case, it is safe.

But I am not sure others would get it at first sight. For example, me :-)
Do you think it would be helpful to add a comment here?

Generally LGTM.

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

>Best Regards,
>Yan, Zi

-- 
Wei Yang
Help you, Help me

Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc

Posted by Gregory Price 1 month, 2 weeks ago

On Sat, Dec 20, 2025 at 06:37:38AM +0000, Wei Yang wrote:
> On Fri, Dec 19, 2025 at 03:46:25PM -0500, Zi Yan wrote:
> >
> >The sole caller of pfn_range_valid_contig(), alloc_contig_pages_noprof(),
> >scans from the beginning of a zone to the end. pfn_range_valid_contig()
> >should see head pages all the time, except it scans in the middle of
> >a 1GB hugetlb when alloc_contig_pages_noprof() is asking for a smaller
> >nr_pages, like 2MB. But in that case, the if above i += (1 << order) - 1
> >would return false without reaching it. Basically, to get to
> >i += ..., pfn_range_valid_contig() needs to search for nr_pages larger
> >than PageHuge(page) and nr_pages is always power of two based on
> >alloc_contig_pages_noprof() requirement, but that means
> >pfn_range_valid_contig() always sees such PageHuge pages as a whole
> >within nr_pages range, thus cannot see a tail PageHuge page at the
> >point of i += ....
> >
> 
> Thanks, I think you are right. For current use case, it is safe.
> 
> But I am not sure others could get it on first sight. For example, me :-)
> Do you think it would be helpful to add some comment here?
>

Can't hurt, I'll give this a v7 and collect the tags, thanks!

> Generally LGTM.
> 
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> 
> >Best Regards,
> >Yan, Zi
> 
> -- 
> Wei Yang
> Help you, Help me

Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc

Posted by Gregory Price 1 month, 2 weeks ago

On Fri, Dec 19, 2025 at 03:46:25PM -0500, Zi Yan wrote:
> On 19 Dec 2025, at 9:26, Gregory Price wrote:
> 
> > On Fri, Dec 19, 2025 at 12:08:00AM +0000, Wei Yang wrote:
> >>> +
> >>> +			page = compound_head(page);
> >>> +			order = compound_order(page);
> >>
> >> The order is get from head page.
> >>
> >>> +			if ((order >= MAX_FOLIO_ORDER) ||
> >>> +			    (nr_pages <= (1 << order)))
> >>> +				return false;
> >>> +
> >>> +			/* No need to check the pfns for this page */
> >>> +			i += (1 << order) - 1;
> >>
> >> So this advance should based on "head page" instead of original page, right?
> >>
> >
> > hm, I think the thought here was that since we're moving forward from
> > start of an aligned chunk, we'd never hit a non-head page - but this
> > may not be true.
> >
> > Will think about this for a bit.
> 
> The sole caller of pfn_range_valid_contig(), alloc_contig_pages_noprof(),
> scans from the beginning of a zone to the end. pfn_range_valid_contig()
> should see head pages all the time, except it scans in the middle of
> a 1GB hugetlb when alloc_contig_pages_noprof() is asking for a smaller
> nr_pages, like 2MB. But in that case, the if above i += (1 << order) - 1
> would return false without reaching it. Basically, to get to
> i += ..., pfn_range_valid_contig() needs to search for nr_pages larger
> than PageHuge(page) and nr_pages is always power of two based on
> alloc_contig_pages_noprof() requirement, but that means
> pfn_range_valid_contig() always sees such PageHuge pages as a whole
> within nr_pages range, thus cannot see a tail PageHuge page at the
> point of i += ....
> 

Thinking about this a bit more, it might be worthwhile to detect this
condition and just skip that hugepage in the calling code.

while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
	if (pfn_range_valid_contig(zone, pfn, nr_pages,
				   skip_hugetlb,
				   &skipped_hugetlb)) {
		... snip ...
	}
	pfn += nr_pages;
	/*
	 * TODO: If the last scanned page was a hugepage that caused
	 *       the zone to be invalid, skip the rest of that page
	 *       (e.g. if we hit a 1GB page trying to allocate a 2MB
	 *       page, skip the entire 1GB instead of scanning the
	 *       same page 1GB/2MB times).
	 */
	...
}

But this solves a different problem than this patch, so I will defer it.
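
To put a number on the TODO above, here is a userspace sketch that counts
how many windows the outer loop visits while crossing one blocking
hugepage, with and without the skip. It is an illustration only:
next_scan_pfn() and windows_visited() are hypothetical names, and the
sketch assumes the head pfn and order of the blocking hugepage are
reported back to the caller, which the patch does not do.

```c
#include <assert.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/*
 * Hypothetical helper: given the window that just failed and the
 * hugepage that blocked it, return the next window start worth scanning.
 */
static unsigned long next_scan_pfn(unsigned long pfn, unsigned long nr_pages,
				   unsigned long huge_head_pfn,
				   unsigned int huge_order)
{
	unsigned long huge_end = huge_head_pfn + (1UL << huge_order);
	unsigned long next = pfn + nr_pages;

	/* Jump past the blocking hugepage if it extends beyond this window. */
	if (huge_end > next)
		next = ALIGN_UP(huge_end, nr_pages);
	return next;
}

/* Count the windows visited while walking across one blocking hugepage. */
static unsigned long windows_visited(unsigned long nr_pages,
				     unsigned long huge_head_pfn,
				     unsigned int huge_order, int skip)
{
	unsigned long huge_end = huge_head_pfn + (1UL << huge_order);
	unsigned long pfn = huge_head_pfn, visited = 0;

	while (pfn < huge_end) {
		visited++;
		if (skip)
			pfn = next_scan_pfn(pfn, nr_pages, huge_head_pfn,
					    huge_order);
		else
			pfn += nr_pages;
	}
	return visited;
}
```

Assuming 4KB base pages, a 1GB hugepage is order 18 and a 2MB request is
512 pages, so windows_visited(512, 0, 18, 0) models the 1GB/2MB repeated
scans mentioned in the TODO, and passing skip = 1 models the proposed
shortcut.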

~Gregory

Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc

Posted by Gregory Price 1 month, 2 weeks ago

On Fri, Dec 19, 2025 at 03:46:25PM -0500, Zi Yan wrote:
> On 19 Dec 2025, at 9:26, Gregory Price wrote:
> 
> > Will think about this for a bit.
> 
> The sole caller of pfn_range_valid_contig(), alloc_contig_pages_noprof(),
> scans from the beginning of a zone to the end. pfn_range_valid_contig()
> should see head pages all the time, except it scans in the middle of
> a 1GB hugetlb when alloc_contig_pages_noprof() is asking for a smaller
> nr_pages, like 2MB. But in that case, the if above i += (1 << order) - 1
> would return false without reaching it. Basically, to get to
> i += ..., pfn_range_valid_contig() needs to search for nr_pages larger
> than PageHuge(page) and nr_pages is always power of two based on
> alloc_contig_pages_noprof() requirement, but that means
> pfn_range_valid_contig() always sees such PageHuge pages as a whole
> within nr_pages range, thus cannot see a tail PageHuge page at the
> point of i += ....
> 

Right, and we hold the zone lock here, so we shouldn't see a page
suddenly become a tail page mid-iteration.

I hadn't mentally worked through whether it was a good idea to encode
this behavior now with only one user, but I suppose there's no point in
optimizing for code that doesn't exist, so I agree.  This does seem
fine.

Thanks!
Gregory