[PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()

Muhammad Usama Anjum posted 3 patches 3 weeks ago
There is a newer version of this series
Posted by Muhammad Usama Anjum 3 weeks ago
From: Ryan Roberts <ryan.roberts@arm.com>

Decompose the range of order-0 pages to be freed into the largest
possible power-of-2 sized and aligned chunks and free them to the pcp or
buddy. This improves on the previous approach, which freed each order-0
page individually in a loop. Testing shows performance improved by
more than 10x in some cases.

Since each page is order-0, we must decrement each page's reference
count individually and only consider the page for freeing as part of a
high order chunk if the reference count drops to zero. Additionally,
free_pages_prepare() must be called for each individual order-0 page so
that the struct page state and global accounting state can be
appropriately managed. But once this is done, the resulting high order
chunks can be freed as a unit to the pcp or buddy.

This significantly speeds up the free operation but also has the side
benefit that high order blocks are added to the pcp instead of each page
ending up on the pcp order-0 list; memory remains more readily available
in high orders.

vmalloc will shortly become a user of this new optimized
free_contig_range() since it aggressively allocates high order
non-compound pages, but then calls split_page() to end up with
contiguous order-0 pages. These can now be freed much more efficiently.

The execution time of the following function was measured on a
server-class arm64 machine:

static int page_alloc_high_order_test(void)
{
	unsigned int order = HPAGE_PMD_ORDER;
	struct page *page;
	int i;

	for (i = 0; i < 100000; i++) {
		page = alloc_pages(GFP_KERNEL, order);
		if (!page)
			return -1;
		split_page(page, order);
		free_contig_range(page_to_pfn(page), 1UL << order);
	}

	return 0;
}

Execution time before: 4097358 usec
Execution time after:   729831 usec

Perf trace before:

    99.63%     0.00%  kthreadd         [kernel.kallsyms]      [.] kthread
            |
            ---kthread
               0xffffb33c12a26af8
               |
               |--98.13%--0xffffb33c12a26060
               |          |
               |          |--97.37%--free_contig_range
               |          |          |
               |          |          |--94.93%--___free_pages
               |          |          |          |
               |          |          |          |--55.42%--__free_frozen_pages
               |          |          |          |          |
               |          |          |          |           --43.20%--free_frozen_page_commit
               |          |          |          |                     |
               |          |          |          |                      --35.37%--_raw_spin_unlock_irqrestore
               |          |          |          |
               |          |          |          |--11.53%--_raw_spin_trylock
               |          |          |          |
               |          |          |          |--8.19%--__preempt_count_dec_and_test
               |          |          |          |
               |          |          |          |--5.64%--_raw_spin_unlock
               |          |          |          |
               |          |          |          |--2.37%--__get_pfnblock_flags_mask.isra.0
               |          |          |          |
               |          |          |           --1.07%--free_frozen_page_commit
               |          |          |
               |          |           --1.54%--__free_frozen_pages
               |          |
               |           --0.77%--___free_pages
               |
                --0.98%--0xffffb33c12a26078
                          alloc_pages_noprof

Perf trace after:

     8.42%     2.90%  kthreadd         [kernel.kallsyms]         [k] __free_contig_range
            |
            |--5.52%--__free_contig_range
            |          |
            |          |--5.00%--free_prepared_contig_range
            |          |          |
            |          |          |--1.43%--__free_frozen_pages
            |          |          |          |
            |          |          |           --0.51%--free_frozen_page_commit
            |          |          |
            |          |          |--1.08%--_raw_spin_trylock
            |          |          |
            |          |           --0.89%--_raw_spin_unlock
            |          |
            |           --0.52%--free_pages_prepare
            |
             --2.90%--ret_from_fork
                       kthread
                       0xffffae1c12abeaf8
                       0xffffae1c12abe7a0
                       |
                        --2.69%--vfree
                                  __free_contig_range

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
---
Changes since v1:
- Rebase on mm-new
- Move FPI_PREPARED check inside __free_pages_prepare() now that
  fpi_flags are already being passed.
- Add todo (Zi Yan)
- Rerun benchmarks
- Convert VM_BUG_ON_PAGE() to VM_WARN_ON_ONCE()
- Rework order calculation in free_prepared_contig_range() and use
  MAX_PAGE_ORDER as high limit instead of pageblock_order as it must
  be up to internal __free_frozen_pages() how it frees them
---
 include/linux/gfp.h |   2 +
 mm/page_alloc.c     | 110 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 108 insertions(+), 4 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f82d74a77cad8..96ac7aae370c4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -467,6 +467,8 @@ void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages);
 void free_contig_range(unsigned long pfn, unsigned long nr_pages);
 #endif
 
+unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
+
 DEFINE_FREE(free_page, void *, free_page((unsigned long)_T))
 
 #endif /* __LINUX_GFP_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 75ee81445640b..6a9430f720579 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
 /* Free the page without taking locks. Rely on trylock only. */
 #define FPI_TRYLOCK		((__force fpi_t)BIT(2))
 
+/*
+ * free_pages_prepare() has already been called for page(s) being freed.
+ * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
+ * (HWPoison, PageNetpp, bad free page).
+ */
+#define FPI_PREPARED		((__force fpi_t)BIT(3))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1310,6 +1317,9 @@ __always_inline bool __free_pages_prepare(struct page *page,
 	bool compound = PageCompound(page);
 	struct folio *folio = page_folio(page);
 
+	if (fpi_flags & FPI_PREPARED)
+		return true;
+
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	trace_mm_page_free(page, order);
@@ -1579,8 +1589,10 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	unsigned long pfn = page_to_pfn(page);
 	struct zone *zone = page_zone(page);
 
-	if (__free_pages_prepare(page, order, fpi_flags))
-		free_one_page(zone, page, pfn, order, fpi_flags);
+	if (!__free_pages_prepare(page, order, fpi_flags))
+		return;
+
+	free_one_page(zone, page, pfn, order, fpi_flags);
 }
 
 void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -6784,6 +6796,93 @@ void __init page_alloc_sysctl_init(void)
 	register_sysctl_init("vm", page_alloc_sysctl_table);
 }
 
+static void free_prepared_contig_range(struct page *page,
+				       unsigned long nr_pages)
+{
+	while (nr_pages) {
+		unsigned int order;
+		unsigned long pfn;
+
+		pfn = page_to_pfn(page);
+		/* We are limited by the largest buddy order. */
+		order = pfn ? __ffs(pfn) : MAX_PAGE_ORDER;
+		/* Don't exceed the number of pages to free. */
+		order = min(order, ilog2(nr_pages));
+		order = min_t(unsigned int, order, MAX_PAGE_ORDER);
+
+		/*
+		 * Free the chunk as a single block. Our caller has already
+		 * called free_pages_prepare() for each order-0 page.
+		 */
+		__free_frozen_pages(page, order, FPI_PREPARED);
+
+		page += 1UL << order;
+		nr_pages -= 1UL << order;
+	}
+}
+
+/**
+ * __free_contig_range - Free contiguous range of order-0 pages.
+ * @pfn: Page frame number of the first page in the range.
+ * @nr_pages: Number of pages to free.
+ *
+ * For each order-0 struct page in the physically contiguous range, put a
+ * reference. Free any page whose reference count falls to zero. The
+ * implementation is functionally equivalent to, but significantly faster than
+ * calling __free_page() for each struct page in a loop.
+ *
+ * Memory allocated with alloc_pages(order>=1) and subsequently split to
+ * order-0 with split_page() is an example of contiguous pages that can be
+ * freed with this API.
+ *
+ * Returns the number of pages which were not freed because their reference
+ * count did not fall to zero.
+ *
+ * Context: May be called in interrupt context or while holding a normal
+ * spinlock, but not in NMI context or while holding a raw spinlock.
+ */
+unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
+{
+	struct page *page = pfn_to_page(pfn);
+	unsigned long not_freed = 0;
+	struct page *start = NULL;
+	unsigned long i;
+	bool can_free;
+
+	/*
+	 * Chunk the range into contiguous runs of pages for which the refcount
+	 * went to zero and for which free_pages_prepare() succeeded. If
+	 * free_pages_prepare() fails we consider the page to have been freed;
+	 * deliberately leak it.
+	 *
+	 * Code assumes contiguous PFNs have contiguous struct pages, but not
+	 * vice versa.
+	 */
+	for (i = 0; i < nr_pages; i++, page++) {
+		VM_WARN_ON_ONCE(PageHead(page));
+		VM_WARN_ON_ONCE(PageTail(page));
+
+		can_free = put_page_testzero(page);
+		if (!can_free)
+			not_freed++;
+		else if (!free_pages_prepare(page, 0))
+			can_free = false;
+
+		if (!can_free && start) {
+			free_prepared_contig_range(start, page - start);
+			start = NULL;
+		} else if (can_free && !start) {
+			start = page;
+		}
+	}
+
+	if (start)
+		free_prepared_contig_range(start, page - start);
+
+	return not_freed;
+}
+EXPORT_SYMBOL(__free_contig_range);
+
 #ifdef CONFIG_CONTIG_ALLOC
 /* Usage: See admin-guide/dynamic-debug-howto.rst */
 static void alloc_contig_dump_pages(struct list_head *page_list)
@@ -7327,11 +7426,14 @@ EXPORT_SYMBOL(free_contig_frozen_range);
  */
 void free_contig_range(unsigned long pfn, unsigned long nr_pages)
 {
+	unsigned long count;
+
 	if (WARN_ON_ONCE(PageHead(pfn_to_page(pfn))))
 		return;
 
-	for (; nr_pages--; pfn++)
-		__free_page(pfn_to_page(pfn));
+	count = __free_contig_range(pfn, nr_pages);
+	WARN(count != 0, "%lu pages are still in use!\n", count);
+
 }
 EXPORT_SYMBOL(free_contig_range);
 #endif /* CONFIG_CONTIG_ALLOC */
-- 
2.47.3
Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
Posted by Vlastimil Babka 3 weeks ago
On 3/16/26 12:31, Muhammad Usama Anjum wrote:
> From: Ryan Roberts <ryan.roberts@arm.com>
> 
> [...]
> @@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
>  /* Free the page without taking locks. Rely on trylock only. */
>  #define FPI_TRYLOCK		((__force fpi_t)BIT(2))
>  
> +/*
> + * free_pages_prepare() has already been called for page(s) being freed.
> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
> + * (HWPoison, PageNetpp, bad free page).
> + */

I'm confused, and reading the v1 thread didn't help either. Where would the
subpages to check come from? AFAICS we start from order-0 pages always.
__free_contig_range calls free_pages_prepare on every page with order 0
unconditionally, so we check every page as an order-0 page. If we then free
the bunch of individually checked pages as a high-order page, there's no
reason to check those subpages again, no? Am I missing something?

> +#define FPI_PREPARED		((__force fpi_t)BIT(3))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -1310,6 +1317,9 @@ __always_inline bool __free_pages_prepare(struct page *page,
>  	bool compound = PageCompound(page);
>  	struct folio *folio = page_folio(page);
>  
> +	if (fpi_flags & FPI_PREPARED)
> +		return true;
> +
>  	VM_BUG_ON_PAGE(PageTail(page), page);
>  
>  	trace_mm_page_free(page, order);
> @@ -1579,8 +1589,10 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>  	unsigned long pfn = page_to_pfn(page);
>  	struct zone *zone = page_zone(page);
>  
> -	if (__free_pages_prepare(page, order, fpi_flags))
> -		free_one_page(zone, page, pfn, order, fpi_flags);
> +	if (!__free_pages_prepare(page, order, fpi_flags))
> +		return;
> +
> +	free_one_page(zone, page, pfn, order, fpi_flags);

This is not a functional change, can we drop it?

>  }
>  
>  void __meminit __free_pages_core(struct page *page, unsigned int order,
> @@ -6784,6 +6796,93 @@ void __init page_alloc_sysctl_init(void)
>  	register_sysctl_init("vm", page_alloc_sysctl_table);
>  }
>  
> +static void free_prepared_contig_range(struct page *page,
> +				       unsigned long nr_pages)
> +{
> +	while (nr_pages) {
> +		unsigned int order;
> +		unsigned long pfn;
> +
> +		pfn = page_to_pfn(page);
> +		/* We are limited by the largest buddy order. */
> +		order = pfn ? __ffs(pfn) : MAX_PAGE_ORDER;
> +		/* Don't exceed the number of pages to free. */
> +		order = min(order, ilog2(nr_pages));
> +		order = min_t(unsigned int, order, MAX_PAGE_ORDER);
> +
> +		/*
> +		 * Free the chunk as a single block. Our caller has already
> +		 * called free_pages_prepare() for each order-0 page.
> +		 */
> +		__free_frozen_pages(page, order, FPI_PREPARED);
> +
> +		page += 1UL << order;
> +		nr_pages -= 1UL << order;
> +	}
> +}
> +
> +/**
> + * __free_contig_range - Free contiguous range of order-0 pages.
> + * @pfn: Page frame number of the first page in the range.
> + * @nr_pages: Number of pages to free.
> + *
> + * For each order-0 struct page in the physically contiguous range, put a
> + * reference. Free any page who's reference count falls to zero. The
> + * implementation is functionally equivalent to, but significantly faster than
> + * calling __free_page() for each struct page in a loop.
> + *
> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
> + * order-0 with split_page() is an example of appropriate contiguous pages that
> + * can be freed with this API.
> + *
> + * Returns the number of pages which were not freed, because their reference
> + * count did not fall to zero.

We probably don't need this part.

> + *
> + * Context: May be called in interrupt context or while holding a normal
> + * spinlock, but not in NMI context or while holding a raw spinlock.
> + */
> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	unsigned long not_freed = 0;
> +	struct page *start = NULL;
> +	unsigned long i;
> +	bool can_free;
> +
> +	/*
> +	 * Chunk the range into contiguous runs of pages for which the refcount
> +	 * went to zero and for which free_pages_prepare() succeeded. If
> +	 * free_pages_prepare() fails we consider the page to have been freed;
> +	 * deliberately leak it.
> +	 *
> +	 * Code assumes contiguous PFNs have contiguous struct pages, but not
> +	 * vice versa.
> +	 */
> +	for (i = 0; i < nr_pages; i++, page++) {
> +		VM_WARN_ON_ONCE(PageHead(page));
> +		VM_WARN_ON_ONCE(PageTail(page));
> +
> +		can_free = put_page_testzero(page);
> +		if (!can_free)
> +			not_freed++;
> +		else if (!free_pages_prepare(page, 0))
> +			can_free = false;
> +
> +		if (!can_free && start) {
> +			free_prepared_contig_range(start, page - start);
> +			start = NULL;
> +		} else if (can_free && !start) {
> +			start = page;
> +		}
> +	}
> +
> +	if (start)
> +		free_prepared_contig_range(start, page - start);
> +
> +	return not_freed;
> +}
> +EXPORT_SYMBOL(__free_contig_range);
> +
>  #ifdef CONFIG_CONTIG_ALLOC
>  /* Usage: See admin-guide/dynamic-debug-howto.rst */
>  static void alloc_contig_dump_pages(struct list_head *page_list)
> @@ -7327,11 +7426,14 @@ EXPORT_SYMBOL(free_contig_frozen_range);
>   */
>  void free_contig_range(unsigned long pfn, unsigned long nr_pages)
>  {
> +	unsigned long count;
> +
>  	if (WARN_ON_ONCE(PageHead(pfn_to_page(pfn))))
>  		return;
>  
> -	for (; nr_pages--; pfn++)
> -		__free_page(pfn_to_page(pfn));
> +	count = __free_contig_range(pfn, nr_pages);
> +	WARN(count != 0, "%lu pages are still in use!\n", count);

And we almost certainly don't want this warning. Spurious temporary page
refcount increases (get_page_unless_zero()) can happen e.g. due to memory
compaction pfn scanners. It just might mean that side will be then the last
one to drop the refcount and freeing the order-0 page. For us it means only
that we abort and restart the batching, so we get worse performance, but
functionally it's ok, and should be very rare anyway.

> +
>  }
>  EXPORT_SYMBOL(free_contig_range);
>  #endif /* CONFIG_CONTIG_ALLOC */
Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
Posted by Muhammad Usama Anjum 3 weeks ago
On 16/03/2026 3:21 pm, Vlastimil Babka wrote:
> On 3/16/26 12:31, Muhammad Usama Anjum wrote:
>> From: Ryan Roberts <ryan.roberts@arm.com>
>>
>> [...]
>> @@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
>>  /* Free the page without taking locks. Rely on trylock only. */
>>  #define FPI_TRYLOCK		((__force fpi_t)BIT(2))
>>  
>> +/*
>> + * free_pages_prepare() has already been called for page(s) being freed.
>> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
>> + * (HWPoison, PageNetpp, bad free page).
>> + */
> 
> I'm confused, and reading the v1 thread didn't help either. Where would the
> subpages to check come from? AFAICS we start from order-0 pages always.
> __free_contig_range calls free_pages_prepare on every page with order 0
> unconditionally, so we check every page as an order-0 page. If we then free
> the bunch of individually checked pages as a high-order page, there's no
> reason to check those subpages again, no? Am I missing something?
Zi Yan replied in separate thread. Let's continue this discussion there.

> 
>> +#define FPI_PREPARED		((__force fpi_t)BIT(3))
>> +
>>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>>  static DEFINE_MUTEX(pcp_batch_high_lock);
>>  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
>> @@ -1310,6 +1317,9 @@ __always_inline bool __free_pages_prepare(struct page *page,
>>  	bool compound = PageCompound(page);
>>  	struct folio *folio = page_folio(page);
>>  
>> +	if (fpi_flags & FPI_PREPARED)
>> +		return true;
>> +
>>  	VM_BUG_ON_PAGE(PageTail(page), page);
>>  
>>  	trace_mm_page_free(page, order);
>> @@ -1579,8 +1589,10 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>>  	unsigned long pfn = page_to_pfn(page);
>>  	struct zone *zone = page_zone(page);
>>  
>> -	if (__free_pages_prepare(page, order, fpi_flags))
>> -		free_one_page(zone, page, pfn, order, fpi_flags);
>> +	if (!__free_pages_prepare(page, order, fpi_flags))
>> +		return;
>> +
>> +	free_one_page(zone, page, pfn, order, fpi_flags);
> 
> This is not a functional change, can we drop it?
Yes, I'll drop it in the next version.

> 
>>  }
>>  
>>  void __meminit __free_pages_core(struct page *page, unsigned int order,
>> @@ -6784,6 +6796,93 @@ void __init page_alloc_sysctl_init(void)
>>  	register_sysctl_init("vm", page_alloc_sysctl_table);
>>  }
>>  
>> +static void free_prepared_contig_range(struct page *page,
>> +				       unsigned long nr_pages)
>> +{
>> +	while (nr_pages) {
>> +		unsigned int order;
>> +		unsigned long pfn;
>> +
>> +		pfn = page_to_pfn(page);
>> +		/* We are limited by the largest buddy order. */
>> +		order = pfn ? __ffs(pfn) : MAX_PAGE_ORDER;
>> +		/* Don't exceed the number of pages to free. */
>> +		order = min(order, ilog2(nr_pages));
>> +		order = min_t(unsigned int, order, MAX_PAGE_ORDER);
>> +
>> +		/*
>> +		 * Free the chunk as a single block. Our caller has already
>> +		 * called free_pages_prepare() for each order-0 page.
>> +		 */
>> +		__free_frozen_pages(page, order, FPI_PREPARED);
>> +
>> +		page += 1UL << order;
>> +		nr_pages -= 1UL << order;
>> +	}
>> +}
>> +
>> +/**
>> + * __free_contig_range - Free contiguous range of order-0 pages.
>> + * @pfn: Page frame number of the first page in the range.
>> + * @nr_pages: Number of pages to free.
>> + *
>> + * For each order-0 struct page in the physically contiguous range, put a
>> + * reference. Free any page whose reference count falls to zero. The
>> + * implementation is functionally equivalent to, but significantly faster than
>> + * calling __free_page() for each struct page in a loop.
>> + *
>> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
>> + * order-0 with split_page() is an example of appropriate contiguous pages that
>> + * can be freed with this API.
>> + *
>> + * Returns the number of pages which were not freed, because their reference
>> + * count did not fall to zero.
> 
> We probably don't need this part.
The only user of this return value is free_contig_range(). Your explanation
below makes sense. I'll drop the return value and clean up free_contig_range()
as well.

> 
>> + *
>> + * Context: May be called in interrupt context or while holding a normal
>> + * spinlock, but not in NMI context or while holding a raw spinlock.
>> + */
>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>> +{
>> +	struct page *page = pfn_to_page(pfn);
>> +	unsigned long not_freed = 0;
>> +	struct page *start = NULL;
>> +	unsigned long i;
>> +	bool can_free;
>> +
>> +	/*
>> +	 * Chunk the range into contiguous runs of pages for which the refcount
>> +	 * went to zero and for which free_pages_prepare() succeeded. If
>> +	 * free_pages_prepare() fails we consider the page to have been freed;
>> +	 * deliberately leak it.
>> +	 *
>> +	 * Code assumes contiguous PFNs have contiguous struct pages, but not
>> +	 * vice versa.
>> +	 */
>> +	for (i = 0; i < nr_pages; i++, page++) {
>> +		VM_WARN_ON_ONCE(PageHead(page));
>> +		VM_WARN_ON_ONCE(PageTail(page));
>> +
>> +		can_free = put_page_testzero(page);
>> +		if (!can_free)
>> +			not_freed++;
>> +		else if (!free_pages_prepare(page, 0))
>> +			can_free = false;
>> +
>> +		if (!can_free && start) {
>> +			free_prepared_contig_range(start, page - start);
>> +			start = NULL;
>> +		} else if (can_free && !start) {
>> +			start = page;
>> +		}
>> +	}
>> +
>> +	if (start)
>> +		free_prepared_contig_range(start, page - start);
>> +
>> +	return not_freed;
>> +}
>> +EXPORT_SYMBOL(__free_contig_range);
>> +
>>  #ifdef CONFIG_CONTIG_ALLOC
>>  /* Usage: See admin-guide/dynamic-debug-howto.rst */
>>  static void alloc_contig_dump_pages(struct list_head *page_list)
>> @@ -7327,11 +7426,14 @@ EXPORT_SYMBOL(free_contig_frozen_range);
>>   */
>>  void free_contig_range(unsigned long pfn, unsigned long nr_pages)
>>  {
>> +	unsigned long count;
>> +
>>  	if (WARN_ON_ONCE(PageHead(pfn_to_page(pfn))))
>>  		return;
>>  
>> -	for (; nr_pages--; pfn++)
>> -		__free_page(pfn_to_page(pfn));
>> +	count = __free_contig_range(pfn, nr_pages);
>> +	WARN(count != 0, "%lu pages are still in use!\n", count);
> 
> And we almost certainly don't want this warning. Spurious temporary page
> refcount increases (get_page_unless_zero()) can happen e.g. due to memory
> compaction pfn scanners. It just might mean that side will then be the last
> one to drop the refcount, freeing the order-0 page. For us it means only
> that we abort and restart the batching, so we get worse performance, but
> functionally it's ok, and should be very rare anyway.
> 
>> +
>>  }
>>  EXPORT_SYMBOL(free_contig_range);
>>  #endif /* CONFIG_CONTIG_ALLOC */
>
Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
Posted by Zi Yan 3 weeks ago
On 16 Mar 2026, at 11:21, Vlastimil Babka wrote:

> On 3/16/26 12:31, Muhammad Usama Anjum wrote:
>> From: Ryan Roberts <ryan.roberts@arm.com>
>>
>> Decompose the range of order-0 pages to be freed into the set of largest
>> possible power-of-2 size and aligned chunks and free them to the pcp or
>> buddy. This improves on the previous approach which freed each order-0
>> page individually in a loop. Testing shows performance to be improved by
>> more than 10x in some cases.
>>
>> Since each page is order-0, we must decrement each page's reference
>> count individually and only consider the page for freeing as part of a
>> high order chunk if the reference count goes to zero. Additionally
>> free_pages_prepare() must be called for each individual order-0 page
>> too, so that the struct page state and global accounting state can be
>> appropriately managed. But once this is done, the resulting high order
>> chunks can be freed as a unit to the pcp or buddy.
>>
>> This significantly speeds up the free operation but also has the side
>> benefit that high order blocks are added to the pcp instead of each page
>> ending up on the pcp order-0 list; memory remains more readily available
>> in high orders.
>>
>> vmalloc will shortly become a user of this new optimized
>> free_contig_range() since it aggressively allocates high order
>> non-compound pages, but then calls split_page() to end up with
>> contiguous order-0 pages. These can now be freed much more efficiently.
>>
>> The execution time of the following function was measured in a server
>> class arm64 machine:
>>
>> static int page_alloc_high_order_test(void)
>> {
>> 	unsigned int order = HPAGE_PMD_ORDER;
>> 	struct page *page;
>> 	int i;
>>
>> 	for (i = 0; i < 100000; i++) {
>> 		page = alloc_pages(GFP_KERNEL, order);
>> 		if (!page)
>> 			return -1;
>> 		split_page(page, order);
>> 		free_contig_range(page_to_pfn(page), 1UL << order);
>> 	}
>>
>> 	return 0;
>> }
>>
>> Execution time before: 4097358 usec
>> Execution time after:   729831 usec
>>
>> Perf trace before:
>>
>>     99.63%     0.00%  kthreadd         [kernel.kallsyms]      [.] kthread
>>             |
>>             ---kthread
>>                0xffffb33c12a26af8
>>                |
>>                |--98.13%--0xffffb33c12a26060
>>                |          |
>>                |          |--97.37%--free_contig_range
>>                |          |          |
>>                |          |          |--94.93%--___free_pages
>>                |          |          |          |
>>                |          |          |          |--55.42%--__free_frozen_pages
>>                |          |          |          |          |
>>                |          |          |          |           --43.20%--free_frozen_page_commit
>>                |          |          |          |                     |
>>                |          |          |          |                      --35.37%--_raw_spin_unlock_irqrestore
>>                |          |          |          |
>>                |          |          |          |--11.53%--_raw_spin_trylock
>>                |          |          |          |
>>                |          |          |          |--8.19%--__preempt_count_dec_and_test
>>                |          |          |          |
>>                |          |          |          |--5.64%--_raw_spin_unlock
>>                |          |          |          |
>>                |          |          |          |--2.37%--__get_pfnblock_flags_mask.isra.0
>>                |          |          |          |
>>                |          |          |           --1.07%--free_frozen_page_commit
>>                |          |          |
>>                |          |           --1.54%--__free_frozen_pages
>>                |          |
>>                |           --0.77%--___free_pages
>>                |
>>                 --0.98%--0xffffb33c12a26078
>>                           alloc_pages_noprof
>>
>> Perf trace after:
>>
>>      8.42%     2.90%  kthreadd         [kernel.kallsyms]         [k] __free_contig_range
>>             |
>>             |--5.52%--__free_contig_range
>>             |          |
>>             |          |--5.00%--free_prepared_contig_range
>>             |          |          |
>>             |          |          |--1.43%--__free_frozen_pages
>>             |          |          |          |
>>             |          |          |           --0.51%--free_frozen_page_commit
>>             |          |          |
>>             |          |          |--1.08%--_raw_spin_trylock
>>             |          |          |
>>             |          |           --0.89%--_raw_spin_unlock
>>             |          |
>>             |           --0.52%--free_pages_prepare
>>             |
>>              --2.90%--ret_from_fork
>>                        kthread
>>                        0xffffae1c12abeaf8
>>                        0xffffae1c12abe7a0
>>                        |
>>                         --2.69%--vfree
>>                                   __free_contig_range
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> ---
>> Changes since v1:
>> - Rebase on mm-new
>> - Move FPI_PREPARED check inside __free_pages_prepare() now that
>>   fpi_flags are already being passed.
>> - Add todo (Zi Yan)
>> - Rerun benchmarks
>> - Convert VM_BUG_ON_PAGE() to VM_WARN_ON_ONCE()
>> - Rework order calculation in free_prepared_contig_range() and use
>>   MAX_PAGE_ORDER as high limit instead of pageblock_order as it must
>>   be up to internal __free_frozen_pages() how it frees them
>> ---
>>  include/linux/gfp.h |   2 +
>>  mm/page_alloc.c     | 110 ++++++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 108 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index f82d74a77cad8..96ac7aae370c4 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -467,6 +467,8 @@ void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages);
>>  void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>  #endif
>>
>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
>> +
>>  DEFINE_FREE(free_page, void *, free_page((unsigned long)_T))
>>
>>  #endif /* __LINUX_GFP_H */
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 75ee81445640b..6a9430f720579 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
>>  /* Free the page without taking locks. Rely on trylock only. */
>>  #define FPI_TRYLOCK		((__force fpi_t)BIT(2))
>>
>> +/*
>> + * free_pages_prepare() has already been called for page(s) being freed.
>> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
>> + * (HWPoison, PageNetpp, bad free page).
>> + */
>
> I'm confused, and reading the v1 thread didn't help either. Where would the
> subpages to check come from? AFAICS we start from order-0 pages always.
> __free_contig_range calls free_pages_prepare on every page with order 0
> unconditionally, so we check every page as an order-0 page. If we then free
> the bunch of individually checked pages as a high-order page, there's no
> reason to check those subpages again, no? Am I missing something?

There are two kinds of order > 0 pages, compound and not compound.
free_pages_prepare() checks all tail pages of a compound order > 0 pages too.
For non compound ones, free_pages_prepare() only has free_page_is_bad()
check on tail ones.

So my guess is that the TODO is to check all subpages on a non compound
order > 0 one in the same manner. This is based on the assumption that
all non compound order > 0 page users use split_page() after the allocation,
treat each page individually, and free them back altogether. But I am not
sure if this is true for all users allocating non compound order > 0 pages.
And free_pages_prepare_bulk() might be a better name for such functions.

The above confusion is also a reason I asked Ryan to try adding an unsplit_page()
function to fuse back non compound order > 0 pages and free the fused one
as we are currently doing. But that looks like a pain to implement. Maybe an
alternative to this FPI_PREPARED is to add FPI_FREE_BULK and loop through all
subpages if FPI_FREE_BULK is set with
__free_pages_prepare(page + i, 0, fpi_flags & ~FPI_FREE_BULK) in
__free_pages_ok().


Best Regards,
Yan, Zi
Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
Posted by Vlastimil Babka (SUSE) 3 weeks ago
On 3/16/26 17:02, Zi Yan wrote:
> On 16 Mar 2026, at 11:21, Vlastimil Babka wrote:
> 
>>> +/*
>>> + * free_pages_prepare() has already been called for page(s) being freed.
>>> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
>>> + * (HWPoison, PageNetpp, bad free page).
>>> + */
>>
>> I'm confused, and reading the v1 thread didn't help either. Where would the
>> subpages to check come from? AFAICS we start from order-0 pages always.
>> __free_contig_range calls free_pages_prepare on every page with order 0
>> unconditionally, so we check every page as an order-0 page. If we then free
>> the bunch of individually checked pages as a high-order page, there's no
>> reason to check those subpages again, no? Am I missing something?
> 
> There are two kinds of order > 0 pages, compound and not compound.
> free_pages_prepare() checks all tail pages of a compound order > 0 pages too.
> For non compound ones, free_pages_prepare() only has free_page_is_bad()
> check on tail ones.
> 
> So my guess is that the TODO is to check all subpages on a non compound
> order > 0 one in the same manner. This is based on the assumption that

OK but:

1) Why put that TODO specifically on FPI_PREPARED definition, which is for
the case we skip the prepare/check?
2) Why add it in this series which AFAICS doesn't handle non-compound
order>0 anywhere.
3) We'd better work on eliminating the non-compound order>0 usages
altogether, rather than work on supporting them better.

> all non compound order > 0 page users use split_page() after the allocation,
> treat each page individually, and free them back altogether. But I am not
> sure if this is true for all users allocating non compound order > 0 pages.

Maybe as part of the elimination (point 3 above) we should combine the
allocation+split so it's never the first without the second anymore.

> And free_pages_prepare_bulk() might be a better name for such functions.
> 
> The above confusion is also a reason I asked Ryan to try adding a unsplit_page()
> function to fuse back non compound order > 0 pages and free the fused one
> as we are currently doing. But that looks like a pain to implment. Maybe an

Yeah not sure it's worth it either.

> alternative to this FPI_PREPARED is to add FPI_FREE_BULK and loop through all
> subpages if FPI_FREE_BULK is set with
> __free_pages_prepare(page + i, 0, fpi_flags & ~FPI_FREE_BULK) in
> __free_pages_ok().

Hmm, maybe...

> 
> Best Regards,
> Yan, Zi
Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
Posted by Zi Yan 2 weeks, 6 days ago
On 16 Mar 2026, at 12:19, Vlastimil Babka (SUSE) wrote:

> On 3/16/26 17:02, Zi Yan wrote:
>> On 16 Mar 2026, at 11:21, Vlastimil Babka wrote:
>>
>>>> +/*
>>>> + * free_pages_prepare() has already been called for page(s) being freed.
>>>> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
>>>> + * (HWPoison, PageNetpp, bad free page).
>>>> + */
>>>
>>> I'm confused, and reading the v1 thread didn't help either. Where would the
>>> subpages to check come from? AFAICS we start from order-0 pages always.
>>> __free_contig_range calls free_pages_prepare on every page with order 0
>>> unconditionally, so we check every page as an order-0 page. If we then free
>>> the bunch of individually checked pages as a high-order page, there's no
>>> reason to check those subpages again, no? Am I missing something?
>>
>> There are two kinds of order > 0 pages, compound and not compound.
>> free_pages_prepare() checks all tail pages of a compound order > 0 pages too.
>> For non compound ones, free_pages_prepare() only has free_page_is_bad()
>> check on tail ones.
>>
>> So my guess is that the TODO is to check all subpages on a non compound
>> order > 0 one in the same manner. This is based on the assumption that
>
> OK but:
>
> 1) Why put that TODO specifically on FPI_PREPARED definition, which is for
> the case we skip the prepare/check?
> 2) Why add it in this series which AFAICS doesn't handle non-compound
> order>0 anywhere.
> 3) We'd better work on eliminating the non-compound order>0 usages
> altogether, rather than work on supporting them better.

I agreed with you when I first saw this. But after thinking about it again,
the issue might not be directly related to the allocation but to the free path.
Like the patch title says, it is an optimization of freeing contiguous pages.
These physically contiguous pages happen to come from non-compound order>0
allocations, and that is what leads to this optimization.

The problem they want to solve is to speed up page free path by freeing
a group of pages together. They are optimizing for a special situation
where a group of pages that are physically contiguous, so that these pages
can be freed via free_pages(page, order /* > 0 */). If we take away
the allocation of non-compound order>0, like you suggested in 3, we basically
remove the optimization opportunity from them. I am not sure that is what
people want.

To think about the problem broadly, how can we optimize free_page_bulk(),
if that exists? Sort the input pages based on PFNs, so that we can free them in
high orders instead of as individual order-0s. This patch basically says,
hey, the group of pages we are freeing are all contiguous, since that is
how we allocate them, freeing them as a whole is much quicker than freeing
them individually.

>
>> all non compound order > 0 page users use split_page() after the allocation,
>> treat each page individually, and free them back altogether. But I am not
>> sure if this is true for all users allocating non compound order > 0 pages.
>
> Maybe as part of the elimination (point 3 above) we should combine the
> allocation+split so it's never the first without the second anymore.
>
>> And free_pages_prepare_bulk() might be a better name for such functions.
>>
> The above confusion is also a reason I asked Ryan to try adding an unsplit_page()
> function to fuse back non compound order > 0 pages and free the fused one
> as we are currently doing. But that looks like a pain to implement. Maybe an
>
> Yeah not sure it's worth it either.
>
>> alternative to this FPI_PREPARED is to add FPI_FREE_BULK and loop through all
>> subpages if FPI_FREE_BULK is set with
>> __free_pages_prepare(page + i, 0, fpi_flags & ~FPI_FREE_BULK) in
>> __free_pages_ok().
>
> Hmm, maybe...

Let me know if my reasoning above moves your opinion on FPI_FREE_BULK towards
a positive direction. :)


Best Regards,
Yan, Zi
Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
Posted by Vlastimil Babka (SUSE) 2 weeks, 6 days ago
On 3/17/26 16:17, Zi Yan wrote:
> On 16 Mar 2026, at 12:19, Vlastimil Babka (SUSE) wrote:
> 
>> On 3/16/26 17:02, Zi Yan wrote:
>>> On 16 Mar 2026, at 11:21, Vlastimil Babka wrote:
>>>
>>>>> +/*
>>>>> + * free_pages_prepare() has already been called for page(s) being freed.
>>>>> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
>>>>> + * (HWPoison, PageNetpp, bad free page).
>>>>> + */
>>>>
>>>> I'm confused, and reading the v1 thread didn't help either. Where would the
>>>> subpages to check come from? AFAICS we start from order-0 pages always.
>>>> __free_contig_range calls free_pages_prepare on every page with order 0
>>>> unconditionally, so we check every page as an order-0 page. If we then free
>>>> the bunch of individually checked pages as a high-order page, there's no
>>>> reason to check those subpages again, no? Am I missing something?
>>>
>>> There are two kinds of order > 0 pages, compound and not compound.
>>> free_pages_prepare() checks all tail pages of a compound order > 0 pages too.
>>> For non compound ones, free_pages_prepare() only has free_page_is_bad()
>>> check on tail ones.
>>>
>>> So my guess is that the TODO is to check all subpages on a non compound
>>> order > 0 one in the same manner. This is based on the assumption that
>>
>> OK but:
>>
>> 1) Why put that TODO specifically on FPI_PREPARED definition, which is for
>> the case we skip the prepare/check?
>> 2) Why add it in this series which AFAICS doesn't handle non-compound
>> order>0 anywhere.
>> 3) We'd better work on eliminating the non-compound order>0 usages
>> altogether, rather than work on supporting them better.
> 
> I agreed with you when I first saw this. After I think about it again,
> the issue might not be directly related to the allocation but is the free path.
> Like the patch title said, it is an optimization of free contiguous pages.
> These physically contiguous pages happen to come from alloc non-compound order>0
> and this leads to this optimization.

Sure and this use-case doesn't need the TODO to be solved, or am I mistaken?

That TODO seems to be about a hypothetical other use case with order>0
non-compound pages. Because AFAICS the use-cases in this series are not
about order>0 non-compound pages. Maybe they exist for a brief moment
between allocation and split_page() (in vmalloc() case?), but when we are
freeing them, we start with a contiguous series of order-0 pages (refcounted
or not).

So by my definition we are not freeing an order>0 non-compound page. By
"freeing order>0 non-compound page" I mean specifically what ___free_pages()
is handling in the "else if (!head) {" path, which I'd love to get rid of.
That TODO to me seems to be about supporting that case.

> The problem they want to solve is to speed up page free path by freeing
> a group of pages together. They are optimizing for a special situation
> where a group of pages that are physically contiguous, so that these pages
> can be freed via free_pages(page, order /* > 0 */). If we take away

I don't think we want that as that leads to the case I described above. It
assumes the head is refcounted and tails are not. I'd rather not overload it with
a case where we have contiguous order-0 pages where each is refcounted (or
none are). Yeah we can optimize the freeing like this series does, but I'd
not do it via something like "free_pages(page, order /* > 0 */)"

> the allocation of non-compound order>0, like you suggested in 3, we basically

I suggested we'd take it away in the sense of not producing order>0 where
head is refcounted, tails are not, and it's not a compound page. I'd rather
have an API that applies split_page() before and returns it as order-0
refcounted pages, but not the intermediate order>0 non-compound anymore.

> remove the optimization opportunity from them. I am not sure that is what
> people want.
> 
> To think about the problem broadly, how can we optimize free_page_bulk(),
> if that exists? Sort the input pages based on PFNs, so that we can free them in
> high orders instead of as individual order-0s. This patch basically says,
> hey, the group of pages we are freeing are all contiguous, since that is
> how we allocate them, freeing them as a whole is much quicker than freeing
> them individually.

Yes we can have generalized, perhaps stacked support for the cases used by
the converted callers in this series, but not using a generic API that would
try e.g. sorting pfns even when we know they are already sorted. That means:

- given as contiguous range, frozen (patch 3)
- given as contiguous range, not frozen (patch 1)
- probably contiguous, needs checking, given as array of pages (patch 2)

>>
>>> all non compound order > 0 page users use split_page() after the allocation,
>>> treat each page individually, and free them back altogether. But I am not
>>> sure if this is true for all users allocating non compound order > 0 pages.
>>
>> Maybe as part of the elimination (point 3 above) we should combine the
>> allocation+split so it's never the first without the second anymore.

I elaborated on this above.

>>> And free_pages_prepare_bulk() might be a better name for such functions.
>>>
>>> The above confusion is also a reason I asked Ryan to try adding an unsplit_page()
>>> function to fuse back non compound order > 0 pages and free the fused one
>>> as we are currently doing. But that looks like a pain to implement. Maybe an
>>
>> Yeah not sure it's worth it either.
>>
>>> alternative to this FPI_PREPARED is to add FPI_FREE_BULK and loop through all
>>> subpages if FPI_FREE_BULK is set with
>>> __free_pages_prepare(page + i, 0, fpi_flags & ~FPI_FREE_BULK) in
>>> __free_pages_ok().
>>
>> Hmm, maybe...
> 
> Let me know if my reasoning above moves your opinion on FPI_FREE_BULK towards
> a positive direction. :)

If you can make it work to support the three cases above, without doing
unnecessary work, and with no "free_pages(page, order /* > 0 */)"-like API?

> Best Regards,
> Yan, Zi
Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
Posted by David Hildenbrand (Arm) 2 weeks, 4 days ago
>> the allocation of non-compound order>0, like you suggested in 3, we basically
> 
> I suggested we'd take it away in the sense of not producing order>0 where
> head is refcounted, tails are not, and it's not a compound page. I'd rather
> have an API that applies split_page() before and returns it as order-0
> refcounted pages, but not the intermediate order>0 non-compound anymore.

Are you talking about external API or internal API?

Regarding external interface: I think the crucial part is that an
external interface (free_contig_range) should always get a range of
individual order-0 pages: neither compound nor non-compound order > 0.

The individual order-0 pages can either be frozen or refcounted
(depending on the interface).


Regarding internal interface: To me that implies that FPI_PREPARED will
never ever have to do any kind of "subpage" (page) free_pages_prepare()
checks. It must already have been performed on all order-0 pages.

So the TODO should indeed be dropped.

I'm not sure I understood whether you think using the
__free_frozen_pages() with order > 0 is okay, or whether we need a
different (internal) interface.


-- 
Cheers,

David
Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
Posted by Vlastimil Babka (SUSE) 2 weeks, 3 days ago
On 3/19/26 23:07, David Hildenbrand (Arm) wrote:
>>> the allocation of non-compound order>0, like you suggested in 3, we basically
>> 
>> I suggested we'd take it away in the sense of not producing order>0 where
>> head is refcounted, tails are not, and it's not a compound page. I'd rather
>> have an API that applies split_page() before and returns it as order-0
>> refcounted pages, but not the intermediate order>0 non-compound anymore.
> 
> Are you talking about external API or internal API?

In this case of alloc+split, external, and that would make sense to me.

In case of freeing, the current free_pages(order>0) is also external and I
would prefer not to augment it for this free_contig_range() usecase.

> Regarding external interface: I think the crucial part is that an
> external interface (free_contig_range) should always get a range of
> individual order-0 pages: neither compound nor non-compound order > 0.

Ack.

> The individual order-0 pages can either be frozen or refcounted
> (depending on the interface).

Ack.

> Regarding internal interface: To me that implies that FPI_PREPARED will
> never ever have to do any kind of "subpage" (page) free_pages_prepare()
> checks. It must already have been performed on all order-0 pages.
> 
> So the TODO should indeed be dropped.

Agreed. But maybe I misunderstood Zi, so that's why I tried to add so much
detail about what I mean by what.

> I'm not sure I understood whether you think using the
> __free_frozen_pages() with order > 0 is okay, or whether we need a
> different (internal) interface.

I think this is fine. But I agree with you above that this assumes
FPI_PREPARED and will not have to deal with subpages.
Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
Posted by Zi Yan 2 weeks, 3 days ago
On 20 Mar 2026, at 4:20, Vlastimil Babka (SUSE) wrote:

> On 3/19/26 23:07, David Hildenbrand (Arm) wrote:
>>>> the allocation of non-compound order>0, like you suggested in 3, we basically
>>>
>>> I suggested we'd take it away in the sense of not producing order>0 where
>>> head is refcounted, tails are not, and it's not a compound page. I'd rather
>>> have an API that applies split_page() before and returns it as order-0
>>> refcounted pages, but not the intermediate order>0 non-compound anymore.
>>
>> Are you talking about external API or internal API?
>
> In this case of alloc+split, external, and that would make sense to me.
>
> In case of freeing, the current free_pages(order>0) is also external and I
> would prefer not to augment it for this free_contig_range() usecase.
>
>> Regarding external interface: I think the crucial part is that an
>> external interface (free_contig_range) should always get a range of
>> individual order-0 pages: neither compound nor non-compound order > 0.
>
> Ack.
>
>> The individual order-0 pages can either be frozen or refcounted
>> (depending on the interface).
>
> Ack.
>
>> Regarding internal interface: To me that implies that FPI_PREPARED will
>> never ever have to do any kind of "subpage" (page) free_pages_prepare()
>> checks. It must already have been performed on all order-0 pages.
>>
>> So the TODO should indeed be dropped.
>
> Agreed. But maybe I misunderstood Zi, so that's why I tried to add so much
> detail about what I mean by what.

Ack on dropping the TODO.

I was discussing whether we can have a better interface for freeing these
contiguous pages instead of FPI_PREPARED, since FPI_PREPARED adds another
form of freeing pages, where free_pages_prepare() has already been called on
all incoming pages. It might be a separate topic. I will think about it more
and come back later. Sorry for the confusion.

>
>> I'm not sure I understood whether you think using the
>> __free_frozen_pages() with order > 0 is okay, or whether we need a
>> different (internal) interface.
>
> I think this is fine. But I agree with you above that this assumes
> FPI_PREPARED and will not have to deal with subpages.


Best Regards,
Yan, Zi