Apply the same batch-freeing optimization from free_contig_range() to the
frozen page path. The previous __free_contig_frozen_range() freed each
order-0 page individually via free_frozen_pages(), which is slow for the
same reason the old free_contig_range() was: each page goes to the
order-0 pcp list rather than being coalesced into higher-order blocks.

Rewrite __free_contig_frozen_range() to call free_pages_prepare() for
each order-0 page, then batch the prepared pages into the largest
possible power-of-2 aligned chunks via free_prepared_contig_range().
If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
deliberately not freed; it should not be returned to the allocator.

I've tested CMA through debugfs. The test allocates 16384 pages per
allocation for several iterations. There is a 3.5x improvement.

Before: 1406 usec per iteration
After: 402 usec per iteration

Before:

70.89% 0.69% cma [kernel.kallsyms] [.] free_contig_frozen_range
|
|--70.20%--free_contig_frozen_range
| |
| |--46.41%--__free_frozen_pages
| | |
| | --36.18%--free_frozen_page_commit
| | |
| | --29.63%--_raw_spin_unlock_irqrestore
| |
| |--8.76%--_raw_spin_trylock
| |
| |--7.03%--__preempt_count_dec_and_test
| |
| |--4.57%--_raw_spin_unlock
| |
| |--1.96%--__get_pfnblock_flags_mask.isra.0
| |
| --1.15%--free_frozen_page_commit
|
--0.69%--el0t_64_sync

After:

23.57% 0.00% cma [kernel.kallsyms] [.] free_contig_frozen_range
|
---free_contig_frozen_range
|
|--20.45%--__free_contig_frozen_range
| |
| |--17.77%--free_pages_prepare
| |
| --0.72%--free_prepared_contig_range
| |
| --0.55%--__free_frozen_pages
|
--3.12%--free_pages_prepare

Suggested-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
---
Changes since v3:
- Use newly introduced __free_contig_range_common() as the pattern was
very similar to __free_contig_range()

Changes since v2:
- Rework the loop to check for memory sections just like __free_contig_range()
- Didn't add reviewed-by tags because of rework
---
mm/page_alloc.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 64be8a9019dca..110e912fa785e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7059,8 +7059,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
 
 static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
 {
-	for (; nr_pages--; pfn++)
-		free_frozen_pages(pfn_to_page(pfn), 0);
+	__free_contig_range_common(pfn, nr_pages, true);
 }
 
 /**
--
2.47.3
On 3/27/26 13:57, Muhammad Usama Anjum wrote:
> Apply the same batch-freeing optimization from free_contig_range() to the
> frozen page path. The previous __free_contig_frozen_range() freed each
> order-0 page individually via free_frozen_pages(), which is slow for the
> same reason the old free_contig_range() was: each page goes to the
> order-0 pcp list rather than being coalesced into higher-order blocks.
>
> Rewrite __free_contig_frozen_range() to call free_pages_prepare() for
> each order-0 page, then batch the prepared pages into the largest
> possible power-of-2 aligned chunks via free_prepared_contig_range().
> If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
> deliberately not freed; it should not be returned to the allocator.
>
> I've tested CMA through debugfs. The test allocates 16384 pages per
> allocation for several iterations. There is 3.5x improvement.
>
> Before: 1406 usec per iteration
> After: 402 usec per iteration
>
> Before:
>
> 70.89% 0.69% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> |--70.20%--free_contig_frozen_range
> | |
> | |--46.41%--__free_frozen_pages
> | | |
> | | --36.18%--free_frozen_page_commit
> | | |
> | | --29.63%--_raw_spin_unlock_irqrestore
> | |
> | |--8.76%--_raw_spin_trylock
> | |
> | |--7.03%--__preempt_count_dec_and_test
> | |
> | |--4.57%--_raw_spin_unlock
> | |
> | |--1.96%--__get_pfnblock_flags_mask.isra.0
> | |
> | --1.15%--free_frozen_page_commit
> |
> --0.69%--el0t_64_sync
>
> After:
>
> 23.57% 0.00% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> ---free_contig_frozen_range
> |
> |--20.45%--__free_contig_frozen_range
> | |
> | |--17.77%--free_pages_prepare
> | |
> | --0.72%--free_prepared_contig_range
> | |
> | --0.55%--__free_frozen_pages
> |
> --3.12%--free_pages_prepare
>
> Suggested-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---
> Changes since v3:
> - Use newly introduced __free_contig_range_common() as the pattern was
> very similar to __free_contig_range()
>
> Changes since v2:
> - Rework the loop to check for memory sections just like __free_contig_range()
> - Didn't add reviewed-by tags because of rework
> ---
> mm/page_alloc.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 64be8a9019dca..110e912fa785e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7059,8 +7059,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
>
> static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
> {
> - for (; nr_pages--; pfn++)
> - free_frozen_pages(pfn_to_page(pfn), 0);
> + __free_contig_range_common(pfn, nr_pages, true);

Ah, might want to add a comment here as well

/* is_frozen= */ true

--
Cheers,
David
On 3/27/26 13:57, Muhammad Usama Anjum wrote:
> Apply the same batch-freeing optimization from free_contig_range() to the
> frozen page path. The previous __free_contig_frozen_range() freed each
> order-0 page individually via free_frozen_pages(), which is slow for the
> same reason the old free_contig_range() was: each page goes to the
> order-0 pcp list rather than being coalesced into higher-order blocks.
>
> Rewrite __free_contig_frozen_range() to call free_pages_prepare() for
> each order-0 page, then batch the prepared pages into the largest
> possible power-of-2 aligned chunks via free_prepared_contig_range().
> If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
> deliberately not freed; it should not be returned to the allocator.
>
> I've tested CMA through debugfs. The test allocates 16384 pages per
> allocation for several iterations. There is 3.5x improvement.
>
> Before: 1406 usec per iteration
> After: 402 usec per iteration
>
> Before:
>
> 70.89% 0.69% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> |--70.20%--free_contig_frozen_range
> | |
> | |--46.41%--__free_frozen_pages
> | | |
> | | --36.18%--free_frozen_page_commit
> | | |
> | | --29.63%--_raw_spin_unlock_irqrestore
> | |
> | |--8.76%--_raw_spin_trylock
> | |
> | |--7.03%--__preempt_count_dec_and_test
> | |
> | |--4.57%--_raw_spin_unlock
> | |
> | |--1.96%--__get_pfnblock_flags_mask.isra.0
> | |
> | --1.15%--free_frozen_page_commit
> |
> --0.69%--el0t_64_sync
>
> After:
>
> 23.57% 0.00% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> ---free_contig_frozen_range
> |
> |--20.45%--__free_contig_frozen_range
> | |
> | |--17.77%--free_pages_prepare
> | |
> | --0.72%--free_prepared_contig_range
> | |
> | --0.55%--__free_frozen_pages
> |
> --3.12%--free_pages_prepare
>
> Suggested-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
> Changes since v3:
> - Use newly introduced __free_contig_range_common() as the pattern was
> very similar to __free_contig_range()
>
> Changes since v2:
> - Rework the loop to check for memory sections just like __free_contig_range()
> - Didn't add reviewed-by tags because of rework
> ---
> mm/page_alloc.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 64be8a9019dca..110e912fa785e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7059,8 +7059,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
>
> static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
> {
> - for (; nr_pages--; pfn++)
> - free_frozen_pages(pfn_to_page(pfn), 0);
> + __free_contig_range_common(pfn, nr_pages, true);
> }
>
> /**
On 3/27/26 13:57, Muhammad Usama Anjum wrote:
> Apply the same batch-freeing optimization from free_contig_range() to the
> frozen page path. The previous __free_contig_frozen_range() freed each
> order-0 page individually via free_frozen_pages(), which is slow for the
> same reason the old free_contig_range() was: each page goes to the
> order-0 pcp list rather than being coalesced into higher-order blocks.
[...]
> Suggested-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,
David
On 27 Mar 2026, at 8:57, Muhammad Usama Anjum wrote:
> Apply the same batch-freeing optimization from free_contig_range() to the
> frozen page path. The previous __free_contig_frozen_range() freed each
> order-0 page individually via free_frozen_pages(), which is slow for the
> same reason the old free_contig_range() was: each page goes to the
> order-0 pcp list rather than being coalesced into higher-order blocks.
>
> Rewrite __free_contig_frozen_range() to call free_pages_prepare() for
> each order-0 page, then batch the prepared pages into the largest
> possible power-of-2 aligned chunks via free_prepared_contig_range().
> If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
> deliberately not freed; it should not be returned to the allocator.
>
> I've tested CMA through debugfs. The test allocates 16384 pages per
> allocation for several iterations. There is 3.5x improvement.
>
> Before: 1406 usec per iteration
> After: 402 usec per iteration
>
> Before:
>
> 70.89% 0.69% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> |--70.20%--free_contig_frozen_range
> | |
> | |--46.41%--__free_frozen_pages
> | | |
> | | --36.18%--free_frozen_page_commit
> | | |
> | | --29.63%--_raw_spin_unlock_irqrestore
> | |
> | |--8.76%--_raw_spin_trylock
> | |
> | |--7.03%--__preempt_count_dec_and_test
> | |
> | |--4.57%--_raw_spin_unlock
> | |
> | |--1.96%--__get_pfnblock_flags_mask.isra.0
> | |
> | --1.15%--free_frozen_page_commit
> |
> --0.69%--el0t_64_sync
>
> After:
>
> 23.57% 0.00% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> ---free_contig_frozen_range
> |
> |--20.45%--__free_contig_frozen_range
> | |
> | |--17.77%--free_pages_prepare
> | |
> | --0.72%--free_prepared_contig_range
> | |
> | --0.55%--__free_frozen_pages
> |
> --3.12%--free_pages_prepare
>
> Suggested-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---
> Changes since v3:
> - Use newly introduced __free_contig_range_common() as the pattern was
> very similar to __free_contig_range()
>
> Changes since v2:
> - Rework the loop to check for memory sections just like __free_contig_range()
> - Didn't add reviewed-by tags because of rework
> ---
> mm/page_alloc.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 64be8a9019dca..110e912fa785e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7059,8 +7059,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
>
> static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
> {
> - for (; nr_pages--; pfn++)
> - free_frozen_pages(pfn_to_page(pfn), 0);
> + __free_contig_range_common(pfn, nr_pages, true);

__free_contig_range_common(pfn, nr_pages, /* is_frozen= */ true);

is better.

Otherwise,

Reviewed-by: Zi Yan <ziy@nvidia.com>
> }
>
> /**
> --
> 2.47.3

Best Regards,
Yan, Zi