[PATCH v4 0/3] mm: Free contiguous order-0 pages efficiently

Muhammad Usama Anjum posted 3 patches 6 days, 1 hour ago
There is a newer version of this series
Hi All,

A recent change to vmalloc caused some performance benchmark regressions (see
[1]). I'm attempting to fix that (and at the same time significantly improve
beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.

At the same time I observed that free_contig_range() was essentially doing
the same thing as vfree(), so I've fixed it there too. While at it, I've
optimized __free_contig_frozen_range() as well.

We check that the contiguous range falls within the same memory section. If
memory sections aren't enabled, the if conditions get optimized out by the
compiler, as memdesc_section() returns 0. See num_pages_contiguous() for more
details.
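
To illustrate the idea (a hedged userspace sketch, not the kernel code):
freeing a contiguous pfn range in one batch per memory section, instead of
one call per order-0 page. PAGES_PER_SECTION, section_of() and free_batch()
are stand-ins for the kernel's section size, memdesc_section() and the
batched-free path; only the splitting logic is the point.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_SECTION 8   /* stand-in for the kernel's section size */

static size_t batches;        /* how many batched frees we issued */

/* Stand-in for memdesc_section(): which section a pfn falls in. */
static size_t section_of(size_t pfn)
{
	return pfn / PAGES_PER_SECTION;
}

/* Stand-in for a batched free of pages [pfn, pfn + nr). */
static void free_batch(size_t pfn, size_t nr)
{
	(void)pfn;
	(void)nr;
	batches++;
}

/*
 * Free a contiguous range of order-0 pages, issuing one batched free
 * per section instead of one call per page.
 */
static void free_contig_range_sketch(size_t pfn, size_t nr)
{
	while (nr) {
		/* First pfn past the current section. */
		size_t end = (section_of(pfn) + 1) * PAGES_PER_SECTION;
		size_t chunk = end - pfn;

		if (chunk > nr)
			chunk = nr;
		free_batch(pfn, chunk);
		pfn += chunk;
		nr -= chunk;
	}
}
```

For a range of 10 pages starting at pfn 6 with 8-page sections, this issues
two batched frees (2 pages in section 0, 8 pages in section 1) rather than
ten per-page frees.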

[1] https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com

v6.18       - before the patch causing the regression was added
mm-new      - current latest code
this series - mm-new with this patch series applied

(>0 is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement)

v6.18 vs mm-new
+-----------------+----------------------------------------------------------+-------------------+-------------+
| Benchmark       | Result Class                                             |   v6.18    (base) |    mm-new   |
+=================+==========================================================+===================+=============+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |         653643.33 | (R) -50.92% |
|                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |         366167.33 | (R) -11.96% |
|                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |         489484.00 | (R) -35.21% |
|                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |        1011250.33 | (R) -36.45% |
|                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |        1086812.33 | (R) -31.83% |
|                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |         657940.00 | (R) -38.62% |
|                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |         765422.00 | (R) -24.84% |
|                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |        2468585.00 | (R) -37.83% |
|                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |        2815758.33 | (R) -26.32% |
|                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |        4851969.00 | (R) -37.76% |
|                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |        4496257.33 | (R) -31.15% |
|                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |         570605.00 |      -8.97% |
|                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         500866.00 |      -5.88% |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         499733.00 |      -6.95% |
|                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |        5266237.67 | (R) -40.19% |
|                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |         490284.00 |      -2.10% |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |         850986.33 | (R) -48.03% |
|                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |        2712106.00 | (R) -40.48% |
|                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |         111151.33 |       3.52% |
+-----------------+----------------------------------------------------------+-------------------+-------------+

v6.18 vs mm-new with patches
+-----------------+----------------------------------------------------------+-------------------+--------------+
| Benchmark       | Result Class                                             |   v6.18 (base)    |  this series |
+=================+==========================================================+===================+==============+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |         653643.33 |      -14.02% |
|                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |         366167.33 |       -7.23% |
|                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |         489484.00 |       -1.57% |
|                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |        1011250.33 |        1.57% |
|                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |        1086812.33 |   (I) 15.75% |
|                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |         657940.00 |    (I) 9.05% |
|                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |         765422.00 |   (I) 38.45% |
|                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |        2468585.00 |   (I) 12.56% |
|                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |        2815758.33 |   (I) 38.61% |
|                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |        4851969.00 |   (I) 13.43% |
|                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |        4496257.33 |   (I) 49.21% |
|                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |         570605.00 |       -8.47% |
|                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         500866.00 |       -8.17% |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         499733.00 |       -5.54% |
|                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |        5266237.67 |    (I) 4.63% |
|                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |         490284.00 |        1.53% |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |         850986.33 |       -0.00% |
|                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |        2712106.00 |        1.22% |
|                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |         111151.33 |    (I) 4.98% |
+-----------------+----------------------------------------------------------+-------------------+--------------+

mm-new vs vmalloc_2 results are in patch 2/3.

So this series mitigates the regression on average; relative to the v6.18
baseline the results range from -14% to +49%.

Thanks,
Muhammad Usama Anjum

---
Changes since v3: (summary)
- Introduce __free_contig_range_common() in the first patch and use it in
  the 3rd patch as well
- Cosmetic changes related to comments and kerneldoc

Changes since v2: (summary)
- Patch 1 and 3: Rework the loop to check for memory sections
- Patch 2: Rework by removing the BUG_ON() and adding the helper free_pages_bulk()

Changes since v1:
- Update description
- Rebase on mm-new and rerun benchmarks/tests
- Patch 1: move FPI_PREPARED check and add todo
- Patch 2: Rework to cater for newer changes in vfree()
- New Patch 3: optimizes __free_contig_frozen_range()

Muhammad Usama Anjum (1):
  mm/page_alloc: Optimize __free_contig_frozen_range()

Ryan Roberts (2):
  mm/page_alloc: Optimize free_contig_range()
  vmalloc: Optimize vfree

 include/linux/gfp.h |   4 ++
 mm/page_alloc.c     | 144 ++++++++++++++++++++++++++++++++++++++++++--
 mm/vmalloc.c        |  16 ++---
 3 files changed, 149 insertions(+), 15 deletions(-)

-- 
2.47.3
Re: [PATCH v4 0/3] mm: Free contiguous order-0 pages efficiently
Posted by Andrew Morton 5 days, 18 hours ago
On Fri, 27 Mar 2026 12:57:12 +0000 Muhammad Usama Anjum <usama.anjum@arm.com> wrote:

> A recent change to vmalloc caused some performance benchmark regressions (see
> [1]). I'm attempting to fix that (and at the same time significantly improve
> beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
> 
> At the same time I observed that free_contig_range() was essentially doing the
> same thing as vfree() so I've fixed it there too. While at it, optimize the
> __free_contig_frozen_range() as well.
> 
> Check that the contiguous range falls in the same section. If they aren't enabled,
> the if conditions get optimized out by the compiler as memdesc_section() returns 0.
> See num_pages_contiguous() for more details about it.

Thanks.  I'm seeing impressive speedups for microbenchmarks.  The
speedup in [3/3] may be a bit more real-worldy.

Do you have a feeling for how much difference these changes will make
for any real-world workload?

Also, AI review said things:
	https://sashiko.dev/#/patchset/20260327125720.2270651-1-usama.anjum@arm.com

The can_free one (at least) seems legit.  I suggest that can_free be
made local to that for() loop - this would clear things up a bit.
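
For illustration, the suggested cleanup would look roughly like this (the
surrounding code is hypothetical; page_ok() and free_one() are stand-ins,
and only the scoping of can_free is the point):

```c
#include <assert.h>
#include <stdbool.h>

static int freed;

/* Hypothetical predicate: can this element be freed? */
static bool page_ok(int i)
{
	return i % 2 == 0;
}

static void free_one(int i)
{
	(void)i;
	freed++;
}

static void walk(int n)
{
	for (int i = 0; i < n; i++) {
		/*
		 * can_free is local to the loop body, so each iteration
		 * starts from a clean state and the flag cannot leak
		 * across iterations.
		 */
		bool can_free = page_ok(i);

		if (can_free)
			free_one(i);
	}
}
```

Declaring the flag inside the loop makes its lifetime match its use, which
is what clears up the confusion the review flagged.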
Re: [PATCH v4 0/3] mm: Free contiguous order-0 pages efficiently
Posted by David Hildenbrand (Arm) 2 days, 23 hours ago
On 3/27/26 20:42, Andrew Morton wrote:
> On Fri, 27 Mar 2026 12:57:12 +0000 Muhammad Usama Anjum <usama.anjum@arm.com> wrote:
> 
>> A recent change to vmalloc caused some performance benchmark regressions (see
>> [1]). I'm attempting to fix that (and at the same time significantly improve
>> beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
>>
>> At the same time I observed that free_contig_range() was essentially doing the
>> same thing as vfree() so I've fixed it there too. While at it, optimize the
>> __free_contig_frozen_range() as well.
>>
>> Check that the contiguous range falls in the same section. If they aren't enabled,
>> the if conditions get optimized out by the compiler as memdesc_section() returns 0.
>> See num_pages_contiguous() for more details about it.
> 
> Thanks.  I'm seeing impressive speedups for microbenchmarks.  The
> speedup in [3/3] may be a bit more real-worldy.

This should speedup virtio-mem memory hotplug in cases where we end up
calling free_contig_range() on MAX_PAGE_ORDER ranges to release
hotplugged memory to the buddy!

-- 
Cheers,

David
Re: [PATCH v4 0/3] mm: Free contiguous order-0 pages efficiently
Posted by Muhammad Usama Anjum 3 days, 2 hours ago
Hi Andrew,

Thank you for reviewing.

On 27/03/2026 7:42 pm, Andrew Morton wrote:
> On Fri, 27 Mar 2026 12:57:12 +0000 Muhammad Usama Anjum <usama.anjum@arm.com> wrote:
> 
>> A recent change to vmalloc caused some performance benchmark regressions (see
>> [1]). I'm attempting to fix that (and at the same time significantly improve
>> beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
>>
>> At the same time I observed that free_contig_range() was essentially doing the
>> same thing as vfree() so I've fixed it there too. While at it, optimize the
>> __free_contig_frozen_range() as well.
>>
>> Check that the contiguous range falls in the same section. If they aren't enabled,
>> the if conditions get optimized out by the compiler as memdesc_section() returns 0.
>> See num_pages_contiguous() for more details about it.
> 
> Thanks.  I'm seeing impressive speedups for microbenchmarks.  The
> speedup in [3/3] may be a bit more real-worldy.
Yeah, we get a natural speedup: vmalloc() allocates higher-order page
blocks, and vfree() now frees them as higher-order blocks instead of as
individual order-0 pages.
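
To make the arithmetic concrete (a hedged userspace sketch, not kernel
code): freeing N order-0 pages individually costs N calls into the
allocator, while freeing the same range as order-k blocks costs roughly
N / 2^k calls.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of free calls needed to release `pages` order-0 pages when
 * freeing in blocks of the given order (2^order pages per block).
 * Rounds up to cover a partial tail block.
 */
static size_t free_calls(size_t pages, unsigned int order)
{
	size_t block = (size_t)1 << order;

	return (pages + block - 1) / block;
}
```

For example, releasing 512 pages one order-0 page at a time takes 512 calls,
while releasing them as a single order-9 block takes one.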

> 
> Do you have a feeling for how much difference these changes will make
> for any real-world workload?
I haven't run any real-world benchmarks. Based on these microbenchmark
results, I'd expect real-world workloads to see speedups as well.

> 
> Also, AI review said things:
> 	https://sashiko.dev/#/patchset/20260327125720.2270651-1-usama.anjum@arm.com
> 
> The can_free one (at least) seems legit.  I suggest that can_free be
> made local to that for() loop - this would clear things up a bit.
Other than some unrelated points in the AI review, this is the only real
problem it highlighted. I'll fix it.

Thanks,
Usama