Hi All,

A recent change to vmalloc caused some performance benchmark regressions (see
[1]). I'm attempting to fix that (and at the same time significantly improve
beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.

At the same time I observed that free_contig_range() was essentially doing the
same thing as vfree() so I've fixed it there too. While at it, optimize the
__free_contig_frozen_range() as well.

Check that the contiguous range falls in the same section. If they aren't enabled,
the if conditions get optimized out by the compiler as memdesc_section() returns 0.
See num_pages_contiguous() for more details about it.

[1] https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com

v6.18       - Before the patch causing regression was added
mm-new      - current latest code
this series - v2 series of these patches

(>0 is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement)

v6.18 vs mm-new
+-----------------+----------------------------------------------------------+-------------------+-------------+
| Benchmark       | Result Class                                             | v6.18 (base)      | mm-new      |
+=================+==========================================================+===================+=============+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |         653643.33 | (R) -50.92% |
|                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |         366167.33 | (R) -11.96% |
|                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |         489484.00 | (R) -35.21% |
|                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |        1011250.33 | (R) -36.45% |
|                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |        1086812.33 | (R) -31.83% |
|                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |         657940.00 | (R) -38.62% |
|                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |         765422.00 | (R) -24.84% |
|                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |        2468585.00 | (R) -37.83% |
|                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |        2815758.33 | (R) -26.32% |
|                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |        4851969.00 | (R) -37.76% |
|                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |        4496257.33 | (R) -31.15% |
|                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |         570605.00 |      -8.97% |
|                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         500866.00 |      -5.88% |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         499733.00 |      -6.95% |
|                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |        5266237.67 | (R) -40.19% |
|                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |         490284.00 |      -2.10% |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |         850986.33 | (R) -48.03% |
|                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |        2712106.00 | (R) -40.48% |
|                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |         111151.33 |       3.52% |
+-----------------+----------------------------------------------------------+-------------------+-------------+

v6.18 vs mm-new with patches
+-----------------+----------------------------------------------------------+-------------------+--------------+
| Benchmark       | Result Class                                             | v6.18 (base)      | this series  |
+=================+==========================================================+===================+==============+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |         653643.33 |      -14.02% |
|                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |         366167.33 |       -7.23% |
|                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |         489484.00 |       -1.57% |
|                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |        1011250.33 |        1.57% |
|                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |        1086812.33 |   (I) 15.75% |
|                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |         657940.00 |    (I) 9.05% |
|                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |         765422.00 |   (I) 38.45% |
|                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |        2468585.00 |   (I) 12.56% |
|                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |        2815758.33 |   (I) 38.61% |
|                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |        4851969.00 |   (I) 13.43% |
|                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |        4496257.33 |   (I) 49.21% |
|                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |         570605.00 |       -8.47% |
|                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         500866.00 |       -8.17% |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         499733.00 |       -5.54% |
|                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |        5266237.67 |    (I) 4.63% |
|                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |         490284.00 |        1.53% |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |         850986.33 |       -0.00% |
|                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |        2712106.00 |        1.22% |
|                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |         111151.33 |    (I) 4.98% |
+-----------------+----------------------------------------------------------+-------------------+--------------+

mm-new vs vmalloc_2 results are in patch 2/3.

So, on average, this series mitigates the regression: compared with v6.18,
the results range from -14% to 49%.

Thanks,
Muhammad Usama Anjum

---
Changes since v3: (summary)
- Introduce __free_contig_range_common() in first patch and use it in 3rd
  patch as well
- Cosmetic changes related to comments and kerneldoc

Changes since v2: (summary)
- Patch 1 and 3: Rework the loop to check for memory sections
- Patch 2: Rework by removing the BUG on and add helper free_pages_bulk()

Changes since v1:
- Update description
- Rebase on mm-new and rerun benchmarks/tests
- Patch 1: move FPI_PREPARED check and add todo
- Patch 2: Rework catering newer changes in vfree()
- New Patch 3: optimizes __free_contig_frozen_range()

Muhammad Usama Anjum (1):
  mm/page_alloc: Optimize __free_contig_frozen_range()

Ryan Roberts (2):
  mm/page_alloc: Optimize free_contig_range()
  vmalloc: Optimize vfree

 include/linux/gfp.h |   4 ++
 mm/page_alloc.c     | 144 ++++++++++++++++++++++++++++++++++++++++++--
 mm/vmalloc.c        |  16 ++---
 3 files changed, 149 insertions(+), 15 deletions(-)

--
2.47.3
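
The cover letter's idea of batching contiguous order-0 pages, with a check that a batch stays inside one memory section, can be pictured with a small userspace sketch. This is not code from the series: it works on plain page frame numbers instead of struct page, and PAGES_PER_SECTION_SKETCH, pfn_section(), contig_run_len() and count_free_batches() are hypothetical names chosen for illustration (pfn_section() merely stands in for what memdesc_section() reports).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical section size for illustration only: 2^15 pages per section. */
#define PAGES_PER_SECTION_SKETCH (1UL << 15)

/* Section index of a page frame number (illustrative stand-in). */
static unsigned long pfn_section(unsigned long pfn)
{
	return pfn / PAGES_PER_SECTION_SKETCH;
}

/*
 * Return how many entries, starting at pfns[0], form one physically
 * contiguous run that does not cross a section boundary.
 */
static size_t contig_run_len(const unsigned long *pfns, size_t nr)
{
	size_t i;

	assert(pfns && nr > 0);
	for (i = 1; i < nr; i++) {
		/* Stop at the first gap in the frame numbers. */
		if (pfns[i] != pfns[0] + i)
			break;
		/* Stop when the run would enter a different section. */
		if (pfn_section(pfns[i]) != pfn_section(pfns[0]))
			break;
	}
	return i;
}

/* Count how many batched "free" calls an array of pages would need. */
static size_t count_free_batches(const unsigned long *pfns, size_t nr)
{
	size_t batches = 0;

	while (nr) {
		size_t run = contig_run_len(pfns, nr);

		/* A real implementation would free the whole run here. */
		pfns += run;
		nr -= run;
		batches++;
	}
	return batches;
}
```

With this sketch, the frames {100, 101, 102, 200, 201} need two batches instead of five single-page frees, and a run that straddles a section boundary is split at that boundary.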
On Fri, 27 Mar 2026 12:57:12 +0000 Muhammad Usama Anjum <usama.anjum@arm.com> wrote:

> A recent change to vmalloc caused some performance benchmark regressions (see
> [1]). I'm attempting to fix that (and at the same time significantly improve
> beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
>
> At the same time I observed that free_contig_range() was essentially doing the
> same thing as vfree() so I've fixed it there too. While at it, optimize the
> __free_contig_frozen_range() as well.
>
> Check that the contiguous range falls in the same section. If they aren't enabled,
> the if conditions get optimized out by the compiler as memdesc_section() returns 0.
> See num_pages_contiguous() for more details about it.

Thanks. I'm seeing impressive speedups for microbenchmarks. The
speedup in [3/3] may be a bit more real-worldy.

Do you have a feeling for how much difference these changes will make
for any real-world workload?

Also, AI review said things:

	https://sashiko.dev/#/patchset/20260327125720.2270651-1-usama.anjum@arm.com

The can_free one (at least) seems legit. I suggest that can_free be
made local to that for() loop - this would clear things up a bit.
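
The patch code that the can_free remark refers to is not quoted in this thread, so the following is only a hypothetical illustration of the general point: a flag declared inside the for() loop gets a fresh value on every iteration and cannot carry a stale result over from the previous page. struct sketch_page and count_freeable() are invented for this sketch and are not the kernel's types or helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page; not the kernel's struct page. */
struct sketch_page {
	int refcount;
};

/*
 * Drop one reference on each page and count the pages whose refcount
 * reached zero as a result.
 */
static size_t count_freeable(struct sketch_page *pages, size_t nr)
{
	size_t freed = 0;

	assert(pages || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		/*
		 * Declared inside the loop body: re-initialised for every
		 * page, so a "true" from page i - 1 can never leak into
		 * the decision for page i.
		 */
		bool can_free = false;

		if (pages[i].refcount > 0 && --pages[i].refcount == 0)
			can_free = true;

		if (can_free)
			freed++;
	}
	return freed;
}
```

Had can_free been declared once at function scope and only ever set to true, the second page in a sequence with refcounts {1, 2, 0, 1} would be counted as freeable by mistake. With the loop-local declaration the function returns 2 for that input.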
On 3/27/26 20:42, Andrew Morton wrote:
> On Fri, 27 Mar 2026 12:57:12 +0000 Muhammad Usama Anjum <usama.anjum@arm.com> wrote:
>
>> A recent change to vmalloc caused some performance benchmark regressions (see
>> [1]). I'm attempting to fix that (and at the same time significantly improve
>> beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
>>
>> At the same time I observed that free_contig_range() was essentially doing the
>> same thing as vfree() so I've fixed it there too. While at it, optimize the
>> __free_contig_frozen_range() as well.
>>
>> Check that the contiguous range falls in the same section. If they aren't enabled,
>> the if conditions get optimized out by the compiler as memdesc_section() returns 0.
>> See num_pages_contiguous() for more details about it.
>
> Thanks. I'm seeing impressive speedups for microbenchmarks. The
> speedup in [3/3] may be a bit more real-worldy.

This should speed up virtio-mem memory hotplug in cases where we end up
calling free_contig_range() on MAX_PAGE_ORDER ranges to release
hotplugged memory to the buddy!

--
Cheers,

David
Hi Andrew,

Thank you for reviewing.

On 27/03/2026 7:42 pm, Andrew Morton wrote:
> On Fri, 27 Mar 2026 12:57:12 +0000 Muhammad Usama Anjum <usama.anjum@arm.com> wrote:
>
>> A recent change to vmalloc caused some performance benchmark regressions (see
>> [1]). I'm attempting to fix that (and at the same time significantly improve
>> beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
>>
>> At the same time I observed that free_contig_range() was essentially doing the
>> same thing as vfree() so I've fixed it there too. While at it, optimize the
>> __free_contig_frozen_range() as well.
>>
>> Check that the contiguous range falls in the same section. If they aren't enabled,
>> the if conditions get optimized out by the compiler as memdesc_section() returns 0.
>> See num_pages_contiguous() for more details about it.
>
> Thanks. I'm seeing impressive speedups for microbenchmarks. The
> speedup in [3/3] may be a bit more real-worldy.

Yeah, we get a natural speedup because vmalloc() allocates higher-order
page blocks and vfree() now frees them as higher-order blocks instead of
order-0 blocks.

>
> Do you have a feeling for how much difference these changes will make
> for any real-world workload?

I've not run any real-world benchmarks. Judging by these benchmarks, I
expect we'll see speedups.

>
> Also, AI review said things:
> https://sashiko.dev/#/patchset/20260327125720.2270651-1-usama.anjum@arm.com
>
> The can_free one (at least) seems legit. I suggest that can_free be
> made local to that for() loop - this would clear things up a bit.

Other than unrelated things in the AI's review, this is the only real
problem highlighted. I'll fix it.

Thanks,
Usama
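
As a rough picture of what "freeing as higher-order blocks instead of order-0 blocks" can mean, the sketch below splits a contiguous page frame range into the largest naturally aligned power-of-two blocks. It is an assumption-laden userspace illustration, not the series' implementation: SKETCH_MAX_ORDER is a made-up cap (the kernel's MAX_PAGE_ORDER depends on configuration), and block_order() and count_blocks() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cap for illustration; not the kernel's MAX_PAGE_ORDER. */
#define SKETCH_MAX_ORDER 10

/*
 * Largest order such that a block of (1 << order) pages starting at pfn
 * is naturally aligned and still fits into the remaining nr pages.
 */
static unsigned int block_order(unsigned long pfn, unsigned long nr)
{
	unsigned int order = 0;

	assert(nr > 0);
	while (order < SKETCH_MAX_ORDER &&
	       (pfn & ((1UL << (order + 1)) - 1)) == 0 &&
	       (1UL << (order + 1)) <= nr)
		order++;
	return order;
}

/* Count the blocks needed to cover the range [pfn, pfn + nr). */
static size_t count_blocks(unsigned long pfn, unsigned long nr)
{
	size_t blocks = 0;

	while (nr) {
		unsigned long step = 1UL << block_order(pfn, nr);

		/* A real implementation would free one block of this order. */
		pfn += step;
		nr -= step;
		blocks++;
	}
	return blocks;
}
```

Under these assumptions an aligned range of 1024 pages starting at frame 0 is covered by a single order-10 block, while the misaligned range of 1023 pages starting at frame 1 needs ten blocks (orders 0 through 9). Either way that is far fewer operations than one per page.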