mm/vmalloc: request large order pages from buddy allocator

[PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Vishal Moola (Oracle) 3 months, 2 weeks ago

Sometimes, vm_area_alloc_pages() will want many pages from the buddy
allocator. Rather than making requests to the buddy allocator for at
most 100 pages at a time, we can eagerly request large order pages a
smaller number of times.

We still split the large order pages down to order-0 as the rest of the
vmalloc code (and some callers) depend on it. We still defer to the bulk
allocator and fallback path in case of order-0 pages or failure.

Running 1000 iterations of allocations on a small 4GB system finds:

1000 2MB allocations:
	[Baseline]			[This patch]
	real    46.310s			real    0m34.582s
	user    0.001s			user    0.006s
	sys     46.058s			sys     0m34.365s

10000 200KB allocations:
	[Baseline]			[This patch]
	real    56.104s			real    0m43.696s
	user    0.001s			user    0.003s
	sys     55.375s			sys     0m42.995s

Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>

-----
RFC:
https://lore.kernel.org/linux-mm/20251014182754.4329-1-vishal.moola@gmail.com/

Changes since RFC:
  - Mask off NO_FAIL in large_gfp
  - Mask off GFP_COMP in large_gfp
  - Introduce nr_remaining variable to track total pages
  - Calculate large order as min(max_order, ilog2())
  - Attempt lower orders on failure before falling back to original path
  - Drop unnecessary fallback comment change

There was discussion about warning on and rejecting unsupported GFP
flags in vmalloc; I'll have a separate patch for that.
---
 mm/vmalloc.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index adde450ddf5e..0832f944544c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3619,8 +3619,44 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		unsigned int order, unsigned int nr_pages, struct page **pages)
 {
 	unsigned int nr_allocated = 0;
+	unsigned int nr_remaining = nr_pages;
+	unsigned int max_attempt_order = MAX_PAGE_ORDER;
 	struct page *page;
 	int i;
+	gfp_t large_gfp = (gfp &
+		~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL | __GFP_COMP))
+		| __GFP_NOWARN;
+	unsigned int large_order = ilog2(nr_remaining);
+
+	large_order = min(max_attempt_order, large_order);
+
+	/*
+	 * Initially, attempt to have the page allocator give us large order
+	 * pages. Do not attempt allocating smaller than order chunks since
+	 * __vmap_pages_range() expects physically contiguous pages of exactly
+	 * order long chunks.
+	 */
+	while (large_order > order && nr_remaining) {
+		if (nid == NUMA_NO_NODE)
+			page = alloc_pages_noprof(large_gfp, large_order);
+		else
+			page = alloc_pages_node_noprof(nid, large_gfp, large_order);
+
+		if (unlikely(!page)) {
+			max_attempt_order = --large_order;
+			continue;
+		}
+
+		split_page(page, large_order);
+		for (i = 0; i < (1U << large_order); i++)
+			pages[nr_allocated + i] = page + i;
+
+		nr_allocated += 1U << large_order;
+		nr_remaining = nr_pages - nr_allocated;
+
+		large_order = ilog2(nr_remaining);
+		large_order = min(max_attempt_order, large_order);
+	}
 
 	/*
 	 * For order-0 pages we make use of bulk allocator, if
-- 
2.51.0
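
As a rough illustration of the order-selection loop in the diff above, here is a minimal userspace sketch in C. It is not kernel code: sketch_ilog2(), sketch_alloc_ok(), SKETCH_MAX_ORDER and the fail_above parameter are hypothetical stand-ins (for ilog2(), the page allocator call and MAX_PAGE_ORDER), and no pages are actually allocated or split. It only counts how many pages the large-order pass would cover before the existing bulk/fallback path takes over.

```c
#include <stdbool.h>

/* Assumption: stand-in for MAX_PAGE_ORDER; the real value is config dependent. */
#define SKETCH_MAX_ORDER 10u

/* Integer log2 for n >= 1 (position of the highest set bit). */
static unsigned int sketch_ilog2(unsigned int n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* Hypothetical allocator model: every request above fail_above fails. */
static bool sketch_alloc_ok(unsigned int order, unsigned int fail_above)
{
	return order <= fail_above;
}

static unsigned int sketch_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Follow the same order selection as the loop in the patch.
 * Returns the number of pages covered by large-order chunks and
 * stores the number of successful large allocations in *chunks.
 */
static unsigned int sketch_large_pass(unsigned int nr_pages, unsigned int order,
				      unsigned int fail_above,
				      unsigned int *chunks)
{
	unsigned int nr_allocated = 0;
	unsigned int nr_remaining = nr_pages;
	unsigned int max_attempt_order = SKETCH_MAX_ORDER;
	unsigned int large_order;

	*chunks = 0;
	if (!nr_remaining)
		return 0;

	large_order = sketch_min(max_attempt_order, sketch_ilog2(nr_remaining));

	while (large_order > order && nr_remaining) {
		if (!sketch_alloc_ok(large_order, fail_above)) {
			/* Remember the failure so larger orders are not retried. */
			max_attempt_order = --large_order;
			continue;
		}

		/* The patch would split_page() here and fill pages[]. */
		nr_allocated += 1u << large_order;
		nr_remaining = nr_pages - nr_allocated;
		(*chunks)++;

		/* The sketch stops here so its log2 helper never sees 0. */
		if (!nr_remaining)
			break;

		large_order = sketch_min(max_attempt_order,
					 sketch_ilog2(nr_remaining));
	}

	return nr_allocated;
}
```

In this model, with order = 0, a 50-page request is covered as 32 + 16 + 2 pages when every order succeeds, while a 3-page request gets one order-1 chunk and leaves 1 page for the existing order-0 path.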

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Ryan Roberts 1 month, 4 weeks ago

Hi Vishal,


On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> allocator. Rather than making requests to the buddy allocator for at
> most 100 pages at a time, we can eagerly request large order pages a
> smaller number of times.
> 
> We still split the large order pages down to order-0 as the rest of the
> vmalloc code (and some callers) depend on it. We still defer to the bulk
> allocator and fallback path in case of order-0 pages or failure.
> 
> Running 1000 iterations of allocations on a small 4GB system finds:
> 
> 1000 2mb allocations:
> 	[Baseline]			[This patch]
> 	real    46.310s			real    0m34.582
> 	user    0.001s			user    0.006s
> 	sys     46.058s			sys     0m34.365s
> 
> 10000 200kb allocations:
> 	[Baseline]			[This patch]
> 	real    56.104s			real    0m43.696
> 	user    0.001s			user    0.003s
> 	sys     55.375s			sys     0m42.995s

I'm seeing some big vmalloc micro benchmark regressions on arm64, for which 
bisect is pointing to this patch.

The tests are all originally from the vmalloc_test module. Note that (R)
indicates a statistically significant regression and (I) indicates a
statistically significant improvement.

p is number of pages in the allocation, h is huge. So it looks like the 
regressions are all coming for the non-huge case, where we want to split to 
order-0.

+---------------------------------+----------------------------------------------------------+------------+------------------------+
| Benchmark                       | Result Class                                             |     6-18-0 |   6-18-0-gc2f2b01b74be |
+=================================+==========================================================+============+========================+
| micromm/vmalloc                 | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  514126.58 |            (R) -42.20% |
|                                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |  320458.33 |                 -0.02% |
|                                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  399680.33 |            (R) -23.43% |
|                                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  788723.25 |            (R) -23.66% |
|                                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |  979839.58 |                 -1.05% |
|                                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  481454.58 |            (R) -23.99% |
|                                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |  615924.00 |              (I) 2.56% |
|                                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 1799224.08 |            (R) -23.28% |
|                                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2313859.25 |              (I) 3.43% |
|                                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | 3541904.75 |            (R) -23.86% |
|                                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | 3597577.25 |             (R) -2.97% |
|                                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |  487021.83 |              (I) 4.95% |
|                                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  344466.33 |                 -0.65% |
|                                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  342484.25 |                 -1.58% |
|                                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | 4034901.17 |            (R) -25.35% |
|                                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |  195973.42 |                  0.57% |
|                                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  643489.33 |            (R) -47.63% |
|                                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 2029261.33 |            (R) -27.88% |
|                                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |   83557.08 |                 -0.22% |
+---------------------------------+----------------------------------------------------------+------------+------------------------+

I have a couple of thoughts from looking at the patch:

 - Perhaps split_page() is the bulk of the cost? Previously for this case we 
   were allocating order-0 so there was no split to do. For h=1, split would 
   have already been called so that would explain why no regression for that 
   case?

 - I guess we are bypassing the pcpu cache? Could this be having an effect? Dev 
   (cc'ed) did some similar investigation a while back and saw increased vmalloc 
   latencies when bypassing pcpu cache.

 - Philosophically is allocating physically contiguous memory when it is not 
   strictly needed the right thing to do? Large physically contiguous blocks are 
   a scarce resource so we don't want to waste them. Although I guess it could 
   be argued that this actually preserves the contiguous blocks because the 
   lifetime of all the pages is tied together. Anyway, I doubt this is the 
   reason for the slow down, since those benchmarks are not under memory 
   pressure.

Anyway, it would be good to resolve the performance regressions if we can.

Thanks,
Ryan

> 
> Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
> 
> -----
> RFC:
> https://lore.kernel.org/linux-mm/20251014182754.4329-1-vishal.moola@gmail.com/
> 
> Changes since rfc:
>   - Mask off NO_FAIL in large_gfp
>   - Mask off GFP_COMP in large_gfp
> There was discussion about warning on and rejecting unsupported GFP
> flags in vmalloc, I'll have a separate patch for that.
> 
>   - Introduce nr_remaining variable to track total pages
>   - Calculate large order as (min(max_order, ilog2())
>   - Attempt lower orders on failure before falling back to original path
>   - Drop unnecessary fallback comment change
> ---
>  mm/vmalloc.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index adde450ddf5e..0832f944544c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3619,8 +3619,44 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		unsigned int order, unsigned int nr_pages, struct page **pages)
>  {
>  	unsigned int nr_allocated = 0;
> +	unsigned int nr_remaining = nr_pages;
> +	unsigned int max_attempt_order = MAX_PAGE_ORDER;
>  	struct page *page;
>  	int i;
> +	gfp_t large_gfp = (gfp &
> +		~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL | __GFP_COMP))
> +		| __GFP_NOWARN;
> +	unsigned int large_order = ilog2(nr_remaining);
> +
> +	large_order = min(max_attempt_order, large_order);
> +
> +	/*
> +	 * Initially, attempt to have the page allocator give us large order
> +	 * pages. Do not attempt allocating smaller than order chunks since
> +	 * __vmap_pages_range() expects physically contigous pages of exactly
> +	 * order long chunks.
> +	 */
> +	while (large_order > order && nr_remaining) {
> +		if (nid == NUMA_NO_NODE)
> +			page = alloc_pages_noprof(large_gfp, large_order);
> +		else
> +			page = alloc_pages_node_noprof(nid, large_gfp, large_order);
> +
> +		if (unlikely(!page)) {
> +			max_attempt_order = --large_order;
> +			continue;
> +		}
> +
> +		split_page(page, large_order);
> +		for (i = 0; i < (1U << large_order); i++)
> +			pages[nr_allocated + i] = page + i;
> +
> +		nr_allocated += 1U << large_order;
> +		nr_remaining = nr_pages - nr_allocated;
> +
> +		large_order = ilog2(nr_remaining);
> +		large_order = min(max_attempt_order, large_order);
> +	}
>  
>  	/*
>  	 * For order-0 pages we make use of bulk allocator, if

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Vishal Moola (Oracle) 1 month, 4 weeks ago

On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
> Hi Vishal,
> 
> 
> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
> > Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> > allocator. Rather than making requests to the buddy allocator for at
> > most 100 pages at a time, we can eagerly request large order pages a
> > smaller number of times.
> > 
> > We still split the large order pages down to order-0 as the rest of the
> > vmalloc code (and some callers) depend on it. We still defer to the bulk
> > allocator and fallback path in case of order-0 pages or failure.
> > 
> > Running 1000 iterations of allocations on a small 4GB system finds:
> > 
> > 1000 2mb allocations:
> > 	[Baseline]			[This patch]
> > 	real    46.310s			real    0m34.582
> > 	user    0.001s			user    0.006s
> > 	sys     46.058s			sys     0m34.365s
> > 
> > 10000 200kb allocations:
> > 	[Baseline]			[This patch]
> > 	real    56.104s			real    0m43.696
> > 	user    0.001s			user    0.003s
> > 	sys     55.375s			sys     0m42.995s
> 
> I'm seeing some big vmalloc micro benchmark regressions on arm64, for which 
> bisect is pointing to this patch.

Ulad had similar findings/concerns[1]. TL;DR: the numbers you are seeing
are expected given how the test module is currently written.

> The tests are all originally from the vmalloc_test module. Note that (R) 
> indicates a statistically significant regression and (I) indicates a 
> statistically improvement.
> 
> p is number of pages in the allocation, h is huge. So it looks like the 
> regressions are all coming for the non-huge case, where we want to split to 
> order-0.
> 
> +---------------------------------+----------------------------------------------------------+------------+------------------------+
> | Benchmark                       | Result Class                                             |     6-18-0 |   6-18-0-gc2f2b01b74be |
> +=================================+==========================================================+============+========================+
> | micromm/vmalloc                 | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  514126.58 |            (R) -42.20% |
> |                                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |  320458.33 |                 -0.02% |
> |                                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  399680.33 |            (R) -23.43% |
> |                                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  788723.25 |            (R) -23.66% |
> |                                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |  979839.58 |                 -1.05% |
> |                                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  481454.58 |            (R) -23.99% |
> |                                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |  615924.00 |              (I) 2.56% |
> |                                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 1799224.08 |            (R) -23.28% |
> |                                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2313859.25 |              (I) 3.43% |
> |                                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | 3541904.75 |            (R) -23.86% |
> |                                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | 3597577.25 |             (R) -2.97% |
> |                                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |  487021.83 |              (I) 4.95% |
> |                                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  344466.33 |                 -0.65% |
> |                                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  342484.25 |                 -1.58% |
> |                                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | 4034901.17 |            (R) -25.35% |
> |                                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |  195973.42 |                  0.57% |
> |                                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  643489.33 |            (R) -47.63% |
> |                                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 2029261.33 |            (R) -27.88% |
> |                                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |   83557.08 |                 -0.22% |
> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>
> I have a couple of thoughts from looking at the patch:
> 
>  - Perhaps split_page() is the bulk of the cost? Previously for this case we 
>    were allocating order-0 so there was no split to do. For h=1, split would 
>    have already been called so that would explain why no regression for that 
>    case?

For h=1, this patch shouldn't change anything (as long as nr_pages <
arch_vmap_{pte,pmd}_supported_shift). This is why you don't see regressions
in those cases.

>  - I guess we are bypassing the pcpu cache? Could this be having an effect? Dev 
>    (cc'ed) did some similar investigation a while back and saw increased vmalloc 
>    latencies when bypassing pcpu cache.

I'd say this is more a case of this test module targeting the pcpu
cache. The module allocates then frees one at a time, which promotes
reusing pcpu pages. [1] has some numbers after modifying the test such
that all the allocations are made before freeing any.

>  - Philosophically is allocating physically contiguous memory when it is not 
>    strictly needed the right thing to do? Large physically contiguous blocks are 
>    a scarce resource so we don't want to waste them. Although I guess it could 
>    be argued that this actually preserves the contiguous blocks because the 
>    lifetime of all the pages is tied together. Anyway, I doubt this is the 

This was the primary incentive for this patch :)

>    reason for the slow down, since those benchmarks are not under memory 
>    pressure.
>
> Anyway, it would be good to resolve the performance regressions if we can.

Imo, the appropriate way to address these is to modify the test module
as seen in [1].

[1] https://lore.kernel.org/linux-mm/aPJ6lLf24TfW_1n7@milan/
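
The point above about the allocate-then-free pattern favouring the pcpu cache can be illustrated with a toy model. This is a hedged userspace sketch, not the kernel's per-cpu page list code: struct toy_pcp, TOY_CACHE_CAP and the two pattern functions are hypothetical, and the cache is reduced to a counter of free pages. It counts how many allocations miss the cache when each iteration frees before the next one allocates, versus when every allocation is made before anything is freed.

```c
/* Assumption: arbitrary capacity for the toy cache. */
#define TOY_CACHE_CAP 64u

struct toy_pcp {
	unsigned int cached;		/* free pages currently in the cache */
	unsigned long slow_allocs;	/* allocations that missed the cache */
};

/* Take a page from the cache if possible, else count a slow allocation. */
static void toy_alloc(struct toy_pcp *c)
{
	if (c->cached)
		c->cached--;
	else
		c->slow_allocs++;
}

/* Return a page to the cache; pages beyond the capacity are not cached. */
static void toy_free(struct toy_pcp *c)
{
	if (c->cached < TOY_CACHE_CAP)
		c->cached++;
}

/* Allocate `pages` pages, free them all, and repeat `iters` times. */
static unsigned long toy_interleaved(unsigned int iters, unsigned int pages)
{
	struct toy_pcp c = { 0, 0 };
	unsigned int i, j;

	for (i = 0; i < iters; i++) {
		for (j = 0; j < pages; j++)
			toy_alloc(&c);
		for (j = 0; j < pages; j++)
			toy_free(&c);
	}
	return c.slow_allocs;
}

/* Make every allocation first, then free everything at the end. */
static unsigned long toy_batched(unsigned int iters, unsigned int pages)
{
	struct toy_pcp c = { 0, 0 };
	unsigned int i, j;

	for (i = 0; i < iters; i++)
		for (j = 0; j < pages; j++)
			toy_alloc(&c);
	for (i = 0; i < iters; i++)
		for (j = 0; j < pages; j++)
			toy_free(&c);
	return c.slow_allocs;
}
```

In this model, 1000 iterations of 16 pages miss the cache 16 times in the interleaved pattern and 16000 times in the batched one, which is the kind of difference a benchmark that frees immediately would hide.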

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Ryan Roberts 1 month, 4 weeks ago

On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>> Hi Vishal,
>>
>>
>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>> allocator. Rather than making requests to the buddy allocator for at
>>> most 100 pages at a time, we can eagerly request large order pages a
>>> smaller number of times.
>>>
>>> We still split the large order pages down to order-0 as the rest of the
>>> vmalloc code (and some callers) depend on it. We still defer to the bulk
>>> allocator and fallback path in case of order-0 pages or failure.
>>>
>>> Running 1000 iterations of allocations on a small 4GB system finds:
>>>
>>> 1000 2mb allocations:
>>> 	[Baseline]			[This patch]
>>> 	real    46.310s			real    0m34.582
>>> 	user    0.001s			user    0.006s
>>> 	sys     46.058s			sys     0m34.365s
>>>
>>> 10000 200kb allocations:
>>> 	[Baseline]			[This patch]
>>> 	real    56.104s			real    0m43.696
>>> 	user    0.001s			user    0.003s
>>> 	sys     55.375s			sys     0m42.995s
>>
>> I'm seeing some big vmalloc micro benchmark regressions on arm64, for which 
>> bisect is pointing to this patch.
> 
> Ulad had similar findings/concerns[1]. Tldr: The numbers you are seeing
> are expected for how the test module is currently written.

Hmm... simplistically, I'd say that either the tests are bad, in which case they
should be deleted, or they are good, in which case we shouldn't ignore the
regressions. Having tests that we learn to ignore is the worst of both worlds.

But I see your point about the allocation pattern not being very realistic.

> 
>> The tests are all originally from the vmalloc_test module. Note that (R) 
>> indicates a statistically significant regression and (I) indicates a 
>> statistically improvement.
>>
>> p is number of pages in the allocation, h is huge. So it looks like the 
>> regressions are all coming for the non-huge case, where we want to split to 
>> order-0.
>>
>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>> | Benchmark                       | Result Class                                             |     6-18-0 |   6-18-0-gc2f2b01b74be |
>> +=================================+==========================================================+============+========================+
>> | micromm/vmalloc                 | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  514126.58 |            (R) -42.20% |
>> |                                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |  320458.33 |                 -0.02% |
>> |                                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  399680.33 |            (R) -23.43% |
>> |                                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  788723.25 |            (R) -23.66% |
>> |                                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |  979839.58 |                 -1.05% |
>> |                                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  481454.58 |            (R) -23.99% |
>> |                                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |  615924.00 |              (I) 2.56% |
>> |                                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 1799224.08 |            (R) -23.28% |
>> |                                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2313859.25 |              (I) 3.43% |
>> |                                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | 3541904.75 |            (R) -23.86% |
>> |                                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | 3597577.25 |             (R) -2.97% |
>> |                                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |  487021.83 |              (I) 4.95% |
>> |                                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  344466.33 |                 -0.65% |
>> |                                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  342484.25 |                 -1.58% |
>> |                                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | 4034901.17 |            (R) -25.35% |
>> |                                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |  195973.42 |                  0.57% |
>> |                                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  643489.33 |            (R) -47.63% |
>> |                                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 2029261.33 |            (R) -27.88% |
>> |                                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |   83557.08 |                 -0.22% |
>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>
>> I have a couple of thoughts from looking at the patch:
>>
>>  - Perhaps split_page() is the bulk of the cost? Previously for this case we 
>>    were allocating order-0 so there was no split to do. For h=1, split would 
>>    have already been called so that would explain why no regression for that 
>>    case?
> 
> For h=1, this patch shouldn't change (as long as nr_pages <
> arch_vmap_{pte,pmd}_supported_shift). This is why you don't see regressions
> in those cases.

arm64 supports 64K contiguous mappings with vmalloc, so once nr_pages >= 16 we can
take the huge path.
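
For reference, the threshold of 16 follows from the mapping size if 4K base pages are assumed (the mail does not state the page size). The names in this small C sketch are hypothetical and only spell out that arithmetic:

```c
/* Assumption: 4K base pages; the 64K figure comes from the text above. */
#define SKETCH_BASE_PAGE_SIZE	(4u * 1024u)
#define SKETCH_CONT_MAP_SIZE	(64u * 1024u)

/* Number of base pages that make up one contiguous mapping. */
static unsigned int sketch_pages_per_cont_map(void)
{
	return SKETCH_CONT_MAP_SIZE / SKETCH_BASE_PAGE_SIZE;
}

/* Hypothetical predicate: is the request large enough for the huge path? */
static int sketch_may_take_huge_path(unsigned int nr_pages)
{
	return nr_pages >= sketch_pages_per_cont_map();
}
```

Under that assumption the p:16, h:1 row in the table is the smallest test size that can reach the huge path.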

> 
>>  - I guess we are bypassing the pcpu cache? Could this be having an effect? Dev 
>>    (cc'ed) did some similar investigation a while back and saw increased vmalloc 
>>    latencies when bypassing pcpu cache.
> 
> I'd say this is more a case of this test module targeting the pcpu
> cache. The module allocates then frees one at a time, which promotes
> reusing pcpu pages. [1] Has some numbers after modifying the test such
> that all the allocations are made before freeing any.

OK fair enough.

We are seeing a bunch of other regressions in higher level benchmarks too; but
haven't yet concluded what's causing those. I'll report back if this patch looks
connected.

Thanks,
Ryan


> 
>>  - Philosophically is allocating physically contiguous memory when it is not 
>>    strictly needed the right thing to do? Large physically contiguous blocks are 
>>    a scarce resource so we don't want to waste them. Although I guess it could 
>>    be argued that this actually preserves the contiguous blocks because the 
>>    lifetime of all the pages is tied together. Anyway, I doubt this is the 
> 
> This was the primary incentive for this patch :)
> 
>>    reason for the slow down, since those benchmarks are not under memory 
>>    pressure.
>>
>> Anyway, it would be good to resolve the performance regressions if we can.
> 
> Imo, the appropriate way to address these is to modify the test module
> as seen in [1].
> 
> [1] https://lore.kernel.org/linux-mm/aPJ6lLf24TfW_1n7@milan/

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Andrew Morton 1 month, 4 weeks ago

On Thu, 11 Dec 2025 15:28:56 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:

> We are seeing a bunch of other regressions in higher level benchmarks too; but
> haven't yet concluded what's causing those. I'll report back if this patch looks
> connected.

Pretty please, would it be possible to do this testing *before* everything
hits mainline?

: From: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
: Subject: mm/vmalloc: request large order pages from buddy allocator
: Date: Tue, 21 Oct 2025 12:44:56 -0700

Thanks.

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Uladzislau Rezki 1 month, 4 weeks ago

On Thu, Dec 11, 2025 at 03:28:56PM +0000, Ryan Roberts wrote:
> On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
> > On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
> >> Hi Vishal,
> >>
> >>
> >> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
> >>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> >>> allocator. Rather than making requests to the buddy allocator for at
> >>> most 100 pages at a time, we can eagerly request large order pages a
> >>> smaller number of times.
> >>>
> >>> We still split the large order pages down to order-0 as the rest of the
> >>> vmalloc code (and some callers) depend on it. We still defer to the bulk
> >>> allocator and fallback path in case of order-0 pages or failure.
> >>>
> >>> Running 1000 iterations of allocations on a small 4GB system finds:
> >>>
> >>> 1000 2mb allocations:
> >>> 	[Baseline]			[This patch]
> >>> 	real    46.310s			real    0m34.582
> >>> 	user    0.001s			user    0.006s
> >>> 	sys     46.058s			sys     0m34.365s
> >>>
> >>> 10000 200kb allocations:
> >>> 	[Baseline]			[This patch]
> >>> 	real    56.104s			real    0m43.696
> >>> 	user    0.001s			user    0.003s
> >>> 	sys     55.375s			sys     0m42.995s
> >>
> >> I'm seeing some big vmalloc micro benchmark regressions on arm64, for which 
> >> bisect is pointing to this patch.
> > 
> > Ulad had similar findings/concerns[1]. Tldr: The numbers you are seeing
> > are expected for how the test module is currently written.
> 
> Hmm... simplistically, I'd say that either the tests are bad, in which case they
> should be deleted, or they are good, in which case we shouldn't ignore the
> regressions. Having tests that we learn to ignore is the worst of both worlds.
> 
Uh.. the tests are for measuring vmalloc performance and for stress testing. They
cannot just be removed :) In some sense they are synthetic; on the other hand, they
let us find problems and bottlenecks and measure perf. You have identified a
regression with them :)

I think the problem is in the get_page_from_freelist() call:

+   14.05%     0.11%  [kernel]          [k] remove_vm_area
+   11.85%     1.82%  [kernel]          [k] __alloc_frozen_pages_noprof
+   10.91%     0.36%  [kernel]          [k] __get_vm_area_node
+   10.60%     7.58%  [kernel]          [k] insert_vmap_area
+   10.02%     4.67%  [kernel]          [k] get_page_from_freelist

With the patch it adds 10% of cycles on top, whereas without the patch I
do not see the symbol at all, i.e. pages are obtained really fast from
the pcp list, not from the buddy allocator.

The question is: why do high-order pages not end up in the pcp cache?
I think it is because we split such pages and free them as order-0
pages.

Any thoughts?

--
Uladzislau Rezki
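
The hypothesis above (pages allocated at a high order but freed as order-0 never refill the list that the next high-order request looks at) can be sketched with a toy model. This is an illustrative userspace sketch only: struct toy_lists, TOY_NR_ORDERS and both pattern functions are hypothetical, and the model ignores list limits, draining and which orders the kernel really caches.

```c
/* Assumption: the toy model tracks orders 0..3 only. */
#define TOY_NR_ORDERS 4u

struct toy_lists {
	unsigned long cached[TOY_NR_ORDERS];	/* free blocks cached per order */
	unsigned long buddy_allocs;		/* requests that missed the cache */
};

/* Serve a request from the matching per-order list, else count a miss. */
static void toy_alloc_order(struct toy_lists *l, unsigned int order)
{
	if (l->cached[order])
		l->cached[order]--;
	else
		l->buddy_allocs++;
}

/* A freed block goes back to the list of the order it is freed at. */
static void toy_free_order(struct toy_lists *l, unsigned int order)
{
	l->cached[order]++;
}

/*
 * Patch-like pattern: allocate one block of `order`, then free it as
 * 2^order separate order-0 pages. `order` must be below TOY_NR_ORDERS.
 */
static unsigned long toy_split_pattern(unsigned int iters, unsigned int order)
{
	struct toy_lists l = { { 0 }, 0 };
	unsigned int i, j;

	for (i = 0; i < iters; i++) {
		toy_alloc_order(&l, order);
		for (j = 0; j < (1u << order); j++)
			toy_free_order(&l, 0);
	}
	return l.buddy_allocs;
}

/* Baseline-like pattern: allocate and free 2^order order-0 pages. */
static unsigned long toy_order0_pattern(unsigned int iters, unsigned int order)
{
	struct toy_lists l = { { 0 }, 0 };
	unsigned int i, j;

	for (i = 0; i < iters; i++) {
		for (j = 0; j < (1u << order); j++)
			toy_alloc_order(&l, 0);
		for (j = 0; j < (1u << order); j++)
			toy_free_order(&l, 0);
	}
	return l.buddy_allocs;
}
```

In this model, 1000 iterations at order 2 miss the cache on every iteration in the split pattern (1000 misses) but only 4 times in the order-0 pattern, which is consistent with get_page_from_freelist() showing up only with the patch applied.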

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Dev Jain 1 month, 4 weeks ago

On 11/12/25 9:09 pm, Uladzislau Rezki wrote:
> On Thu, Dec 11, 2025 at 03:28:56PM +0000, Ryan Roberts wrote:
>> On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
>>> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>>>> Hi Vishal,
>>>>
>>>>
>>>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>>>> allocator. Rather than making requests to the buddy allocator for at
>>>>> most 100 pages at a time, we can eagerly request large order pages a
>>>>> smaller number of times.
>>>>>
>>>>> We still split the large order pages down to order-0 as the rest of the
>>>>> vmalloc code (and some callers) depend on it. We still defer to the bulk
>>>>> allocator and fallback path in case of order-0 pages or failure.
>>>>>
>>>>> Running 1000 iterations of allocations on a small 4GB system finds:
>>>>>
>>>>> 1000 2mb allocations:
>>>>> 	[Baseline]			[This patch]
>>>>> 	real    46.310s			real    0m34.582
>>>>> 	user    0.001s			user    0.006s
>>>>> 	sys     46.058s			sys     0m34.365s
>>>>>
>>>>> 10000 200kb allocations:
>>>>> 	[Baseline]			[This patch]
>>>>> 	real    56.104s			real    0m43.696
>>>>> 	user    0.001s			user    0.003s
>>>>> 	sys     55.375s			sys     0m42.995s
>>>> I'm seeing some big vmalloc micro benchmark regressions on arm64, for which
>>>> bisect is pointing to this patch.
>>> Ulad had similar findings/concerns[1]. Tldr: The numbers you are seeing
>>> are expected for how the test module is currently written.
>> Hmm... simplistically, I'd say that either the tests are bad, in which case they
>> should be deleted, or they are good, in which case we shouldn't ignore the
>> regressions. Having tests that we learn to ignore is the worst of both worlds.
>>
> Uh.. Tests are for measure vmalloc performance and stressing. They can not be just
> removed :) In some sense they are synthetic, from the other hand they allow to find
> problems and bottle-necks + measure perf. You have identified regression with it :)
>
> I think, the problem is in the
>
> +   14.05%     0.11%  [kernel]          [k] remove_vm_area
> +   11.85%     1.82%  [kernel]          [k] __alloc_frozen_pages_noprof
> +   10.91%     0.36%  [kernel]          [k] __get_vm_area_node
> +   10.60%     7.58%  [kernel]          [k] insert_vmap_area
> +   10.02%     4.67%  [kernel]          [k] get_page_from_freelist
>
>
> get_page_from_freelist() call. With a patch it adds 10% of cycles on
> top whereas without patch i do not see the symbol at all, i.e. pages
> are obtained really fast from the pcp list, not from the body.
>
> The question is, why high-order pages are not end-up in the pcp-cache?
> I think it is due to the fact, that we split such pages and freeing them
> as order-0 one.

Please take a look at my RFC:

https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@arm.com/

You are right, we allocate large folios but then split them up and free
them as base pages. In patch 2 I have shown (not rigorously) that pcp
draining is one of the issues.

>
> Any thoughts?
>
> --
> Uladzislau Rezki
>

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Uladzislau Rezki 1 month, 4 weeks ago

On Thu, Dec 11, 2025 at 09:13:28PM +0530, Dev Jain wrote:
> 
> On 11/12/25 9:09 pm, Uladzislau Rezki wrote:
> > On Thu, Dec 11, 2025 at 03:28:56PM +0000, Ryan Roberts wrote:
> > > On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
> > > > On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
> > > > > Hi Vishal,
> > > > > 
> > > > > 
> > > > > On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
> > > > > > Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> > > > > > allocator. Rather than making requests to the buddy allocator for at
> > > > > > most 100 pages at a time, we can eagerly request large order pages a
> > > > > > smaller number of times.
> > > > > > 
> > > > > > We still split the large order pages down to order-0 as the rest of the
> > > > > > vmalloc code (and some callers) depend on it. We still defer to the bulk
> > > > > > allocator and fallback path in case of order-0 pages or failure.
> > > > > > 
> > > > > > Running 1000 iterations of allocations on a small 4GB system finds:
> > > > > > 
> > > > > > 1000 2mb allocations:
> > > > > > 	[Baseline]			[This patch]
> > > > > > 	real    46.310s			real    0m34.582
> > > > > > 	user    0.001s			user    0.006s
> > > > > > 	sys     46.058s			sys     0m34.365s
> > > > > > 
> > > > > > 10000 200kb allocations:
> > > > > > 	[Baseline]			[This patch]
> > > > > > 	real    56.104s			real    0m43.696
> > > > > > 	user    0.001s			user    0.003s
> > > > > > 	sys     55.375s			sys     0m42.995s
> > > > > I'm seeing some big vmalloc micro benchmark regressions on arm64, for which
> > > > > bisect is pointing to this patch.
> > > > Ulad had similar findings/concerns[1]. Tldr: The numbers you are seeing
> > > > are expected for how the test module is currently written.
> > > Hmm... simplistically, I'd say that either the tests are bad, in which case they
> > > should be deleted, or they are good, in which case we shouldn't ignore the
> > > regressions. Having tests that we learn to ignore is the worst of both worlds.
> > > 
> > Uh.. Tests are for measure vmalloc performance and stressing. They can not be just
> > removed :) In some sense they are synthetic, from the other hand they allow to find
> > problems and bottle-necks + measure perf. You have identified regression with it :)
> > 
> > I think, the problem is in the
> > 
> > +   14.05%     0.11%  [kernel]          [k] remove_vm_area
> > +   11.85%     1.82%  [kernel]          [k] __alloc_frozen_pages_noprof
> > +   10.91%     0.36%  [kernel]          [k] __get_vm_area_node
> > +   10.60%     7.58%  [kernel]          [k] insert_vmap_area
> > +   10.02%     4.67%  [kernel]          [k] get_page_from_freelist
> > 
> > 
> > get_page_from_freelist() call. With a patch it adds 10% of cycles on
> > top whereas without patch i do not see the symbol at all, i.e. pages
> > are obtained really fast from the pcp list, not from the buddy.
> > 
> > The question is, why high-order pages are not end-up in the pcp-cache?
> > I think it is due to the fact, that we split such pages and freeing them
> > as order-0 one.
> 
> Please take a look at my RFC:
> 
> https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@arm.com/
> 
> You are right, we allocate large folios but then split them up and free
> them as basepages. In patch 2 I have proved (not rigorously) that pcp
> draining is one of the issues.
> 
You sent out the RFC on 12 Nov :-/ I missed those two patches from you,
even though you put me in "To".

I appreciate you pointing me to your work. Let me have a look at it.

Could you please resend the RFC rebased on the latest code base?

--
Uladzislau Rezki

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Dev Jain 1 month, 4 weeks ago

On 11/12/25 9:54 pm, Uladzislau Rezki wrote:
> On Thu, Dec 11, 2025 at 09:13:28PM +0530, Dev Jain wrote:
>> On 11/12/25 9:09 pm, Uladzislau Rezki wrote:
>>> On Thu, Dec 11, 2025 at 03:28:56PM +0000, Ryan Roberts wrote:
>>>> On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
>>>>> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>>>>>> Hi Vishal,
>>>>>>
>>>>>>
>>>>>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>>>>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>>>>>> allocator. Rather than making requests to the buddy allocator for at
>>>>>>> most 100 pages at a time, we can eagerly request large order pages a
>>>>>>> smaller number of times.
>>>>>>>
>>>>>>> We still split the large order pages down to order-0 as the rest of the
>>>>>>> vmalloc code (and some callers) depend on it. We still defer to the bulk
>>>>>>> allocator and fallback path in case of order-0 pages or failure.
>>>>>>>
>>>>>>> Running 1000 iterations of allocations on a small 4GB system finds:
>>>>>>>
>>>>>>> 1000 2mb allocations:
>>>>>>> 	[Baseline]			[This patch]
>>>>>>> 	real    46.310s			real    0m34.582
>>>>>>> 	user    0.001s			user    0.006s
>>>>>>> 	sys     46.058s			sys     0m34.365s
>>>>>>>
>>>>>>> 10000 200kb allocations:
>>>>>>> 	[Baseline]			[This patch]
>>>>>>> 	real    56.104s			real    0m43.696
>>>>>>> 	user    0.001s			user    0.003s
>>>>>>> 	sys     55.375s			sys     0m42.995s
>>>>>> I'm seeing some big vmalloc micro benchmark regressions on arm64, for which
>>>>>> bisect is pointing to this patch.
>>>>> Ulad had similar findings/concerns[1]. Tldr: The numbers you are seeing
>>>>> are expected for how the test module is currently written.
>>>> Hmm... simplistically, I'd say that either the tests are bad, in which case they
>>>> should be deleted, or they are good, in which case we shouldn't ignore the
>>>> regressions. Having tests that we learn to ignore is the worst of both worlds.
>>>>
>>> Uh.. Tests are for measure vmalloc performance and stressing. They can not be just
>>> removed :) In some sense they are synthetic, from the other hand they allow to find
>>> problems and bottle-necks + measure perf. You have identified regression with it :)
>>>
>>> I think, the problem is in the
>>>
>>> +   14.05%     0.11%  [kernel]          [k] remove_vm_area
>>> +   11.85%     1.82%  [kernel]          [k] __alloc_frozen_pages_noprof
>>> +   10.91%     0.36%  [kernel]          [k] __get_vm_area_node
>>> +   10.60%     7.58%  [kernel]          [k] insert_vmap_area
>>> +   10.02%     4.67%  [kernel]          [k] get_page_from_freelist
>>>
>>>
>>> get_page_from_freelist() call. With a patch it adds 10% of cycles on
>>> top whereas without patch i do not see the symbol at all, i.e. pages
>>> are obtained really fast from the pcp list, not from the buddy.
>>>
>>> The question is, why high-order pages are not end-up in the pcp-cache?
>>> I think it is due to the fact, that we split such pages and freeing them
>>> as order-0 one.
>> Please take a look at my RFC:
>>
>> https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@arm.com/
>>
>> You are right, we allocate large folios but then split them up and free
>> them as basepages. In patch 2 I have proved (not rigorously) that pcp
>> draining is one of the issues.
>>
> You sent out the RFC on 12 Nov :-/ I missed those two patches from you,
> even though you put me in "To".
>
> I appreciate you pointing me to your work. Let me have a look at it.
>
> Could you please resend the RFC rebased on the latest code base?

Yup, I'll do that. I was trying to get some perf numbers from LTP - fsstress,
but the variance seems to be high on the system I am testing on. I would
appreciate it if you or someone else could run some benchmarks (filesystem
workloads are what I believe would benefit).

>
> --
> Uladzislau Rezki
>

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Dev Jain 1 month, 4 weeks ago

On 11/12/25 8:58 pm, Ryan Roberts wrote:
> On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
>> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>>> Hi Vishal,
>>>
>>>
>>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>>> allocator. Rather than making requests to the buddy allocator for at
>>>> most 100 pages at a time, we can eagerly request large order pages a
>>>> smaller number of times.
>>>>
>>>> We still split the large order pages down to order-0 as the rest of the
>>>> vmalloc code (and some callers) depend on it. We still defer to the bulk
>>>> allocator and fallback path in case of order-0 pages or failure.
>>>>
>>>> Running 1000 iterations of allocations on a small 4GB system finds:
>>>>
>>>> 1000 2mb allocations:
>>>> 	[Baseline]			[This patch]
>>>> 	real    46.310s			real    0m34.582
>>>> 	user    0.001s			user    0.006s
>>>> 	sys     46.058s			sys     0m34.365s
>>>>
>>>> 10000 200kb allocations:
>>>> 	[Baseline]			[This patch]
>>>> 	real    56.104s			real    0m43.696
>>>> 	user    0.001s			user    0.003s
>>>> 	sys     55.375s			sys     0m42.995s
>>> I'm seeing some big vmalloc micro benchmark regressions on arm64, for which
>>> bisect is pointing to this patch.
>> Ulad had similar findings/concerns[1]. Tldr: The numbers you are seeing
>> are expected for how the test module is currently written.
> Hmm... simplistically, I'd say that either the tests are bad, in which case they
> should be deleted, or they are good, in which case we shouldn't ignore the
> regressions. Having tests that we learn to ignore is the worst of both worlds.

AFAICR the test does some million-odd iterations by default, which is the real problem.
On my RFC [1] I notice that reducing the iterations reduces the regression - till
some multiple of ten thousand iterations, the regression is zero. Doing this
alloc->free a million freaking times messes up the buddy badly.

[1] https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@arm.com/

>
> But I see your point about the allocation pattern not being very realistic.
>
>>> The tests are all originally from the vmalloc_test module. Note that (R)
>>> indicates a statistically significant regression and (I) indicates a
>>> statistically improvement.
>>>
>>> p is number of pages in the allocation, h is huge. So it looks like the
>>> regressions are all coming for the non-huge case, where we want to split to
>>> order-0.
>>>
>>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>> | Benchmark                       | Result Class                                             |     6-18-0 |   6-18-0-gc2f2b01b74be |
>>> +=================================+==========================================================+============+========================+
>>> | micromm/vmalloc                 | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  514126.58 |            (R) -42.20% |
>>> |                                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |  320458.33 |                 -0.02% |
>>> |                                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  399680.33 |            (R) -23.43% |
>>> |                                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  788723.25 |            (R) -23.66% |
>>> |                                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |  979839.58 |                 -1.05% |
>>> |                                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  481454.58 |            (R) -23.99% |
>>> |                                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |  615924.00 |              (I) 2.56% |
>>> |                                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 1799224.08 |            (R) -23.28% |
>>> |                                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2313859.25 |              (I) 3.43% |
>>> |                                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | 3541904.75 |            (R) -23.86% |
>>> |                                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | 3597577.25 |             (R) -2.97% |
>>> |                                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |  487021.83 |              (I) 4.95% |
>>> |                                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  344466.33 |                 -0.65% |
>>> |                                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  342484.25 |                 -1.58% |
>>> |                                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | 4034901.17 |            (R) -25.35% |
>>> |                                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |  195973.42 |                  0.57% |
>>> |                                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  643489.33 |            (R) -47.63% |
>>> |                                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 2029261.33 |            (R) -27.88% |
>>> |                                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |   83557.08 |                 -0.22% |
>>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>>
>>> I have a couple of thoughts from looking at the patch:
>>>
>>>   - Perhaps split_page() is the bulk of the cost? Previously for this case we
>>>     were allocating order-0 so there was no split to do. For h=1, split would
>>>     have already been called so that would explain why no regression for that
>>>     case?
>> For h=1, this patch shouldn't change (as long as nr_pages <
>> arch_vmap_{pte,pmd}_supported_shift). This is why you don't see regressions
>> in those cases.
> arm64 supports 64K contigous-mappings with vmalloc so once nr_pages >= 16 we can
> take the huge path.
>
>>>   - I guess we are bypassing the pcpu cache? Could this be having an effect? Dev
>>>     (cc'ed) did some similar investigation a while back and saw increased vmalloc
>>>     latencies when bypassing pcpu cache.
>> I'd say this is more a case of this test module targeting the pcpu
>> cache. The module allocates then frees one at a time, which promotes
>> reusing pcpu pages. [1] Has some numbers after modifying the test such
>> that all the allocations are made before freeing any.
> OK fair enough.
>
> We are seeing a bunch of other regressions in higher level benchmarks too; but
> haven't yet concluded what's causing those. I'll report back if this patch looks
> connected.
>
> Thanks,
> Ryan
>
>
>>>   - Philosophically is allocating physically contiguous memory when it is not
>>>     strictly needed the right thing to do? Large physically contiguous blocks are
>>>     a scarce resource so we don't want to waste them. Although I guess it could
>>>     be argued that this actually preserves the contiguous blocks because the
>>>     lifetime of all the pages is tied together. Anyway, I doubt this is the
>> This was the primary incentive for this patch :)
>>
>>>     reason for the slow down, since those benchmarks are not under memory
>>>     pressure.
>>>
>>> Anyway, it would be good to resolve the performance regressions if we can.
>> Imo, the appropriate way to address these is to modify the test module
>> as seen in [1].
>>
>> [1] https://lore.kernel.org/linux-mm/aPJ6lLf24TfW_1n7@milan/
>

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Dev Jain 1 month, 4 weeks ago

On 11/12/25 9:05 pm, Dev Jain wrote:
>
> On 11/12/25 8:58 pm, Ryan Roberts wrote:
>> On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
>>> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>>>> Hi Vishal,
>>>>
>>>>
>>>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>>>> allocator. Rather than making requests to the buddy allocator for at
>>>>> most 100 pages at a time, we can eagerly request large order pages a
>>>>> smaller number of times.
>>>>>
>>>>> We still split the large order pages down to order-0 as the rest of the
>>>>> vmalloc code (and some callers) depend on it. We still defer to the bulk
>>>>> allocator and fallback path in case of order-0 pages or failure.
>>>>>
>>>>> Running 1000 iterations of allocations on a small 4GB system finds:
>>>>>
>>>>> 1000 2mb allocations:
>>>>>     [Baseline]            [This patch]
>>>>>     real    46.310s            real    0m34.582
>>>>>     user    0.001s            user    0.006s
>>>>>     sys     46.058s            sys     0m34.365s
>>>>>
>>>>> 10000 200kb allocations:
>>>>>     [Baseline]            [This patch]
>>>>>     real    56.104s            real    0m43.696
>>>>>     user    0.001s            user    0.003s
>>>>>     sys     55.375s            sys     0m42.995s
>>>> I'm seeing some big vmalloc micro benchmark regressions on arm64, for which
>>>> bisect is pointing to this patch.
>>> Ulad had similar findings/concerns[1]. Tldr: The numbers you are seeing
>>> are expected for how the test module is currently written.
>> Hmm... simplistically, I'd say that either the tests are bad, in which case they
>> should be deleted, or they are good, in which case we shouldn't ignore the
>> regressions. Having tests that we learn to ignore is the worst of both worlds.
>
> AFAICR the test does some million-odd iterations by default, which is the real problem.
> On my RFC [1] I notice that reducing the iterations reduces the regression - till
> some multiple of ten thousand iterations, the regression is zero. Doing this
> alloc->free a million freaking times messes up the buddy badly.
>
> [1] https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@arm.com/


So this line:

__param(int, test_loop_count, 1000000,
         "Set test loop counter");

We should just change it to 20k or something and that should resolve it.


>
>>
>> But I see your point about the allocation pattern not being very realistic.
>>
>>>> The tests are all originally from the vmalloc_test module. Note that (R)
>>>> indicates a statistically significant regression and (I) indicates a
>>>> statistically improvement.
>>>>
>>>> p is number of pages in the allocation, h is huge. So it looks like the
>>>> regressions are all coming for the non-huge case, where we want to split to
>>>> order-0.
>>>>
>>>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>>> | Benchmark                       | Result Class                                             |     6-18-0 |   6-18-0-gc2f2b01b74be |
>>>> +=================================+==========================================================+============+========================+
>>>> | micromm/vmalloc                 | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  514126.58 |            (R) -42.20% |
>>>> |                                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |  320458.33 |                 -0.02% |
>>>> |                                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  399680.33 |            (R) -23.43% |
>>>> |                                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  788723.25 |            (R) -23.66% |
>>>> |                                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |  979839.58 |                 -1.05% |
>>>> |                                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  481454.58 |            (R) -23.99% |
>>>> |                                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |  615924.00 |              (I) 2.56% |
>>>> |                                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 1799224.08 |            (R) -23.28% |
>>>> |                                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2313859.25 |              (I) 3.43% |
>>>> |                                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | 3541904.75 |            (R) -23.86% |
>>>> |                                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | 3597577.25 |             (R) -2.97% |
>>>> |                                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |  487021.83 |              (I) 4.95% |
>>>> |                                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  344466.33 |                 -0.65% |
>>>> |                                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  342484.25 |                 -1.58% |
>>>> |                                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | 4034901.17 |            (R) -25.35% |
>>>> |                                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |  195973.42 |                  0.57% |
>>>> |                                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  643489.33 |            (R) -47.63% |
>>>> |                                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 2029261.33 |            (R) -27.88% |
>>>> |                                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |   83557.08 |                 -0.22% |
>>>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>>>
>>>> I have a couple of thoughts from looking at the patch:
>>>>
>>>>   - Perhaps split_page() is the bulk of the cost? Previously for this case we
>>>>     were allocating order-0 so there was no split to do. For h=1, split would
>>>>     have already been called so that would explain why no regression for that
>>>>     case?
>>> For h=1, this patch shouldn't change (as long as nr_pages <
>>> arch_vmap_{pte,pmd}_supported_shift). This is why you don't see regressions
>>> in those cases.
>> arm64 supports 64K contigous-mappings with vmalloc so once nr_pages >= 16 we can
>> take the huge path.
>>
>>>>   - I guess we are bypassing the pcpu cache? Could this be having an effect? Dev
>>>>     (cc'ed) did some similar investigation a while back and saw increased vmalloc
>>>>     latencies when bypassing pcpu cache.
>>> I'd say this is more a case of this test module targeting the pcpu
>>> cache. The module allocates then frees one at a time, which promotes
>>> reusing pcpu pages. [1] Has some numbers after modifying the test such
>>> that all the allocations are made before freeing any.
>> OK fair enough.
>>
>> We are seeing a bunch of other regressions in higher level benchmarks too; but
>> haven't yet concluded what's causing those. I'll report back if this patch looks
>> connected.
>>
>> Thanks,
>> Ryan
>>
>>
>>>>   - Philosophically is allocating physically contiguous memory when it is not
>>>>     strictly needed the right thing to do? Large physically contiguous blocks are
>>>>     a scarce resource so we don't want to waste them. Although I guess it could
>>>>     be argued that this actually preserves the contiguous blocks because the
>>>>     lifetime of all the pages is tied together. Anyway, I doubt this is the
>>> This was the primary incentive for this patch :)
>>>
>>>>     reason for the slow down, since those benchmarks are not under memory
>>>>     pressure.
>>>>
>>>> Anyway, it would be good to resolve the performance regressions if we can.
>>> Imo, the appropriate way to address these is to modify the test module
>>> as seen in [1].
>>>
>>> [1] https://lore.kernel.org/linux-mm/aPJ6lLf24TfW_1n7@milan/
>>
>

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Uladzislau Rezki 3 months, 2 weeks ago

On Tue, Oct 21, 2025 at 12:44:56PM -0700, Vishal Moola (Oracle) wrote:
> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> allocator. Rather than making requests to the buddy allocator for at
> most 100 pages at a time, we can eagerly request large order pages a
> smaller number of times.
> 
> We still split the large order pages down to order-0 as the rest of the
> vmalloc code (and some callers) depend on it. We still defer to the bulk
> allocator and fallback path in case of order-0 pages or failure.
> 
> Running 1000 iterations of allocations on a small 4GB system finds:
> 
> 1000 2mb allocations:
> 	[Baseline]			[This patch]
> 	real    46.310s			real    0m34.582
> 	user    0.001s			user    0.006s
> 	sys     46.058s			sys     0m34.365s
> 
> 10000 200kb allocations:
> 	[Baseline]			[This patch]
> 	real    56.104s			real    0m43.696
> 	user    0.001s			user    0.003s
> 	sys     55.375s			sys     0m42.995s
> 
> Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
> 
> -----
> RFC:
> https://lore.kernel.org/linux-mm/20251014182754.4329-1-vishal.moola@gmail.com/
> 
> Changes since rfc:
>   - Mask off NO_FAIL in large_gfp
>   - Mask off GFP_COMP in large_gfp
> There was discussion about warning on and rejecting unsupported GFP
> flags in vmalloc, I'll have a separate patch for that.
> 
>   - Introduce nr_remaining variable to track total pages
>   - Calculate large order as (min(max_order, ilog2())
>   - Attempt lower orders on failure before falling back to original path
>   - Drop unnecessary fallback comment change
> ---
>  mm/vmalloc.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index adde450ddf5e..0832f944544c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3619,8 +3619,44 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		unsigned int order, unsigned int nr_pages, struct page **pages)
>  {
>  	unsigned int nr_allocated = 0;
> +	unsigned int nr_remaining = nr_pages;
> +	unsigned int max_attempt_order = MAX_PAGE_ORDER;
>  	struct page *page;
>  	int i;
> +	gfp_t large_gfp = (gfp &
> +		~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL | __GFP_COMP))
> +		| __GFP_NOWARN;
> +	unsigned int large_order = ilog2(nr_remaining);
> +
> +	large_order = min(max_attempt_order, large_order);
> +
> +	/*
> +	 * Initially, attempt to have the page allocator give us large order
> +	 * pages. Do not attempt allocating smaller than order chunks since
> +	 * __vmap_pages_range() expects physically contigous pages of exactly
> +	 * order long chunks.
> +	 */
> +	while (large_order > order && nr_remaining) {
> +		if (nid == NUMA_NO_NODE)
> +			page = alloc_pages_noprof(large_gfp, large_order);
> +		else
> +			page = alloc_pages_node_noprof(nid, large_gfp, large_order);
> +
> +		if (unlikely(!page)) {
> +			max_attempt_order = --large_order;
> +			continue;
> +		}
> +
> +		split_page(page, large_order);
> +		for (i = 0; i < (1U << large_order); i++)
> +			pages[nr_allocated + i] = page + i;
> +
> +		nr_allocated += 1U << large_order;
> +		nr_remaining = nr_pages - nr_allocated;
> +
> +		large_order = ilog2(nr_remaining);
> +		large_order = min(max_attempt_order, large_order);
> +	}
>  
>  	/*
>  	 * For order-0 pages we make use of bulk allocator, if
> -- 
> 2.51.0
> 
I like the idea of allocating pages at larger orders :)

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
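
For readers following the control flow of the quoted hunk, here is a rough
user-space model of the retry logic: try the largest order that fits, lower
the cap on failure, and never retry above a cap that already failed. It is
only a sketch. model_alloc(), largest_free_order, floor_log2(), min_uint()
and large_order_pass() are hypothetical names invented for illustration, no
real pages are allocated, and the early break when nothing remains is an
addition so the sketch never takes the log of zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Floor of log2(n) for n > 0; user-space stand-in for the kernel's ilog2(). */
static unsigned int floor_log2(unsigned int n)
{
	unsigned int order = 0;

	assert(n > 0);
	while (n > 1) {
		n >>= 1;
		order++;
	}
	return order;
}

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Hypothetical allocator model: any request above largest_free_order fails. */
static bool model_alloc(unsigned int order, unsigned int largest_free_order)
{
	return order <= largest_free_order;
}

/*
 * Model of the large-order pass. Returns how many pages were obtained in
 * chunks larger than 'order'; whatever is left over would go to the
 * pre-existing bulk/fallback path. *nr_attempts counts allocator calls.
 */
static unsigned int large_order_pass(unsigned int nr_pages, unsigned int order,
				     unsigned int max_order,
				     unsigned int largest_free_order,
				     unsigned int *nr_attempts)
{
	unsigned int nr_allocated = 0;
	unsigned int nr_remaining = nr_pages;
	unsigned int max_attempt_order = max_order;
	unsigned int large_order;

	*nr_attempts = 0;
	if (!nr_remaining)
		return 0;

	large_order = min_uint(max_attempt_order, floor_log2(nr_remaining));
	while (large_order > order && nr_remaining) {
		(*nr_attempts)++;
		if (!model_alloc(large_order, largest_free_order)) {
			/* Remember the failure: do not try this order again. */
			max_attempt_order = --large_order;
			continue;
		}

		nr_allocated += 1U << large_order;
		nr_remaining = nr_pages - nr_allocated;
		if (!nr_remaining)
			break;
		large_order = min_uint(max_attempt_order, floor_log2(nr_remaining));
	}
	return nr_allocated;
}
```

In this model, a 512-page request against an allocator that can only serve
order-4 blocks costs five failed attempts (orders 9 down to 5) followed by 32
successful order-4 allocations, instead of 512 single-page requests.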

--
Uladzislau Rezki

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Andrew Morton 3 months, 2 weeks ago

On Tue, 21 Oct 2025 12:44:56 -0700 "Vishal Moola (Oracle)" <vishal.moola@gmail.com> wrote:

> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> allocator. Rather than making requests to the buddy allocator for at
> most 100 pages at a time, we can eagerly request large order pages a
> smaller number of times.

Does this have potential to inadvertently reduce the availability of
hugepages?

> We still split the large order pages down to order-0 as the rest of the
> vmalloc code (and some callers) depend on it. We still defer to the bulk
> allocator and fallback path in case of order-0 pages or failure.
> 
> Running 1000 iterations of allocations on a small 4GB system finds:
> 
> 1000 2mb allocations:
> 	[Baseline]			[This patch]
> 	real    46.310s			real    0m34.582
> 	user    0.001s			user    0.006s
> 	sys     46.058s			sys     0m34.365s
> 
> 10000 200kb allocations:
> 	[Baseline]			[This patch]
> 	real    56.104s			real    0m43.696
> 	user    0.001s			user    0.003s
> 	sys     55.375s			sys     0m42.995s

Nice, but how significant is this change likely to be for a real workload?

> ...
>
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3619,8 +3619,44 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		unsigned int order, unsigned int nr_pages, struct page **pages)
>  {
>  	unsigned int nr_allocated = 0;
> +	unsigned int nr_remaining = nr_pages;
> +	unsigned int max_attempt_order = MAX_PAGE_ORDER;
>  	struct page *page;
>  	int i;
> +	gfp_t large_gfp = (gfp &
> +		~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL | __GFP_COMP))
> +		| __GFP_NOWARN;

Gee, why is this so complicated?

> +	unsigned int large_order = ilog2(nr_remaining);

Should nr_remaining be rounded up to next-power-of-two?

> +	large_order = min(max_attempt_order, large_order);
> +
> +	/*
> +	 * Initially, attempt to have the page allocator give us large order
> +	 * pages. Do not attempt allocating smaller than order chunks since
> +	 * __vmap_pages_range() expects physically contigous pages of exactly
> +	 * order long chunks.
> +	 */
> +	while (large_order > order && nr_remaining) {
> +		if (nid == NUMA_NO_NODE)
> +			page = alloc_pages_noprof(large_gfp, large_order);
> +		else
> +			page = alloc_pages_node_noprof(nid, large_gfp, large_order);
> +
> +		if (unlikely(!page)) {
> +			max_attempt_order = --large_order;
> +			continue;
> +		}
> +
> +		split_page(page, large_order);
> +		for (i = 0; i < (1U << large_order); i++)
> +			pages[nr_allocated + i] = page + i;
> +
> +		nr_allocated += 1U << large_order;
> +		nr_remaining = nr_pages - nr_allocated;
> +
> +		large_order = ilog2(nr_remaining);
> +		large_order = min(max_attempt_order, large_order);
> +	}
>  
>  	/*
>  	 * For order-0 pages we make use of bulk allocator, if
> -- 
> 2.51.0

Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

Posted by Matthew Wilcox 3 months, 2 weeks ago

On Tue, Oct 21, 2025 at 02:24:36PM -0700, Andrew Morton wrote:
> On Tue, 21 Oct 2025 12:44:56 -0700 "Vishal Moola (Oracle)" <vishal.moola@gmail.com> wrote:
> 
> > Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> > allocator. Rather than making requests to the buddy allocator for at
> > most 100 pages at a time, we can eagerly request large order pages a
> > smaller number of times.
> 
> Does this have potential to inadvertently reduce the availability of
> hugepages?

Quite the opposite.  Let's say we're doing a 40KiB allocation.  If we
just take the first 10 pages off the PCP list, those could be from
ten different 2MB chunks, preventing ten different hugepages from
forming until the allocation succeeds.  If instead we do an order-3
allocation and an order-1 allocation, those can be from at most two
different 2MB chunks and prevent at most two hugepages from forming.

> > 1000 2mb allocations:
> > 	[Baseline]			[This patch]
> > 	real    46.310s			real    0m34.582
> > 	user    0.001s			user    0.006s
> > 	sys     46.058s			sys     0m34.365s
> > 
> > 10000 200kb allocations:
> > 	[Baseline]			[This patch]
> > 	real    56.104s			real    0m43.696
> > 	user    0.001s			user    0.003s
> > 	sys     55.375s			sys     0m42.995s
> 
> Nice, but how significant is this change likely to be for a real workload?

Ulad has numbers for the last iteration of this patch showing an
improvement for a 16KiB allocation, which translates into an improvement
for fork() now that we all have VMAP_STACK.

> > +	gfp_t large_gfp = (gfp &
> > +		~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL | __GFP_COMP))
> > +		| __GFP_NOWARN;
> 
> Gee, why is this so complicated?

Because GFP flags suck as an interface?  Look at kmalloc_gfp_adjust().
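
To make the masking expression concrete, here is a small user-space sketch of
the same shape: clear three bits, set one, and leave everything else alone.
The X_* bit values and the gfp_flags_t typedef are hypothetical placeholders,
not the kernel's real definitions. The reading of intent in the comments is
partly an assumption: the changelog only says that NO_FAIL and GFP_COMP are
masked off.

```c
#include <assert.h>

typedef unsigned int gfp_flags_t;	/* hypothetical stand-in for gfp_t */

/* Illustrative bit positions only; the real values live in the kernel headers. */
#define X_DIRECT_RECLAIM	0x01u
#define X_NOFAIL		0x02u
#define X_COMP			0x04u
#define X_NOWARN		0x08u
#define X_ZERO			0x10u

/*
 * Derive the flags for the opportunistic large-order attempt:
 *  - drop direct reclaim and NOFAIL (assumed intent: the attempt should
 *    fail quickly, since a fallback path exists),
 *  - drop COMP, as the resulting block is split into order-0 pages,
 *  - add NOWARN so an expected failure stays silent.
 * All other caller-supplied bits pass through untouched.
 */
static gfp_flags_t large_gfp_from(gfp_flags_t gfp)
{
	gfp_flags_t out;

	out = (gfp & ~(X_DIRECT_RECLAIM | X_NOFAIL | X_COMP)) | X_NOWARN;
	assert((out & (X_DIRECT_RECLAIM | X_NOFAIL | X_COMP)) == 0);
	return out;
}
```

For example, a caller passing X_DIRECT_RECLAIM | X_NOFAIL | X_ZERO ends up
with X_ZERO | X_NOWARN for the large-order attempt, while the original flags
remain in force for the fallback path.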

> > +	unsigned int large_order = ilog2(nr_remaining);
> 
> Should nr_remaining be rounded up to next-power-of-two?

No, we don't want to overallocate, we want to precisely allocate.
To use our 40KiB example from earlier, we want to satisfy the allocation
by allocating a 32KiB chunk and an 8KiB chunk, not by allocating 64KiB
and only using part of it.

(I suppose there's an argument for using alloc_pages_exact() here, but
I think it's a fairly weak one)
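
The "precisely allocate" point can be checked with a few lines of user-space
C. The sketch below greedily splits a page count into power-of-two chunks,
largest first, capped at a maximum order. floor_log2() stands in for the
kernel's ilog2(), and split_into_orders() is a hypothetical helper written
for this illustration. Unlike the patch, which leaves anything at or below
the requested order to the existing bulk/fallback path, the sketch also
emits order-0 chunks so the sum is easy to verify.

```c
#include <assert.h>
#include <stddef.h>

/* Floor of log2(n) for n > 0; user-space stand-in for the kernel's ilog2(). */
static unsigned int floor_log2(unsigned int n)
{
	unsigned int order = 0;

	assert(n > 0);
	while (n > 1) {
		n >>= 1;
		order++;
	}
	return order;
}

/*
 * Split nr_pages into power-of-two chunks, largest first, none larger than
 * 1 << max_order pages. Stores each chunk's order in orders[] (at most cap
 * entries) and returns how many chunks were produced.
 */
static size_t split_into_orders(unsigned int nr_pages, unsigned int max_order,
				unsigned int *orders, size_t cap)
{
	size_t n = 0;

	while (nr_pages && n < cap) {
		unsigned int order = floor_log2(nr_pages);

		if (order > max_order)
			order = max_order;
		orders[n++] = order;
		nr_pages -= 1U << order;
	}
	return n;
}
```

With 4KiB pages, the 40KiB example is 10 pages: split_into_orders(10, 10,
orders, 8) yields two chunks, order 3 (32KiB) and order 1 (8KiB), which sum
to exactly 40KiB. Rounding up to the next power of two would instead take a
single order-4 block (64KiB) and leave 24KiB unused.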