mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

[PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

Posted by Barry Song 1 month, 3 weeks ago

From: Barry Song <v-songbaohua@oppo.com>

In many cases, the pages passed to vmap() may include high-order
pages allocated with __GFP_COMP flags. For example, the systemheap
often allocates pages in descending order: order 8, then 4, then 0.
Currently, vmap() iterates over every page individually—even pages
inside a high-order block are handled one by one.

This patch detects high-order pages and maps them as a single
contiguous block whenever possible.

An alternative would be to implement a new API, vmap_sg(), but that
change seems to be large in scope.

When vmapping a 128MB dma-buf using the systemheap, this patch
makes system_heap_do_vmap() roughly 17× faster.

W/ patch:
[   10.404769] system_heap_do_vmap took 2494000 ns
[   12.525921] system_heap_do_vmap took 2467008 ns
[   14.517348] system_heap_do_vmap took 2471008 ns
[   16.593406] system_heap_do_vmap took 2444000 ns
[   19.501341] system_heap_do_vmap took 2489008 ns

W/o patch:
[    7.413756] system_heap_do_vmap took 42626000 ns
[    9.425610] system_heap_do_vmap took 42500992 ns
[   11.810898] system_heap_do_vmap took 42215008 ns
[   14.336790] system_heap_do_vmap took 42134992 ns
[   16.373890] system_heap_do_vmap took 42750000 ns

Cc: David Hildenbrand <david@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: John Stultz <jstultz@google.com>
Cc: Maxime Ripard <mripard@kernel.org>
Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 * diff with rfc:
 Many code refinements based on David's suggestions, thanks!
 Refine comment and changelog according to Uladzislau, thanks!
 rfc link:
 https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/

 mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 41dd01e8430c..8d577767a9e5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
 	return err;
 }
 
+static inline int get_vmap_batch_order(struct page **pages,
+		unsigned int stride, unsigned int max_steps, unsigned int idx)
+{
+	int nr_pages = 1;
+
+	/*
+	 * Currently, batching is only supported in vmap_pages_range
+	 * when page_shift == PAGE_SHIFT.
+	 */
+	if (stride != 1)
+		return 0;
+
+	nr_pages = compound_nr(pages[idx]);
+	if (nr_pages == 1)
+		return 0;
+	if (max_steps < nr_pages)
+		return 0;
+
+	if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
+		return compound_order(pages[idx]);
+	return 0;
+}
+
 /*
  * vmap_pages_range_noflush is similar to vmap_pages_range, but does not
  * flush caches.
@@ -655,23 +678,33 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 		pgprot_t prot, struct page **pages, unsigned int page_shift)
 {
 	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
+	unsigned int stride;
 
 	WARN_ON(page_shift < PAGE_SHIFT);
 
+	/*
+	 * For vmap(), users may allocate pages from high orders down to
+	 * order 0, while always using PAGE_SHIFT as the page_shift.
+	 * We first check whether the initial page is a compound page. If so,
+	 * there may be an opportunity to batch multiple pages together.
+	 */
 	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
-			page_shift == PAGE_SHIFT)
+			(page_shift == PAGE_SHIFT && !PageCompound(pages[0])))
 		return vmap_small_pages_range_noflush(addr, end, prot, pages);
 
-	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
-		int err;
+	stride = 1U << (page_shift - PAGE_SHIFT);
+	for (i = 0; i < nr; ) {
+		int err, order;
 
-		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+		order = get_vmap_batch_order(pages, stride, nr - i, i);
+		err = vmap_range_noflush(addr, addr + (1UL << (page_shift + order)),
 					page_to_phys(pages[i]), prot,
-					page_shift);
+					page_shift + order);
 		if (err)
 			return err;
 
-		addr += 1UL << page_shift;
+		addr += 1UL << (page_shift + order);
+		i += 1U << (order + page_shift - PAGE_SHIFT);
 	}
 
 	return 0;
-- 
2.39.3 (Apple Git-146)
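
To make the effect of the batching loop concrete, here is a minimal user-space sketch, not kernel code. The names struct fake_page, fill_block(), batch_order() and count_map_calls() are hypothetical, the frame numbers are made up, and contiguity is checked directly on those numbers rather than through num_pages_contiguous(). It only counts how many mapping calls the loop would issue with and without batching.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for struct page: a frame number plus the order of
 * the compound page it belongs to (0 for an ordinary page).
 */
struct fake_page {
	unsigned long pfn;
	unsigned int order;
};

/* Append one physically contiguous block of the given order to pages[]. */
size_t fill_block(struct fake_page *pages, size_t at,
		  unsigned long base_pfn, unsigned int order)
{
	size_t n = (size_t)1 << order;

	for (size_t k = 0; k < n; k++) {
		pages[at + k].pfn = base_pfn + k;
		pages[at + k].order = order;
	}
	return at + n;
}

/*
 * Mirrors the checks in get_vmap_batch_order(): batch only when the whole
 * block fits in the remaining range and its entries are contiguous.
 */
unsigned int batch_order(const struct fake_page *pages,
			 size_t remaining, size_t idx)
{
	size_t n = (size_t)1 << pages[idx].order;

	if (n == 1 || remaining < n)
		return 0;
	for (size_t k = 1; k < n; k++)
		if (pages[idx + k].pfn != pages[idx].pfn + k)
			return 0;
	return pages[idx].order;
}

/* Count the mapping calls the loop would issue for nr pages. */
size_t count_map_calls(const struct fake_page *pages, size_t nr, int batching)
{
	size_t calls = 0;

	assert(nr > 0);
	for (size_t i = 0; i < nr; calls++) {
		unsigned int order = batching ? batch_order(pages, nr - i, i) : 0;

		i += (size_t)1 << order;	/* the index advances in pages */
	}
	return calls;
}
```

For an array holding one order-8 block, two order-4 blocks and three order-0 pages (291 pages in total), count_map_calls() returns 6 with batching and 291 without.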

Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

Posted by Uladzislau Rezki 1 month, 3 weeks ago

On Mon, Dec 15, 2025 at 01:30:50PM +0800, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> In many cases, the pages passed to vmap() may include high-order
> pages allocated with __GFP_COMP flags. For example, the systemheap
> often allocates pages in descending order: order 8, then 4, then 0.
> Currently, vmap() iterates over every page individually—even pages
> inside a high-order block are handled one by one.
> 
> This patch detects high-order pages and maps them as a single
> contiguous block whenever possible.
> 
> An alternative would be to implement a new API, vmap_sg(), but that
> change seems to be large in scope.
> 
> When vmapping a 128MB dma-buf using the systemheap, this patch
> makes system_heap_do_vmap() roughly 17× faster.
> 
> W/ patch:
> [   10.404769] system_heap_do_vmap took 2494000 ns
> [   12.525921] system_heap_do_vmap took 2467008 ns
> [   14.517348] system_heap_do_vmap took 2471008 ns
> [   16.593406] system_heap_do_vmap took 2444000 ns
> [   19.501341] system_heap_do_vmap took 2489008 ns
> 
> W/o patch:
> [    7.413756] system_heap_do_vmap took 42626000 ns
> [    9.425610] system_heap_do_vmap took 42500992 ns
> [   11.810898] system_heap_do_vmap took 42215008 ns
> [   14.336790] system_heap_do_vmap took 42134992 ns
> [   16.373890] system_heap_do_vmap took 42750000 ns
> 
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Uladzislau Rezki <urezki@gmail.com>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: John Stultz <jstultz@google.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  * diff with rfc:
>  Many code refinements based on David's suggestions, thanks!
>  Refine comment and changelog according to Uladzislau, thanks!
>  rfc link:
>  https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/
> 
>  mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 39 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 41dd01e8430c..8d577767a9e5 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>  	return err;
>  }
>  
> +static inline int get_vmap_batch_order(struct page **pages,
> +		unsigned int stride, unsigned int max_steps, unsigned int idx)
> +{
> +	int nr_pages = 1;
> +
> +	/*
> +	 * Currently, batching is only supported in vmap_pages_range
> +	 * when page_shift == PAGE_SHIFT.
> +	 */
> +	if (stride != 1)
> +		return 0;
> +
> +	nr_pages = compound_nr(pages[idx]);
> +	if (nr_pages == 1)
> +		return 0;
> +	if (max_steps < nr_pages)
> +		return 0;
> +
> +	if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
> +		return compound_order(pages[idx]);
> +	return 0;
> +}
> +
Can we instead look at it this way: we may have a contiguous
set of pages, so let's find out? I mean, not sticking just to
compound pages.

--
Uladzislau Rezki

Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

Posted by Barry Song 1 month, 3 weeks ago

[...]
> >
> > +static inline int get_vmap_batch_order(struct page **pages,
> > +             unsigned int stride, unsigned int max_steps, unsigned int idx)
> > +{
> > +     int nr_pages = 1;
> > +
> > +     /*
> > +      * Currently, batching is only supported in vmap_pages_range
> > +      * when page_shift == PAGE_SHIFT.
> > +      */
> > +     if (stride != 1)
> > +             return 0;
> > +
> > +     nr_pages = compound_nr(pages[idx]);
> > +     if (nr_pages == 1)
> > +             return 0;
> > +     if (max_steps < nr_pages)
> > +             return 0;
> > +
> > +     if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
> > +             return compound_order(pages[idx]);
> > +     return 0;
> > +}
> > +
> Can we instead look at it this way: we may have a contiguous
> set of pages, so let's find out? I mean, not sticking just to
> compound pages.

We use PageCompound(pages[0]) and compound_nr() as quick
filters to skip checking the contiguous count, and this is
now the intended use case. Always checking contiguity might
cause a slight regression, I guess.

BTW, do we have a strong use case where GFP_COMP or folio is
not used, yet the pages are physically contiguous?

Thanks
Barry

Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

Posted by David Hildenbrand (Red Hat) 3 weeks, 3 days ago

On 12/18/25 21:05, Barry Song wrote:
> [...]
>>>
>>> +static inline int get_vmap_batch_order(struct page **pages,
>>> +             unsigned int stride, unsigned int max_steps, unsigned int idx)
>>> +{
>>> +     int nr_pages = 1;
>>> +
>>> +     /*
>>> +      * Currently, batching is only supported in vmap_pages_range
>>> +      * when page_shift == PAGE_SHIFT.
>>> +      */
>>> +     if (stride != 1)
>>> +             return 0;
>>> +
>>> +     nr_pages = compound_nr(pages[idx]);
>>> +     if (nr_pages == 1)
>>> +             return 0;
>>> +     if (max_steps < nr_pages)
>>> +             return 0;
>>> +
>>> +     if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
>>> +             return compound_order(pages[idx]);
>>> +     return 0;
>>> +}
>>> +
>> Can we instead look at it this way: we may have a contiguous
>> set of pages, so let's find out? I mean, not sticking just to
>> compound pages.
> 
> We use PageCompound(pages[0]) and compound_nr() as quick
> filters to skip checking the contiguous count, and this is
> now the intended use case. Always checking contiguity might
> cause a slight regression, I guess.
> 
> BTW, do we have a strong use case where GFP_COMP or folio is
> not used, yet the pages are physically contiguous?

It usually happens by accident :)

E.g., allocate 2 pages and because we had to split an order-1 page into 
two order-0 pages, we get both of them.

Using num_pages_contiguous() only might indeed be nicer, but then we 
have to add some handling for getting aligned ranges (start and size 
aligned to order) ... so not sure if that is worth it.

-- 
Cheers

David
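
The alignment handling mentioned above could look roughly like the following user-space sketch. It works on plain frame numbers; contiguous_run() and aligned_batch_order() are hypothetical helpers, not existing kernel functions, and the sketch ignores virtual-address alignment, which a real implementation would likely also have to consider.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the physically contiguous run starting at pfns[0]. */
size_t contiguous_run(const unsigned long *pfns, size_t nr)
{
	size_t n = 1;

	assert(nr > 0);
	while (n < nr && pfns[n] == pfns[0] + n)
		n++;
	return n;
}

/*
 * Largest order (capped at max_order) such that the run holds 2^order
 * pages and the first frame number is aligned to 2^order pages.
 */
unsigned int aligned_batch_order(const unsigned long *pfns, size_t nr,
				 unsigned int max_order)
{
	size_t run = contiguous_run(pfns, nr);
	unsigned int order = 0;

	while (order < max_order &&
	       ((size_t)1 << (order + 1)) <= run &&
	       (pfns[0] & ((1UL << (order + 1)) - 1)) == 0)
		order++;
	return order;
}
```

With this, two order-0 pages that happen to be adjacent and start on an even frame number yield order 1, while a run that starts on an odd frame number yields order 0 no matter how long it is.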

Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

Posted by David Hildenbrand (Red Hat) 1 month, 3 weeks ago

On 12/15/25 06:30, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> In many cases, the pages passed to vmap() may include high-order
> pages allocated with __GFP_COMP flags. For example, the systemheap
> often allocates pages in descending order: order 8, then 4, then 0.
> Currently, vmap() iterates over every page individually—even pages
> inside a high-order block are handled one by one.
> 
> This patch detects high-order pages and maps them as a single
> contiguous block whenever possible.
> 
> An alternative would be to implement a new API, vmap_sg(), but that
> change seems to be large in scope.
> 
> When vmapping a 128MB dma-buf using the systemheap, this patch
> makes system_heap_do_vmap() roughly 17× faster.
> 
> W/ patch:
> [   10.404769] system_heap_do_vmap took 2494000 ns
> [   12.525921] system_heap_do_vmap took 2467008 ns
> [   14.517348] system_heap_do_vmap took 2471008 ns
> [   16.593406] system_heap_do_vmap took 2444000 ns
> [   19.501341] system_heap_do_vmap took 2489008 ns
> 
> W/o patch:
> [    7.413756] system_heap_do_vmap took 42626000 ns
> [    9.425610] system_heap_do_vmap took 42500992 ns
> [   11.810898] system_heap_do_vmap took 42215008 ns
> [   14.336790] system_heap_do_vmap took 42134992 ns
> [   16.373890] system_heap_do_vmap took 42750000 ns
> 

That's quite a speedup.
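
As a rough check of the quoted figure, the five runs in each group can be averaged and divided. The sketch below is plain user-space C; mean_ns() and speedup() are hypothetical helper names, and the sample values are copied from the log lines above.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n timing samples, in nanoseconds. */
double mean_ns(const unsigned long *samples, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += (double)samples[i];
	return sum / (double)n;
}

/* The five runs quoted in the changelog. */
const unsigned long with_patch_ns[5] = {
	2494000, 2467008, 2471008, 2444000, 2489008,
};
const unsigned long without_patch_ns[5] = {
	42626000, 42500992, 42215008, 42134992, 42750000,
};

/* Ratio of the mean unpatched time to the mean patched time. */
double speedup(void)
{
	return mean_ns(without_patch_ns, 5) / mean_ns(with_patch_ns, 5);
}
```

The means come out at about 42.4 ms without the patch and about 2.47 ms with it, a ratio of roughly 17.2, which agrees with the "roughly 17×" in the changelog.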

> Cc: David Hildenbrand <david@kernel.org>
> Cc: Uladzislau Rezki <urezki@gmail.com>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: John Stultz <jstultz@google.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>   * diff with rfc:
>   Many code refinements based on David's suggestions, thanks!
>   Refine comment and changelog according to Uladzislau, thanks!
>   rfc link:
>   https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/
> 
>   mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 39 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 41dd01e8430c..8d577767a9e5 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>   	return err;
>   }
>   
> +static inline int get_vmap_batch_order(struct page **pages,
> +		unsigned int stride, unsigned int max_steps, unsigned int idx)
> +{
> +	int nr_pages = 1;

unsigned int, maybe

Why are you initializing nr_pages when you overwrite it below?

> +
> +	/*
> +	 * Currently, batching is only supported in vmap_pages_range
> +	 * when page_shift == PAGE_SHIFT.

I don't know the code, so realizing how we go from page_shift to stride
took me a second. Maybe only talk about stride here?

OTOH, is "stride" really the right terminology?

we calculate it as

	stride = 1U << (page_shift - PAGE_SHIFT);

page_shift - PAGE_SHIFT should give us an "order". So is this a 
"granularity" in nr_pages?

Again, I don't know this code, so sorry for the question.
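
A small user-space sketch of the relationship being asked about: the difference of the two shifts is an order, and the patch's "stride" is the number of base pages per mapping unit. SKETCH_PAGE_SHIFT, shift_to_order() and shift_to_stride() are hypothetical names, and 4 KiB base pages are an assumption.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/* Order of one mapping unit relative to the base page size. */
unsigned int shift_to_order(unsigned int page_shift)
{
	assert(page_shift >= SKETCH_PAGE_SHIFT);
	return page_shift - SKETCH_PAGE_SHIFT;
}

/* Number of base pages covered by one mapping unit (the "stride"). */
unsigned int shift_to_stride(unsigned int page_shift)
{
	return 1U << shift_to_order(page_shift);
}
```

Under that assumption, a page_shift of 12 gives order 0 and a stride of 1, while a page_shift of 21 (2 MiB units) gives order 9 and a stride of 512 pages.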

> +	 */
> +	if (stride != 1)
> +		return 0;
> +
> +	nr_pages = compound_nr(pages[idx]);
> +	if (nr_pages == 1)
> +		return 0;
> +	if (max_steps < nr_pages)
> +		return 0;

Might combine these simple checks

if (nr_pages == 1 || max_steps < nr_pages)
	return 0;


-- 
Cheers

David

Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

Posted by Uladzislau Rezki 1 month, 3 weeks ago

On Thu, Dec 18, 2025 at 02:01:56PM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 06:30, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> > 
> > In many cases, the pages passed to vmap() may include high-order
> > pages allocated with __GFP_COMP flags. For example, the systemheap
> > often allocates pages in descending order: order 8, then 4, then 0.
> > Currently, vmap() iterates over every page individually—even pages
> > inside a high-order block are handled one by one.
> > 
> > This patch detects high-order pages and maps them as a single
> > contiguous block whenever possible.
> > 
> > An alternative would be to implement a new API, vmap_sg(), but that
> > change seems to be large in scope.
> > 
> > When vmapping a 128MB dma-buf using the systemheap, this patch
> > makes system_heap_do_vmap() roughly 17× faster.
> > 
> > W/ patch:
> > [   10.404769] system_heap_do_vmap took 2494000 ns
> > [   12.525921] system_heap_do_vmap took 2467008 ns
> > [   14.517348] system_heap_do_vmap took 2471008 ns
> > [   16.593406] system_heap_do_vmap took 2444000 ns
> > [   19.501341] system_heap_do_vmap took 2489008 ns
> > 
> > W/o patch:
> > [    7.413756] system_heap_do_vmap took 42626000 ns
> > [    9.425610] system_heap_do_vmap took 42500992 ns
> > [   11.810898] system_heap_do_vmap took 42215008 ns
> > [   14.336790] system_heap_do_vmap took 42134992 ns
> > [   16.373890] system_heap_do_vmap took 42750000 ns
> > 
> 
> That's quite a speedup.
> 
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Uladzislau Rezki <urezki@gmail.com>
> > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > Cc: John Stultz <jstultz@google.com>
> > Cc: Maxime Ripard <mripard@kernel.org>
> > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >   * diff with rfc:
> >   Many code refinements based on David's suggestions, thanks!
> >   Refine comment and changelog according to Uladzislau, thanks!
> >   rfc link:
> >   https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/
> > 
> >   mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------
> >   1 file changed, 39 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 41dd01e8430c..8d577767a9e5 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> >   	return err;
> >   }
> > +static inline int get_vmap_batch_order(struct page **pages,
> > +		unsigned int stride, unsigned int max_steps, unsigned int idx)
> > +{
> > +	int nr_pages = 1;
> 
> unsigned int, maybe
> 
> Why are you initializing nr_pages when you overwrite it below?
> 
> > +
> > +	/*
> > +	 * Currently, batching is only supported in vmap_pages_range
> > +	 * when page_shift == PAGE_SHIFT.
> 
> > I don't know the code, so realizing how we go from page_shift to stride took
> > me a second. Maybe only talk about stride here?
> 
> OTOH, is "stride" really the right terminology?
> 
> we calculate it as
> 
> 	stride = 1U << (page_shift - PAGE_SHIFT);
> 
> page_shift - PAGE_SHIFT should give us an "order". So is this a
> "granularity" in nr_pages?
> 
> Again, I don't know this code, so sorry for the question.
> 
To me "stride" also sounds unclear.

--
Uladzislau Rezki

Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

Posted by Barry Song 1 month, 3 weeks ago

On Thu, Dec 18, 2025 at 9:55 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Thu, Dec 18, 2025 at 02:01:56PM +0100, David Hildenbrand (Red Hat) wrote:
> > On 12/15/25 06:30, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > In many cases, the pages passed to vmap() may include high-order
> > > pages allocated with __GFP_COMP flags. For example, the systemheap
> > > often allocates pages in descending order: order 8, then 4, then 0.
> > > Currently, vmap() iterates over every page individually—even pages
> > > inside a high-order block are handled one by one.
> > >
> > > This patch detects high-order pages and maps them as a single
> > > contiguous block whenever possible.
> > >
> > > An alternative would be to implement a new API, vmap_sg(), but that
> > > change seems to be large in scope.
> > >
> > > When vmapping a 128MB dma-buf using the systemheap, this patch
> > > makes system_heap_do_vmap() roughly 17× faster.
> > >
> > > W/ patch:
> > > [   10.404769] system_heap_do_vmap took 2494000 ns
> > > [   12.525921] system_heap_do_vmap took 2467008 ns
> > > [   14.517348] system_heap_do_vmap took 2471008 ns
> > > [   16.593406] system_heap_do_vmap took 2444000 ns
> > > [   19.501341] system_heap_do_vmap took 2489008 ns
> > >
> > > W/o patch:
> > > [    7.413756] system_heap_do_vmap took 42626000 ns
> > > [    9.425610] system_heap_do_vmap took 42500992 ns
> > > [   11.810898] system_heap_do_vmap took 42215008 ns
> > > [   14.336790] system_heap_do_vmap took 42134992 ns
> > > [   16.373890] system_heap_do_vmap took 42750000 ns
> > >
> >
> > That's quite a speedup.
> >
> > > Cc: David Hildenbrand <david@kernel.org>
> > > Cc: Uladzislau Rezki <urezki@gmail.com>
> > > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > > Cc: John Stultz <jstultz@google.com>
> > > Cc: Maxime Ripard <mripard@kernel.org>
> > > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >   * diff with rfc:
> > >   Many code refinements based on David's suggestions, thanks!
> > >   Refine comment and changelog according to Uladzislau, thanks!
> > >   rfc link:
> > >   https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/
> > >
> > >   mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------
> > >   1 file changed, 39 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 41dd01e8430c..8d577767a9e5 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> > >     return err;
> > >   }
> > > +static inline int get_vmap_batch_order(struct page **pages,
> > > +           unsigned int stride, unsigned int max_steps, unsigned int idx)
> > > +{
> > > +   int nr_pages = 1;
> >
> > unsigned int, maybe

Right

> >
> > Why are you initializing nr_pages when you overwrite it below?

Right, initializing nr_pages can be dropped.

> >
> > > +
> > > +   /*
> > > +    * Currently, batching is only supported in vmap_pages_range
> > > +    * when page_shift == PAGE_SHIFT.
> >
> > > I don't know the code, so realizing how we go from page_shift to stride took
> > > me a second. Maybe only talk about stride here?
> >
> > OTOH, is "stride" really the right terminology?
> >
> > we calculate it as
> >
> >       stride = 1U << (page_shift - PAGE_SHIFT);
> >
> > page_shift - PAGE_SHIFT should give us an "order". So is this a
> > "granularity" in nr_pages?

This is the case where vmalloc() may realize that it has
high-order pages and therefore calls
vmap_pages_range_noflush() with a page_shift larger than
PAGE_SHIFT. For vmap(), we take a pages array, so
page_shift is always PAGE_SHIFT.

> >
> > Again, I don't know this code, so sorry for the question.
> >
> To me "stride" also sounds unclear.

Thanks, David and Uladzislau. On second thought, this stride may be
redundant, and it should be possible to drop it entirely. This results
in the code below:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 41dd01e8430c..3962bdcb43e5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -642,6 +642,20 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
 	return err;
 }
 
+static inline int get_vmap_batch_order(struct page **pages,
+		unsigned int max_steps, unsigned int idx)
+{
+	unsigned int nr_pages = compound_nr(pages[idx]);
+
+	if (nr_pages == 1 || max_steps < nr_pages)
+		return 0;
+
+	if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
+		return compound_order(pages[idx]);
+	return 0;
+}
+
 /*
  * vmap_pages_range_noflush is similar to vmap_pages_range, but does not
  * flush caches.
@@ -658,20 +672,35 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 
 	WARN_ON(page_shift < PAGE_SHIFT);
 
+	/*
+	 * For vmap(), users may allocate pages from high orders down to
+	 * order 0, while always using PAGE_SHIFT as the page_shift.
+	 * We first check whether the initial page is a compound page. If so,
+	 * there may be an opportunity to batch multiple pages together.
+	 */
 	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
-			page_shift == PAGE_SHIFT)
+			(page_shift == PAGE_SHIFT && !PageCompound(pages[0])))
 		return vmap_small_pages_range_noflush(addr, end, prot, pages);
 
-	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
+	for (i = 0; i < nr; ) {
+		unsigned int shift = page_shift;
 		int err;
 
-		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+		/*
+		 * For vmap(), page_shift is always PAGE_SHIFT, even when
+		 * the pages are physically contiguous. Such pages can
+		 * still be mapped in a batch.
+		 */
+		if (page_shift == PAGE_SHIFT)
+			shift += get_vmap_batch_order(pages, nr - i, i);
+		err = vmap_range_noflush(addr, addr + (1UL << shift),
 					page_to_phys(pages[i]), prot,
-					page_shift);
+					shift);
 		if (err)
 			return err;
 
-		addr += 1UL << page_shift;
+		addr += 1UL << shift;
+		i += 1U << (shift - PAGE_SHIFT);
 	}
 
 	return 0;

Does this look clearer?

Thanks
Barry

Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

Posted by Uladzislau Rezki 1 month, 2 weeks ago

On Fri, Dec 19, 2025 at 05:24:36AM +0800, Barry Song wrote:
> On Thu, Dec 18, 2025 at 9:55 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Thu, Dec 18, 2025 at 02:01:56PM +0100, David Hildenbrand (Red Hat) wrote:
> > > On 12/15/25 06:30, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > In many cases, the pages passed to vmap() may include high-order
> > > > pages allocated with __GFP_COMP flags. For example, the systemheap
> > > > often allocates pages in descending order: order 8, then 4, then 0.
> > > > Currently, vmap() iterates over every page individually—even pages
> > > > inside a high-order block are handled one by one.
> > > >
> > > > This patch detects high-order pages and maps them as a single
> > > > contiguous block whenever possible.
> > > >
> > > > An alternative would be to implement a new API, vmap_sg(), but that
> > > > change seems to be large in scope.
> > > >
> > > > When vmapping a 128MB dma-buf using the systemheap, this patch
> > > > makes system_heap_do_vmap() roughly 17× faster.
> > > >
> > > > W/ patch:
> > > > [   10.404769] system_heap_do_vmap took 2494000 ns
> > > > [   12.525921] system_heap_do_vmap took 2467008 ns
> > > > [   14.517348] system_heap_do_vmap took 2471008 ns
> > > > [   16.593406] system_heap_do_vmap took 2444000 ns
> > > > [   19.501341] system_heap_do_vmap took 2489008 ns
> > > >
> > > > W/o patch:
> > > > [    7.413756] system_heap_do_vmap took 42626000 ns
> > > > [    9.425610] system_heap_do_vmap took 42500992 ns
> > > > [   11.810898] system_heap_do_vmap took 42215008 ns
> > > > [   14.336790] system_heap_do_vmap took 42134992 ns
> > > > [   16.373890] system_heap_do_vmap took 42750000 ns
> > > >
> > >
> > > That's quite a speedup.
> > >
> > > > Cc: David Hildenbrand <david@kernel.org>
> > > > Cc: Uladzislau Rezki <urezki@gmail.com>
> > > > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > > > Cc: John Stultz <jstultz@google.com>
> > > > Cc: Maxime Ripard <mripard@kernel.org>
> > > > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
> > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > ---
> > > >   * diff with rfc:
> > > >   Many code refinements based on David's suggestions, thanks!
> > > >   Refine comment and changelog according to Uladzislau, thanks!
> > > >   rfc link:
> > > >   https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/
> > > >
> > > >   mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------
> > > >   1 file changed, 39 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 41dd01e8430c..8d577767a9e5 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> > > >     return err;
> > > >   }
> > > > +static inline int get_vmap_batch_order(struct page **pages,
> > > > +           unsigned int stride, unsigned int max_steps, unsigned int idx)
> > > > +{
> > > > +   int nr_pages = 1;
> > >
> > > unsigned int, maybe
> 
> Right
> 
> > >
> > > Why are you initializing nr_pages when you overwrite it below?
> 
> Right, initializing nr_pages can be dropped.
> 
> > >
> > > > +
> > > > +   /*
> > > > +    * Currently, batching is only supported in vmap_pages_range
> > > > +    * when page_shift == PAGE_SHIFT.
> > >
> > > > I don't know the code, so realizing how we go from page_shift to stride took
> > > > me a second. Maybe only talk about stride here?
> > >
> > > OTOH, is "stride" really the right terminology?
> > >
> > > we calculate it as
> > >
> > >       stride = 1U << (page_shift - PAGE_SHIFT);
> > >
> > > page_shift - PAGE_SHIFT should give us an "order". So is this a
> > > "granularity" in nr_pages?
> 
> This is the case where vmalloc() may realize that it has
> high-order pages and therefore calls
> vmap_pages_range_noflush() with a page_shift larger than
> PAGE_SHIFT. For vmap(), we take a pages array, so
> page_shift is always PAGE_SHIFT.
> 
> > >
> > > Again, I don't know this code, so sorry for the question.
> > >
> > To me "stride" also sounds unclear.
> 
> Thanks, David and Uladzislau. On second thought, this stride may be
> redundant, and it should be possible to drop it entirely. This results
> in the code below:
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 41dd01e8430c..3962bdcb43e5 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -642,6 +642,20 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>  	return err;
>  }
>  
> +static inline int get_vmap_batch_order(struct page **pages,
> +		unsigned int max_steps, unsigned int idx)
> +{
> +	unsigned int nr_pages	 = compound_nr(pages[idx]);
> +
> +	if (nr_pages == 1 || max_steps < nr_pages)
> +		return 0;
> +
> +	if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
> +		return compound_order(pages[idx]);
> +	return 0;
> +}
> +
>



>  /*
>   * vmap_pages_range_noflush is similar to vmap_pages_range, but does not
>   * flush caches.
> @@ -658,20 +672,35 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>  
>  	WARN_ON(page_shift < PAGE_SHIFT);
>  
> +	/*
> +	 * For vmap(), users may allocate pages from high orders down to
> +	 * order 0, while always using PAGE_SHIFT as the page_shift.
> +	 * We first check whether the initial page is a compound page. If so,
> +	 * there may be an opportunity to batch multiple pages together.
> +	 */
>  	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> -			page_shift == PAGE_SHIFT)
> +			(page_shift == PAGE_SHIFT && !PageCompound(pages[0])))
>  		return vmap_small_pages_range_noflush(addr, end, prot, pages);
Hm.. If the first few pages are order-0 and the rest are compound,
then we do nothing.

>  
> -	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> +	for (i = 0; i < nr; ) {
> +		unsigned int shift = page_shift;
>  		int err;
>  
> -		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> +		/*
> +		 * For vmap() cases, page_shift is always PAGE_SHIFT, even
> +		 * if the pages are physically contiguous, they may still
> +		 * be mapped in a batch.
> +		 */
> +		if (page_shift == PAGE_SHIFT)
> +			shift += get_vmap_batch_order(pages, nr - i, i);
> +		err = vmap_range_noflush(addr, addr + (1UL << shift),
>  					page_to_phys(pages[i]), prot,
> -					page_shift);
> +					shift);
>  		if (err)
>  			return err;
>  
> -		addr += 1UL << page_shift;
> +		addr += 1UL << shift;
> +		i += 1U << (shift - PAGE_SHIFT);
>  	}
>  
>  	return 0;
> 
> Does this look clearer?
> 
The concern is that we mix it with the huge page mapping path. If we want to
batch v-mapping for the page_shift == PAGE_SHIFT case, where the "pages" array
may contain compound pages (folios) (a corner case to me), I think we should
split it.

--
Uladzislau Rezki

Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

Posted by Barry Song 1 month, 2 weeks ago

> >  /*
> >   * vmap_pages_range_noflush is similar to vmap_pages_range, but does not
> >   * flush caches.
> > @@ -658,20 +672,35 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> >
> >       WARN_ON(page_shift < PAGE_SHIFT);
> >
> > +     /*
> > +      * For vmap(), users may allocate pages from high orders down to
> > +      * order 0, while always using PAGE_SHIFT as the page_shift.
> > +      * We first check whether the initial page is a compound page. If so,
> > +      * there may be an opportunity to batch multiple pages together.
> > +      */
> >       if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> > -                     page_shift == PAGE_SHIFT)
> > +                     (page_shift == PAGE_SHIFT && !PageCompound(pages[0])))
> >               return vmap_small_pages_range_noflush(addr, end, prot, pages);
> Hm.. If the first few pages are order-0 and the rest are compound,
> then we do nothing.

Now the dma-buf is allocated in descending order. If page0
is not huge, page1 will not be either. However, I agree that
we may extend support for this case.
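
The layout dependence discussed here can be shown with a tiny user-space model of the gate, assuming CONFIG_HAVE_ARCH_HUGE_VMALLOC is enabled and page_shift == PAGE_SHIFT. takes_batching_path() and batchable_pages() are hypothetical names, and orders[i] stands for the compound order of the block that page i belongs to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Models the gate in the proposed hunk: the batching loop is entered only
 * if the first page of the array is part of a compound page.
 */
bool takes_batching_path(const unsigned int *orders, size_t nr)
{
	assert(nr > 0);
	return orders[0] > 0;
}

/* Number of pages that sit in high-order blocks and could be batched. */
size_t batchable_pages(const unsigned int *orders, size_t nr)
{
	size_t n = 0;

	for (size_t i = 0; i < nr; i++)
		if (orders[i] > 0)
			n++;
	return n;
}
```

A descending layout such as {2, 2, 2, 2, 0, 0} passes the gate, whereas {0, 0, 2, 2, 2, 2} does not, even though four of its six pages belong to a high-order block.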

>
> >
> > -     for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> > +     for (i = 0; i < nr; ) {
> > +             unsigned int shift = page_shift;
> >               int err;
> >
> > -             err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> > +             /*
> > +              * For vmap() cases, page_shift is always PAGE_SHIFT, even
> > +              * if the pages are physically contiguous, they may still
> > +              * be mapped in a batch.
> > +              */
> > +             if (page_shift == PAGE_SHIFT)
> > +                     shift += get_vmap_batch_order(pages, nr - i, i);
> > +             err = vmap_range_noflush(addr, addr + (1UL << shift),
> >                                       page_to_phys(pages[i]), prot,
> > -                                     page_shift);
> > +                                     shift);
> >               if (err)
> >                       return err;
> >
> > -             addr += 1UL << page_shift;
> > +             addr += 1UL  << shift;
> > +             i += 1U << shift;
> >       }
> >
> >       return 0;
> >
> > Does this look clearer?
> >
> The concern is that we mix it with the huge page mapping path. If we want to
> batch v-mapping for the page_shift == PAGE_SHIFT case, where the "pages" array
> may contain compound pages (folios), a corner case to me, I think we should
> split it.

I agree this might not be common when the vmap buffer is only
used by the CPU. However, for GPUs, NPUs, and similar devices,
benefiting from larger mappings may be quite common.

Does the code below, which moves batched mapping to vmap(),
address both of your concerns?

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ecbac900c35f..782f2eac8a63 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3501,6 +3501,20 @@ void vunmap(const void *addr)
 }
 EXPORT_SYMBOL(vunmap);
 
+static inline int get_vmap_batch_order(struct page **pages,
+		unsigned int max_steps, unsigned int idx)
+{
+	unsigned int nr_pages;
+
+	nr_pages = compound_nr(pages[idx]);
+	if (nr_pages == 1 || max_steps < nr_pages)
+		return 0;
+
+	if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
+		return compound_order(pages[idx]);
+	return 0;
+}
+
 /**
  * vmap - map an array of pages into virtually contiguous space
  * @pages: array of page pointers
@@ -3544,10 +3558,21 @@ void *vmap(struct page **pages, unsigned int count,
 		return NULL;
 
 	addr = (unsigned long)area->addr;
-	if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
-				pages, PAGE_SHIFT) < 0) {
-		vunmap(area->addr);
-		return NULL;
+	for (unsigned int i = 0; i < count; ) {
+		unsigned int shift = PAGE_SHIFT;
+		int err;
+
+		shift += get_vmap_batch_order(pages, count - i, i);
+		err = vmap_range_noflush(addr, addr + (1UL << shift),
+				page_to_phys(pages[i]), pgprot_nx(prot),
+				shift);
+		if (err) {
+			vunmap(area->addr);
+			return NULL;
+		}
+
+		addr += 1UL  << shift;
+		i += 1U << shift;
 	}
 
 	if (flags & VM_MAP_PUT_PAGES) {
-- 
2.48.1

Thanks
Barry

Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible

Posted by Uladzislau Rezki 1 month ago

On Wed, Dec 24, 2025 at 10:23:34AM +1300, Barry Song wrote:
> > >  /*
> > >   * vmap_pages_range_noflush is similar to vmap_pages_range, but does not
> > >   * flush caches.
> > > @@ -658,20 +672,35 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> > >
> > >       WARN_ON(page_shift < PAGE_SHIFT);
> > >
> > > +     /*
> > > +      * For vmap(), users may allocate pages from high orders down to
> > > +      * order 0, while always using PAGE_SHIFT as the page_shift.
> > > +      * We first check whether the initial page is a compound page. If so,
> > > +      * there may be an opportunity to batch multiple pages together.
> > > +      */
> > >       if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> > > -                     page_shift == PAGE_SHIFT)
> > > +                     (page_shift == PAGE_SHIFT && !PageCompound(pages[0])))
> > >               return vmap_small_pages_range_noflush(addr, end, prot, pages);
> > Hm.. If first few pages are order-0 and the rest are compound
> > then we do nothing.
> 
> > Currently the dma-buf system heap allocates in descending order, so
> > if page0 is not a compound page, page1 will not be either. However,
> > I agree that we could extend support to this case.
> 
> >
> > >
> > > -     for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> > > +     for (i = 0; i < nr; ) {
> > > +             unsigned int shift = page_shift;
> > >               int err;
> > >
> > > -             err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> > > +             /*
> > > +              * For vmap() cases, page_shift is always PAGE_SHIFT, even
> > > +              * if the pages are physically contiguous, they may still
> > > +              * be mapped in a batch.
> > > +              */
> > > +             if (page_shift == PAGE_SHIFT)
> > > +                     shift += get_vmap_batch_order(pages, nr - i, i);
> > > +             err = vmap_range_noflush(addr, addr + (1UL << shift),
> > >                                       page_to_phys(pages[i]), prot,
> > > -                                     page_shift);
> > > +                                     shift);
> > >               if (err)
> > >                       return err;
> > >
> > > -             addr += 1UL << page_shift;
> > > +             addr += 1UL  << shift;
> > > +             i += 1U << shift;
> > >       }
> > >
> > >       return 0;
> > >
> > > Does this look clearer?
> > >
I think so, at least in terms of where the batching is placed:

<snip>
[    2.959030] Oops: Oops: 0000 [#66] SMP NOPTI
[    2.960004] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.18.0+ #220 PREEMPT(none)
[    2.961781] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[    2.963870] BUG: unable to handle page fault for address: ffffffff3fd68118
[    2.965383] #PF: supervisor read access in kernel mode
[    2.966532] #PF: error_code(0x0000) - not-present page
[    2.967682] BAD
<snip>

but it is broken for sure:

"i += 1U << shift" is wrong: "i" is an index into the page array.
For example, for an order-0 page (shift == PAGE_SHIFT, i.e. 12 with 4K
pages) you jump 4096 indices ahead instead of 1.

Should be: i += 1U << (shift - PAGE_SHIFT)
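
To illustrate the stride problem outside the kernel, here is a minimal
user-space sketch. It is not kernel code: PAGE_SHIFT is assumed to be 12,
and batch_order() and walk() are hypothetical stand-ins for
get_vmap_batch_order() and the loop in vmap(), with the per-page orders
supplied in a plain array.

```c
#define PAGE_SHIFT 12	/* assumption: 4K pages */

/*
 * Hypothetical stand-in for get_vmap_batch_order(): returns the order
 * of the block that starts at index idx.
 */
static unsigned int batch_order(const unsigned int *orders, unsigned int idx)
{
	return orders[idx];
}

/*
 * Walk "count" page-array entries the way the proposed loop does.
 * Returns the number of mapping calls made, or -1 if the index does
 * not land exactly on "count" (i.e. the stride overshot the array).
 */
static int walk(const unsigned int *orders, unsigned int count, int fixed)
{
	unsigned int i = 0;
	int calls = 0;

	while (i < count) {
		unsigned int shift = PAGE_SHIFT + batch_order(orders, i);

		calls++;	/* one vmap_range_noflush()-style call */
		if (fixed)
			i += 1U << (shift - PAGE_SHIFT);	/* pages */
		else
			i += 1U << shift;			/* bytes */
	}
	return i == count ? calls : -1;
}
```

With one order-2 block followed by two order-0 pages (6 entries), the
corrected stride makes 3 calls and stops exactly at the end of the array,
while the byte-sized stride jumps 16384 entries after the first call.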

vmap_page_range() does the flushing and has KMSAN instrumentation inside.
We should follow the same semantics. It also uses ioremap_max_page_shift as
the maximum page shift policy.

--
Uladzislau Rezki