[v2] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory

[PATCH v2 0/7] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory

Posted by Wen Jiang 4 weeks, 1 day ago

This patchset accelerates ioremap, vmalloc, and vmap when the memory
is physically fully or partially contiguous. Two techniques are used:

1. Avoid page table rewalk when setting PTEs/PMDs for multiple memory
   segments
2. Use batched mappings wherever possible in both vmalloc and ARM64
   layers

Besides accelerating the mapping path, this also enables large
mappings (PMD and cont-PTE) for vmap, which are currently not
supported.
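
As a rough illustration of the first technique, the following user-space
toy model compares mapping that restarts from the root of the page table
for every page with mapping that descends once per leaf table and then
fills consecutive entries. It is not kernel code: the two-level layout,
the names (toy_pgtable, toy_walk, toy_map_rewalk, toy_map_single_walk)
and the sizes are assumptions made only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTRS_PER_TABLE 512u  /* entries per leaf table (assumption) */
#define TOY_NR_TABLES      8u    /* leaf tables reachable from the root */

struct toy_pgtable {
	uint64_t leaf[TOY_NR_TABLES][TOY_PTRS_PER_TABLE];
	unsigned long root_walks;    /* number of descents from the root */
};

/* Descend from the root to the leaf table that covers virtual page vpn. */
static uint64_t *toy_walk(struct toy_pgtable *pt, size_t vpn)
{
	assert(vpn / TOY_PTRS_PER_TABLE < TOY_NR_TABLES);
	pt->root_walks++;
	return pt->leaf[vpn / TOY_PTRS_PER_TABLE];
}

/* Rewalk variant: every page starts again at the root. */
static void toy_map_rewalk(struct toy_pgtable *pt, size_t vpn,
			   uint64_t pfn, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		uint64_t *table = toy_walk(pt, vpn + i);

		table[(vpn + i) % TOY_PTRS_PER_TABLE] = pfn + i;
	}
}

/* Single-walk variant: one descent per leaf table, then fill entries. */
static void toy_map_single_walk(struct toy_pgtable *pt, size_t vpn,
				uint64_t pfn, size_t nr)
{
	size_t i = 0;

	while (i < nr) {
		uint64_t *table = toy_walk(pt, vpn + i);
		size_t idx = (vpn + i) % TOY_PTRS_PER_TABLE;

		/* Stay in this leaf table until it or the range ends. */
		do {
			table[idx++] = pfn + i;
			i++;
		} while (i < nr && idx < TOY_PTRS_PER_TABLE);
	}
}
```

Both variants leave identical table contents; only the number of
descents recorded in root_walks differs, which is the cost the series
aims to reduce.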

Patches 1-2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
CONT-PTE regions instead of just one.

Patch 3 extracts a common helper vmap_set_ptes() that consolidates PTE
mapping logic between the ioremap and vmalloc/vmap paths, handling both
CONT_PTE and regular PTE mappings. This prepares for the next patch.

Patch 4 extends the page table walk path to support page shifts other
than PAGE_SHIFT and eliminates the page table rewalk for huge vmalloc
mappings. The function is renamed from vmap_small_pages_range_noflush()
to vmap_pages_range_noflush_walk().

Patches 5-7 add huge vmap support for contiguous pages, including
support for non-compound pages with pfn alignment verification.

On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
the performance CPUfreq policy enabled, benchmark results:

* ioremap(1 MB): 1.35× faster (3407 ns -> 2526 ns)
* vmalloc(1 MB) mapping time (excluding allocation) with
  VM_ALLOW_HUGE_VMAP: 1.42× faster (5.00 us -> 3.53 us)
* vmap(100 MB) with order-8 pages: 8.3× faster (1235 us -> 149 us)
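
The speedup factors above are the ratio of the old time to the new time.
The small sketch below reproduces that arithmetic and also counts how
many mapping units cover 100 MB, assuming 4K base pages (page_shift 12),
which the benchmark description does not state explicitly. The helper
names speedup() and nr_units() are made up for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Speedup factor: how many times faster "after" is than "before". */
static double speedup(double before, double after)
{
	assert(before > 0.0 && after > 0.0);
	return before / after;
}

/*
 * Number of units needed to cover "bytes" when each unit spans
 * (1 << order) pages of (1 << page_shift) bytes, rounding up.
 */
static size_t nr_units(size_t bytes, unsigned int order,
		       unsigned int page_shift)
{
	size_t unit;

	assert(order + page_shift < sizeof(size_t) * CHAR_BIT);
	unit = (size_t)1 << (order + page_shift);
	return (bytes + unit - 1) / unit;
}
```

With these helpers, speedup(3407, 2526) is about 1.349, speedup(5.00,
3.53) about 1.416 and speedup(1235, 149) about 8.289, matching the
rounded figures. Under the 4K assumption, 100 MB is 25600 base pages but
only 100 order-8 blocks.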

Many thanks to Xueyuan Chen for his testing efforts on RK3588 boards.

Changes since v1:
- Fix condition order and use PMD_SIZE instead of CONT_PMD_SIZE in
  patch 1 (Dev Jain)
- Squash patch 3+4 and patch 5+7 (Dev Jain)
- Replace "zigzag" with "page table rewalk" in commit messages
  (Dev Jain)
- Rename vmap_small_pages_range_noflush() to
  vmap_pages_range_noflush_walk() (Dev Jain)
- Extract vmap_set_ptes() as a new patch to consolidate PTE mapping
  logic between vmap_pte_range() and vmap_pages_pte_range(), handling
  both CONT_PTE and regular mappings (Mike Rapoport)
- Support non-compound pages in get_vmap_batch_order() by falling
  back to physical contiguity scanning with pfn alignment check
  (Dev Jain, Uladzislau Rezki)
- In get_vmap_batch_order(), filter out orders that the architecture
  cannot batch by checking arch_vmap_pte_supported_shift() directly.
  This avoids overhead for orders 1-3 on ARM64 CONT_PTE with 4K
  pages. (patch 5)
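
The last two items can be pictured with the user-space sketch below. It
is not the code from the series: toy_batch_order() and
toy_arch_can_batch() are hypothetical names, and the architecture check
simply hard-codes orders 4 (64K CONT_PTE) and 9 (2M PMD), which is what
ARM64 with 4K pages would be expected to batch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the architecture's "can batch" check. */
static bool toy_arch_can_batch(unsigned int order)
{
	return order == 4 || order == 9;
}

/*
 * Return the largest order, up to max_order, that can be mapped as one
 * batch starting at pfns[0], or 0 if only single pages are possible.
 */
static unsigned int toy_batch_order(const unsigned long *pfns, size_t nr,
				    unsigned int max_order)
{
	unsigned long limit;
	size_t run = 1;

	assert(nr > 0);
	assert(max_order < sizeof(unsigned long) * CHAR_BIT);
	limit = 1UL << max_order;

	/* Length of the physically contiguous run starting at pfns[0]. */
	while (run < nr && run < limit && pfns[run] == pfns[0] + run)
		run++;

	for (unsigned int order = max_order; order > 0; order--) {
		unsigned long pages = 1UL << order;

		if (!toy_arch_can_batch(order))
			continue;	/* architecture cannot batch it */
		if (pages > run)
			continue;	/* run is too short */
		if (pfns[0] & (pages - 1))
			continue;	/* first pfn is not aligned */
		return order;
	}
	return 0;
}
```

In this model a run of 32 contiguous pfns starting at an aligned pfn
yields order 4, while the same run shifted by one pfn, or broken after
8 pages, falls back to order 0.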

Barry Song (Xiaomi) (6):
  arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE
    setup
  arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple
    CONT_PTE
  mm/vmalloc: Extend page table walk to support larger page_shift sizes
    and eliminate page table rewalk
  mm/vmalloc: map contiguous pages in batches for vmap() if possible
  mm/vmalloc: align vm_area so vmap() can batch mappings
  mm/vmalloc: Stop scanning for compound pages after encountering small
    pages in vmap

Wen Jiang (1):
  mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic

 arch/arm64/include/asm/vmalloc.h |   6 +-
 arch/arm64/mm/hugetlbpage.c      |  10 ++
 mm/vmalloc.c                     | 221 ++++++++++++++++++++++++-------
 3 files changed, 189 insertions(+), 48 deletions(-)

--
2.34.1

Re: [PATCH v2 0/7] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory

Posted by Andrew Morton 3 weeks, 2 days ago

On Thu, 14 May 2026 17:41:01 +0800 Wen Jiang <jiangwenxiaomi@gmail.com> wrote:

> This patchset accelerates ioremap, vmalloc, and vmap when the memory
> is physically fully or partially contiguous.
> 
> ...
> 
> On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
> the performance CPUfreq policy enabled, benchmark results:
> 
> * ioremap(1 MB): 1.35× faster (3407 ns -> 2526 ns)
> * vmalloc(1 MB) mapping time (excluding allocation) with
>   VM_ALLOW_HUGE_VMAP: 1.42× faster (5.00 us -> 3.53us)
> * vmap(100MB) with order-8 pages: 8.3× faster (1235 us -> 149 us)

Nice.

AI review found a bunch of things to ask about:
	https://sashiko.dev/#/patchset/20260514094108.2016201-1-jiangwen6@xiaomi.com

It doesn't appear that you'll be getting any more review on this
series, so please check the above questions and resend?

Re: [PATCH v2 0/7] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory

Posted by Wen Jiang 3 weeks, 2 days ago

Hi Andrew,

I've reviewed all the Sashiko findings:

- Patch 2 (fls() truncation risk): Will fix. Replace fls() with
  __fls() to accept unsigned long directly.

- Patch 4 (nr overflow risk): This is a pre-existing type choice, not
  something introduced by this series.

- Patch 4 (missing NULL check before page_to_phys): Will fix.
  Add defensive checks consistent with vmap_pages_pte_range().

- Patch 5 (flush_cache_vmap with empty range): Valid point. Will
  save the original start address and use it for the final flush.

- Patch 5 (virtual address alignment not checked): Addressed by
  Patch 6 in this series.

- Patch 6 (caller tracking loss and while(1) loop): Valid point.
  Will pass caller as a parameter and restructure per Uladzislau's
  suggestion to replace while(1) with explicit sequential attempts.

- Patch 7 (partial cache flush on early break): Same root cause as
  the Patch 5 flush issue.
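
For readers following the Patch 2 item above, here is a user-space model
of the two helpers. These are sketches built on GCC/Clang builtins, not
the kernel's implementations, and the names model_fls() and
model_dunder_fls() are invented for the example. The model shows the
truncation when a wide value passes through an unsigned int parameter,
and also that the two helpers use different conventions: fls() is
1-based and returns 0 for 0, while __fls() is 0-based and undefined for
0.

```c
#include <assert.h>
#include <limits.h>

/* Model of fls(): 1-based index of the highest set bit, 0 if x == 0. */
static int model_fls(unsigned int x)
{
	return x ? (int)(sizeof(x) * CHAR_BIT) - __builtin_clz(x) : 0;
}

/* Model of __fls(): 0-based index of the highest set bit, word != 0. */
static unsigned long model_dunder_fls(unsigned long word)
{
	assert(word != 0);
	return (sizeof(word) * CHAR_BIT - 1) -
	       (unsigned long)__builtin_clzl(word);
}
```

For a nonzero value that fits in an unsigned int, model_fls(x) equals
model_dunder_fls(x) + 1, so a substitution of one helper for the other
has to account for that offset and for the zero case.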

Will send v3 shortly.

Thanks,
Wen

Re: [PATCH v2 0/7] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory

Posted by Uladzislau Rezki 3 weeks, 2 days ago

On Tue, May 19, 2026 at 01:17:38PM -0700, Andrew Morton wrote:
> On Thu, 14 May 2026 17:41:01 +0800 Wen Jiang <jiangwenxiaomi@gmail.com> wrote:
> 
> > This patchset accelerates ioremap, vmalloc, and vmap when the memory
> > is physically fully or partially contiguous.
> > 
> > ...
> > 
> > On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
> > the performance CPUfreq policy enabled, benchmark results:
> > 
> > * ioremap(1 MB): 1.35× faster (3407 ns -> 2526 ns)
> > * vmalloc(1 MB) mapping time (excluding allocation) with
> >   VM_ALLOW_HUGE_VMAP: 1.42× faster (5.00 us -> 3.53us)
> > * vmap(100MB) with order-8 pages: 8.3× faster (1235 us -> 149 us)
> 
> Nice.
> 
> AI review found a bunch of things to ask about:
> 	https://sashiko.dev/#/patchset/20260514094108.2016201-1-jiangwen6@xiaomi.com
> 
> It doesn't appear that you'll be getting any more review on this
> series, so please check the above questions and resend?
> 
Actually, I am keeping an eye on it and I have done some stability
testing, so I just need some time. Fixing the AI findings sounds good.

--
Uladzislau Rezki

Re: [PATCH v2 0/7] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory

Posted by Dev Jain 3 weeks, 2 days ago


On 20/05/26 1:47 am, Andrew Morton wrote:
> On Thu, 14 May 2026 17:41:01 +0800 Wen Jiang <jiangwenxiaomi@gmail.com> wrote:
> 
>> This patchset accelerates ioremap, vmalloc, and vmap when the memory
>> is physically fully or partially contiguous.
>>
>> ...
>>
>> On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
>> the performance CPUfreq policy enabled, benchmark results:
>>
>> * ioremap(1 MB): 1.35× faster (3407 ns -> 2526 ns)
>> * vmalloc(1 MB) mapping time (excluding allocation) with
>>   VM_ALLOW_HUGE_VMAP: 1.42× faster (5.00 us -> 3.53us)
>> * vmap(100MB) with order-8 pages: 8.3× faster (1235 us -> 149 us)
> 
> Nice.
> 
> AI review found a bunch of things to ask about:
> 	https://sashiko.dev/#/patchset/20260514094108.2016201-1-jiangwen6@xiaomi.com
> 
> It doesn't appear that you'll be getting any more review on this
> series, so please check the above questions and resend?

I have to review this but am struggling to find time right now, so
please don't wait for me :)