[v1] mm: speed up ZONE_DEVICE memmap initialization

[PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

Posted by Li Zhe 4 weeks ago

memmap_init_zone_device() can spend a substantial amount of time
initializing large ZONE_DEVICE ranges because it repeats nearly
identical struct page setup for every PFN.

This series reduces that overhead in four steps.

The first patch factors the reusable pieces out of
__init_zone_device_page() so later patches can share the same logic
without changing the existing slow path.

The second patch adds a template-based fast path for zone-device head
pages on 64-bit builds. Instead of rebuilding the same struct page state
for every PFN, it prepares a reusable page template once and copies it
to each destination page.
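
As a rough userspace illustration of the template idea (a sketch only, not
the code in this series): the page is modeled as eight u64 words, the
invariant state is built once, and each destination gets a word-wise copy.
struct mock_page, the MOCK_* word indices, and both helper names are
hypothetical, and the single PFN-dependent word is an assumption made for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word indices; the real struct page layout differs. */
enum {
	MOCK_FLAGS,
	MOCK_PGMAP,
	MOCK_PFN_TAG,	/* stands in for any PFN-dependent state */
	MOCK_REFCOUNT,
	MOCK_PAGE_WORDS = 8,
};

/* 64-byte stand-in for struct page, viewed as eight u64 words. */
struct mock_page {
	uint64_t w[MOCK_PAGE_WORDS];
};

/* Build the state that is identical for every page in the range, once. */
static void mock_template_prepare(struct mock_page *tmpl, uint64_t flags,
				  uint64_t pgmap)
{
	memset(tmpl, 0, sizeof(*tmpl));
	tmpl->w[MOCK_FLAGS] = flags;
	tmpl->w[MOCK_PGMAP] = pgmap;
	tmpl->w[MOCK_REFCOUNT] = 1;
}

/* Copy the template to each page, then fix up the per-PFN word. */
static void mock_init_range(struct mock_page *pages, size_t nr,
			    uint64_t start_pfn, const struct mock_page *tmpl)
{
	size_t i, j;

	for (i = 0; i < nr; i++) {
		for (j = 0; j < MOCK_PAGE_WORDS; j++)
			pages[i].w[j] = tmpl->w[j];
		pages[i].w[MOCK_PFN_TAG] = start_pfn + i;
	}
}
```

A caller would run mock_template_prepare() once per range and then
mock_init_range() over the destination array, so the per-page work reduces
to a fixed number of stores.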

The third patch extends the same template-based approach to compound
tails, so pfns_per_compound > 1 can also benefit from the fast path.

The last patch introduces arch_optimize_store_u64() and
arch_optimize_store_drain(), with a generic fallback and an x86-64
MOVNTI/SFENCE implementation, and uses them in the template-copy hot
path. It also refreshes the PFN-dependent fields in the reusable
template before each copy, so the hot path remains a fixed-offset store
sequence without post-copy normal stores to the destination page.
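
The shape of that hot path can be sketched in userspace as follows. This is
an illustration under stated assumptions, not the helpers from the patch:
sketch_store_u64(), sketch_store_drain(), SKETCH_WORDS, and the pfn_word
parameter are hypothetical names, and the real arch_optimize_store_u64() and
arch_optimize_store_drain() signatures may differ. On x86-64 with GCC or
Clang it uses MOVNTI and SFENCE; elsewhere it falls back to plain stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_WORDS 8	/* assumed page size in u64 words */

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
/* Non-temporal 64-bit store: bypasses the cache hierarchy. */
static inline void sketch_store_u64(uint64_t *dst, uint64_t val)
{
	__asm__ volatile("movntiq %1, %0" : "=m" (*dst) : "r" (val));
}

/* Order the preceding non-temporal stores before later stores. */
static inline void sketch_store_drain(void)
{
	__asm__ volatile("sfence" ::: "memory");
}
#else
/* Generic fallback: ordinary store, nothing to drain. */
static inline void sketch_store_u64(uint64_t *dst, uint64_t val)
{
	*dst = val;
}

static inline void sketch_store_drain(void)
{
}
#endif

/*
 * Refresh the PFN-dependent word in the template first, then emit the
 * whole page as a fixed-offset store sequence, so the destination never
 * needs an ordinary store after the copy.
 */
static void sketch_init_range(uint64_t (*pages)[SKETCH_WORDS], size_t nr,
			      uint64_t *tmpl, size_t pfn_word,
			      uint64_t start_pfn)
{
	size_t i, j;

	for (i = 0; i < nr; i++) {
		tmpl[pfn_word] = start_pfn + i;
		for (j = 0; j < SKETCH_WORDS; j++)
			sketch_store_u64(&pages[i][j], tmpl[j]);
	}
	sketch_store_drain();	/* one drain for the whole range */
}
```

Draining once after the loop, rather than after every page, keeps the
fence off the per-page path in this sketch.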

The optimized path is intentionally limited to 64-bit builds for now.
The current fast path copies struct page as a sequence of fixed-offset
u64 stores and relies on the set of struct page layouts used by current
64-bit configurations. Extending the same optimization to 32-bit builds
would need a separate store layout scheme and its own validation, so
this series keeps 32-bit on the existing slow path.

The helper interface is also meant to leave room for other
architecture-specific backends. This series only adds an x86-64
implementation because that is the only platform I was able to measure.
Other architectures, including arm64, may be able to add their own
optimized backend in follow-up work, but I do not currently have arm64
hardware available to validate that.

To preserve observability, the optimized path is disabled when the
page_ref_set tracepoint is enabled, because the template-copy path
bypasses set_page_count() and would otherwise hide that trace event.

Patch summary:
  1. mm: factor zone-device page init helpers out of __init_zone_device_page
  2. mm: add a template-based fast path for zone-device page init
  3. mm: extend the template fast path to zone-device compound tails
  4. mm: use arch store helpers in zone-device template copies

Testing
=======

Tests were run in a VM on an Intel Ice Lake server.

Two PMEM configurations were used:
  - a 100 GB fsdax namespace configured with map=dev, which exercises
    the nd_pmem rebind path (pfns_per_compound == 1)
  - a 100 GB devdax namespace configured with align=2097152, which
    exercises the dax_pmem rebind path (pfns_per_compound > 1)

For each configuration, the corresponding driver was unbound and
rebound 30 times. Memmap initialization latency was collected from the
pr_debug() output of memmap_init_zone_device().

The first bind is reported separately, and the average of subsequent
rebinds is used as the steady-state result.

Performance
===========
nd_pmem rebind, 100 GB fsdax namespace, map=dev
  Base(v7.1-rc3):
    First binding: 1486 ms
    Average of subsequent rebinds: 273.52 ms
  Full series:
    First binding: 1272 ms
    Average of subsequent rebinds: 104.59 ms

dax_pmem rebind, 100 GB devdax namespace, align=2097152
  Base(v7.1-rc3):
    First binding: 1515 ms
    Average of subsequent rebinds: 313.45 ms
  Full series:
    First binding: 1286 ms
    Average of subsequent rebinds: 116.93 ms
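
For reference, the relative change can be derived from the figures above.
Steady-state rebind latency drops by roughly 61.8% (about 2.6x) for nd_pmem
and roughly 62.7% (about 2.7x) for dax_pmem, while the first bind drops by
roughly 14.4% and 15.1% respectively. A minimal sketch of that arithmetic,
with hypothetical helper names:

```c
#include <assert.h>

/* Percentage reduction in latency relative to the baseline. */
static double reduction_pct(double base_ms, double series_ms)
{
	return (base_ms - series_ms) / base_ms * 100.0;
}

/* Speedup factor: how many times faster the series is than the baseline. */
static double speedup(double base_ms, double series_ms)
{
	return base_ms / series_ms;
}
```

For example, reduction_pct(273.52, 104.59) is about 61.8 and
speedup(313.45, 116.93) is about 2.68.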

Li Zhe (4):
  mm: factor zone-device page init helpers out of
    __init_zone_device_page
  mm: add a template-based fast path for zone-device page init
  mm: extend the template fast path to zone-device compound tails
  mm: use arch store helpers in zone-device template copies

 arch/x86/include/asm/struct_page_init.h |  28 +++
 include/asm-generic/Kbuild              |   1 +
 include/asm-generic/struct_page_init.h  |  17 ++
 mm/mm_init.c                            | 260 +++++++++++++++++++++---
 4 files changed, 280 insertions(+), 26 deletions(-)
 create mode 100644 arch/x86/include/asm/struct_page_init.h
 create mode 100644 include/asm-generic/struct_page_init.h

-- 
2.20.1

Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

Posted by Mike Rapoport 3 weeks, 4 days ago

Hi,

On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> memmap_init_zone_device() can spend a substantial amount of time
> initializing large ZONE_DEVICE ranges because it repeats nearly
> identical struct page setup for every PFN.
> 
> This series reduces that overhead in four steps.
>
> Testing
> =======
> 
> Tests were run in a VM on an Intel Ice Lake server.
> 
> Two PMEM configurations were used:
>   - a 100 GB fsdax namespace configured with map=dev, which exercises
>     the nd_pmem rebind path (pfns_per_compound == 1)
>   - a 100 GB devdax namespace configured with align=2097152, which
>     exercises the dax_pmem rebind path (pfns_per_compound > 1)
> 
> For each configuration, the corresponding driver was unbound and
> rebound 30 times. Memmap initialization latency was collected from the
> pr_debug() output of memmap_init_zone_device().
> 
> The first bind is reported separately, and the average of subsequent
> rebinds is used as the steady-state result.
> 
> Performance
> ===========
> nd_pmem rebind, 100 GB fsdax namespace, map=dev
>   Base(v7.1-rc3):
>     First binding: 1486 ms
>     Average of subsequent rebinds: 273.52 ms
>   Full series:
>     First binding: 1272 ms
>     Average of subsequent rebinds: 104.59 ms
> 
> dax_pmem rebind, 100 GB devdax namespace, align=2097152
>   Base(v7.1-rc3):
>     First binding: 1515 ms
>     Average of subsequent rebinds: 313.45 ms
>   Full series:
>     First binding: 1286 ms
>     Average of subsequent rebinds: 116.93 ms

This is a really good improvement!

It would also be interesting to see how the template approach would improve
"normal" memory map initialization.
 
> Li Zhe (4):
>   mm: factor zone-device page init helpers out of
>     __init_zone_device_page
>   mm: add a template-based fast path for zone-device page init
>   mm: extend the template fast path to zone-device compound tails
>   mm: use arch store helpers in zone-device template copies
> 
>  arch/x86/include/asm/struct_page_init.h |  28 +++
>  include/asm-generic/Kbuild              |   1 +
>  include/asm-generic/struct_page_init.h  |  17 ++
>  mm/mm_init.c                            | 260 +++++++++++++++++++++---
>  4 files changed, 280 insertions(+), 26 deletions(-)
>  create mode 100644 arch/x86/include/asm/struct_page_init.h
>  create mode 100644 include/asm-generic/struct_page_init.h
> 
> -- 
> 2.20.1
> 

-- 
Sincerely yours,
Mike.

Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

Posted by Li Zhe 3 weeks, 4 days ago

On Mon, 18 May 2026 09:23:33 +0300, rppt@kernel.org wrote:

> Hi,
> 
> On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> > memmap_init_zone_device() can spend a substantial amount of time
> > initializing large ZONE_DEVICE ranges because it repeats nearly
> > identical struct page setup for every PFN.
> > 
> > This series reduces that overhead in four steps.
> >
> > Testing
> > =======
> > 
> > Tests were run in a VM on an Intel Ice Lake server.
> > 
> > Two PMEM configurations were used:
> >   - a 100 GB fsdax namespace configured with map=dev, which exercises
> >     the nd_pmem rebind path (pfns_per_compound == 1)
> >   - a 100 GB devdax namespace configured with align=2097152, which
> >     exercises the dax_pmem rebind path (pfns_per_compound > 1)
> > 
> > For each configuration, the corresponding driver was unbound and
> > rebound 30 times. Memmap initialization latency was collected from the
> > pr_debug() output of memmap_init_zone_device().
> > 
> > The first bind is reported separately, and the average of subsequent
> > rebinds is used as the steady-state result.
> > 
> > Performance
> > ===========
> > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> >   Base(v7.1-rc3):
> >     First binding: 1486 ms
> >     Average of subsequent rebinds: 273.52 ms
> >   Full series:
> >     First binding: 1272 ms
> >     Average of subsequent rebinds: 104.59 ms
> > 
> > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> >   Base(v7.1-rc3):
> >     First binding: 1515 ms
> >     Average of subsequent rebinds: 313.45 ms
> >   Full series:
> >     First binding: 1286 ms
> >     Average of subsequent rebinds: 116.93 ms
> 
> This is really good improvement!
> 
> It would be also interesting to see how the template approach would improve
> "normal" memory map initialization.

I also experimented with this approach earlier. Unfortunately, in the
normal memory map initialization path, functions such as
deferred_free_pages() are invoked shortly after struct page
initialization, and they both read and write members of the struct
page.

Non-temporal stores via MOVNTI are primarily beneficial for streaming
write operations, where the cache lines written are not expected to be
reused by the CPU in the near future. In this case, however, data
written using MOVNTI is immediately accessed again through regular load
and store instructions. This results in an access pattern that resembles
a write-then-reuse workload rather than a pure streaming store.

Consequently, non-temporal stores do not deliver the expected reduction
in cache pollution, and using MOVNTI provides no measurable performance
benefit for this particular workload.

That said, a template-based approach can still accelerate initialization.
Based on measurements from this patchset, it should improve performance
on the generic path by roughly 10%. I would appreciate feedback on
whether such an optimization is still considered useful.

Thanks,
Zhe

Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

Posted by Mike Rapoport 3 weeks, 2 days ago

On Mon, May 18, 2026 at 04:57:00PM +0800, Li Zhe wrote:
> On Mon, 18 May 2026 09:23:33 +0300, rppt@kernel.org wrote:
> > On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> > > 
> > > Performance
> > > ===========
> > > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> > >   Base(v7.1-rc3):
> > >     First binding: 1486 ms
> > >     Average of subsequent rebinds: 273.52 ms
> > >   Full series:
> > >     First binding: 1272 ms
> > >     Average of subsequent rebinds: 104.59 ms
> > > 
> > > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> > >   Base(v7.1-rc3):
> > >     First binding: 1515 ms
> > >     Average of subsequent rebinds: 313.45 ms
> > >   Full series:
> > >     First binding: 1286 ms
> > >     Average of subsequent rebinds: 116.93 ms
> > 
> > This is really good improvement!
> > 
> > It would be also interesting to see how the template approach would improve
> > "normal" memory map initialization.
> 
> I also experimented with this approach earlier. Unfortunately, in the
> normal memory map initialization path, functions such as
> deferred_free_pages() are invoked shortly after struct page
> initialization, and this function performs both read and write accesses
> to members of the struct page.
> 
> Non-temporal stores via MOVNTI are primarily beneficial for streaming
> write operations, where the cache lines written are not expected to be
> reused by the CPU in the near future. In this case, however, data
> written using MOVNTI is immediately accessed again through regular load
> and store instructions. This results in an access pattern that resembles
> a write-then-reuse workload rather than a pure streaming store.
> 
> Consequently, non-temporal stores do not deliver the expected reduction
> in cache pollution, and using MOVNTI provides no measurable performance
> benefit for this particular workload.

We can split initialization and freeing into separate loops if there is an
overall benefit, but this needs to be verified on other major architectures
as well.
 
> That said, a template-based approach can still accelerate initialization.
> Based on measurements from this patchset, it should improve performance
> on the generic path by roughly 10%. I would appreciate feedback on
> whether such an optimization is still considered useful.

Improving the memory map initialization by 10% is valuable.
 
> Thanks,
> Zhe

-- 
Sincerely yours,
Mike.

Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

Posted by Li Zhe 3 weeks, 1 day ago

On Wed, 20 May 2026 09:20:18 +0300, rppt@kernel.org wrote:

> On Mon, May 18, 2026 at 04:57:00PM +0800, Li Zhe wrote:
> > On Mon, 18 May 2026 09:23:33 +0300, rppt@kernel.org wrote:
> > > On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> > > >
> > > > Performance
> > > > ===========
> > > > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> > > >   Base(v7.1-rc3):
> > > >     First binding: 1486 ms
> > > >     Average of subsequent rebinds: 273.52 ms
> > > >   Full series:
> > > >     First binding: 1272 ms
> > > >     Average of subsequent rebinds: 104.59 ms
> > > >
> > > > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> > > >   Base(v7.1-rc3):
> > > >     First binding: 1515 ms
> > > >     Average of subsequent rebinds: 313.45 ms
> > > >   Full series:
> > > >     First binding: 1286 ms
> > > >     Average of subsequent rebinds: 116.93 ms
> > >
> > > This is really good improvement!
> > >
> > > It would be also interesting to see how the template approach would improve
> > > "normal" memory map initialization.
> >
> > I also experimented with this approach earlier. Unfortunately, in the
> > normal memory map initialization path, functions such as
> > deferred_free_pages() are invoked shortly after struct page
> > initialization, and this function performs both read and write accesses
> > to members of the struct page.
> >
> > Non-temporal stores via MOVNTI are primarily beneficial for streaming
> > write operations, where the cache lines written are not expected to be
> > reused by the CPU in the near future. In this case, however, data
> > written using MOVNTI is immediately accessed again through regular load
> > and store instructions. This results in an access pattern that resembles
> > a write-then-reuse workload rather than a pure streaming store.
> >
> > Consequently, non-temporal stores do not deliver the expected reduction
> > in cache pollution, and using MOVNTI provides no measurable performance
> > benefit for this particular workload.
> 
> We can split initialization and freeing into separate loops if there is
> overall benefit, but this needs to be verified on other major architectures
> as well.

I agree with your point.

> > That said, a template-based approach can still accelerate initialization.
> > Based on measurements from this patchset, it should improve performance
> > on the generic path by roughly 10%. I would appreciate feedback on
> > whether such an optimization is still considered useful.
> 
> Improving the memory map initialization by 10% is valuable.

Thank you for your feedback. I will try the optimization after finishing
the current patchset.

Thanks,
Zhe

Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

Posted by Alistair Popple 3 weeks, 1 day ago

On 2026-05-20 at 21:57 +1000, Li Zhe <lizhe.67@bytedance.com> wrote...
> On Wed, 20 May 2026 09:20:18 +0300, rppt@kernel.org wrote:
> 
> > On Mon, May 18, 2026 at 04:57:00PM +0800, Li Zhe wrote:
> > > On Mon, 18 May 2026 09:23:33 +0300, rppt@kernel.org wrote:
> > > > On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> > > > >
> > > > > Performance
> > > > > ===========
> > > > > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> > > > >   Base(v7.1-rc3):
> > > > >     First binding: 1486 ms
> > > > >     Average of subsequent rebinds: 273.52 ms
> > > > >   Full series:
> > > > >     First binding: 1272 ms
> > > > >     Average of subsequent rebinds: 104.59 ms
> > > > >
> > > > > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> > > > >   Base(v7.1-rc3):
> > > > >     First binding: 1515 ms
> > > > >     Average of subsequent rebinds: 313.45 ms
> > > > >   Full series:
> > > > >     First binding: 1286 ms
> > > > >     Average of subsequent rebinds: 116.93 ms
> > > >
> > > > This is really good improvement!
> > > >
> > > > It would be also interesting to see how the template approach would improve
> > > > "normal" memory map initialization.
> > >
> > > I also experimented with this approach earlier. Unfortunately, in the
> > > normal memory map initialization path, functions such as
> > > deferred_free_pages() are invoked shortly after struct page
> > > initialization, and this function performs both read and write accesses
> > > to members of the struct page.
> > >
> > > Non-temporal stores via MOVNTI are primarily beneficial for streaming
> > > write operations, where the cache lines written are not expected to be
> > > reused by the CPU in the near future. In this case, however, data
> > > written using MOVNTI is immediately accessed again through regular load
> > > and store instructions. This results in an access pattern that resembles
> > > a write-then-reuse workload rather than a pure streaming store.
> > >
> > > Consequently, non-temporal stores do not deliver the expected reduction
> > > in cache pollution, and using MOVNTI provides no measurable performance
> > > benefit for this particular workload.
> > 
> > We can split initialization and freeing into separate loops if there is
> > overall benefit, but this needs to be verified on other major architectures
> > as well.
> 
> I agree with your point.
> 
> > > That said, a template-based approach can still accelerate initialization.
> > > Based on measurements from this patchset, it should improve performance
> > > on the generic path by roughly 10%. I would appreciate feedback on
> > > whether such an optimization is still considered useful.
> > 
> > Improving the memory map initialization by 10% is valuable.

Agreed. GPUs have to hotplug hundreds of GB of ZONE_DEVICE memory, so any
improvement here is valuable. Thanks for looking at it.

 - Alistair

> Thank you for your feedback. I will try the optimization after finishing
> the current patchset.
> 
> Thanks,
> Zhe
>

Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

Posted by Li Zhe 3 weeks, 1 day ago

On Thu, 21 May 2026 08:36:08 +1000, apopple@nvidia.com wrote:

> On 2026-05-20 at 21:57 +1000, Li Zhe <lizhe.67@bytedance.com> wrote...
> > On Wed, 20 May 2026 09:20:18 +0300, rppt@kernel.org wrote:
> >
> > > On Mon, May 18, 2026 at 04:57:00PM +0800, Li Zhe wrote:
> > > > On Mon, 18 May 2026 09:23:33 +0300, rppt@kernel.org wrote:
> > > > > On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> > > > > >
> > > > > > Performance
> > > > > > ===========
> > > > > > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> > > > > >   Base(v7.1-rc3):
> > > > > >     First binding: 1486 ms
> > > > > >     Average of subsequent rebinds: 273.52 ms
> > > > > >   Full series:
> > > > > >     First binding: 1272 ms
> > > > > >     Average of subsequent rebinds: 104.59 ms
> > > > > >
> > > > > > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> > > > > >   Base(v7.1-rc3):
> > > > > >     First binding: 1515 ms
> > > > > >     Average of subsequent rebinds: 313.45 ms
> > > > > >   Full series:
> > > > > >     First binding: 1286 ms
> > > > > >     Average of subsequent rebinds: 116.93 ms
> > > > >
> > > > > This is really good improvement!
> > > > >
> > > > > It would be also interesting to see how the template approach would improve
> > > > > "normal" memory map initialization.
> > > >
> > > > I also experimented with this approach earlier. Unfortunately, in the
> > > > normal memory map initialization path, functions such as
> > > > deferred_free_pages() are invoked shortly after struct page
> > > > initialization, and this function performs both read and write accesses
> > > > to members of the struct page.
> > > >
> > > > Non-temporal stores via MOVNTI are primarily beneficial for streaming
> > > > write operations, where the cache lines written are not expected to be
> > > > reused by the CPU in the near future. In this case, however, data
> > > > written using MOVNTI is immediately accessed again through regular load
> > > > and store instructions. This results in an access pattern that resembles
> > > > a write-then-reuse workload rather than a pure streaming store.
> > > >
> > > > Consequently, non-temporal stores do not deliver the expected reduction
> > > > in cache pollution, and using MOVNTI provides no measurable performance
> > > > benefit for this particular workload.
> > >
> > > We can split initialization and freeing into separate loops if there is
> > > overall benefit, but this needs to be verified on other major architectures
> > > as well.
> >
> > I agree with your point.
> >
> > > > That said, a template-based approach can still accelerate initialization.
> > > > Based on measurements from this patchset, it should improve performance
> > > > on the generic path by roughly 10%. I would appreciate feedback on
> > > > whether such an optimization is still considered useful.
> > >
> > > Improving the memory map initialization by 10% is valuable.
> 
> Agree with that - GPUs have to hotplug 100's GB of ZONE_DEVICE memory so any
> improvement here is valuable. Thanks for looking at it.

Thank you for your feedback. I will try the optimization after finishing
the current patchset.

Thanks,
Zhe