[v2] mm: speed up ZONE_DEVICE memmap initialization

[PATCH v2 0/7] mm: speed up ZONE_DEVICE memmap initialization

Posted by Li Zhe 3 days, 16 hours ago

memmap_init_zone_device() can spend a substantial amount of time
initializing large ZONE_DEVICE ranges because it repeats nearly
identical struct page setup for every PFN.

This series reduces that overhead in seven steps.

The first patch factors the reusable pieces out of
__init_zone_device_page() so later patches can share the same logic
without changing the existing slow path.

The second patch adds set_page_section_from_pfn(), so generic callers
can update section bits from a PFN without open-coding
SECTION_IN_PAGE_FLAGS checks.
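
As a rough illustration of the helper described above, here is a
user-space sketch, not the code from patch 2. The demo_* names, the
16-bit section field, and the shift values are assumptions made for the
example. Only the idea comes from the description: derive the section
number from the PFN and keep the SECTION_IN_PAGE_FLAGS check inside the
helper so callers do not repeat it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: models a configuration that stores section bits in flags. */
#define DEMO_SECTION_IN_PAGE_FLAGS 1
/* Assumption: 128 MiB sections of 4 KiB pages, so 15 PFN bits per section. */
#define DEMO_PFN_SECTION_SHIFT 15
/* Assumption: a 16-bit section field in the top bits of a 64-bit word. */
#define DEMO_SECTIONS_WIDTH 16
#define DEMO_SECTIONS_PGSHIFT (64 - DEMO_SECTIONS_WIDTH)
#define DEMO_SECTIONS_MASK ((UINT64_C(1) << DEMO_SECTIONS_WIDTH) - 1)

/* Hypothetical stand-in for struct page. */
struct demo_section_page {
	uint64_t flags;
};

/* Callers pass a PFN; the config check lives here, not at call sites. */
static inline void demo_set_page_section_from_pfn(struct demo_section_page *page,
						  uint64_t pfn)
{
#ifdef DEMO_SECTION_IN_PAGE_FLAGS
	uint64_t section = pfn >> DEMO_PFN_SECTION_SHIFT;

	/* Clear the old section bits, then store the new section number. */
	page->flags &= ~(DEMO_SECTIONS_MASK << DEMO_SECTIONS_PGSHIFT);
	page->flags |= (section & DEMO_SECTIONS_MASK) << DEMO_SECTIONS_PGSHIFT;
#else
	/* Section bits are not kept in the flags word: nothing to do. */
	(void)page;
	(void)pfn;
#endif
}
```

The low flag bits are left untouched, so the helper can be called on a
page whose other flags are already set.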

The third patch adds a template-based fast path for ZONE_DEVICE head
pages. Instead of rebuilding the same struct page state for every PFN,
it prepares a reusable page template once and copies it to each
destination page.
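
To make the template idea concrete, a minimal user-space sketch follows.
It is not the code from patch 3: struct demo_tmpl_page and the demo_*
functions are hypothetical stand-ins, and the field values are
placeholders. It only shows the pattern of building the page state once
and copying it to each destination.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct page; the real layout differs. */
struct demo_tmpl_page {
	uint64_t flags;
	uint64_t pgmap;		/* identical for every page in the range */
	uint64_t private_data;
	int32_t refcount;
	int32_t mapcount;
};

/* Slow path: rebuild every field for a single page. */
static void demo_init_one(struct demo_tmpl_page *page, uint64_t pgmap)
{
	memset(page, 0, sizeof(*page));
	page->flags = 0x4;	/* placeholder flag value */
	page->pgmap = pgmap;
	page->refcount = 1;
	page->mapcount = -1;
}

/* Fast path: prepare one template, then copy it to each destination. */
static void demo_init_range(struct demo_tmpl_page *pages, size_t nr,
			    uint64_t pgmap)
{
	struct demo_tmpl_page tmpl;

	demo_init_one(&tmpl, pgmap);
	for (size_t i = 0; i < nr; i++)
		memcpy(&pages[i], &tmpl, sizeof(tmpl));
}
```

Calling demo_init_range() on an array leaves it byte-for-byte identical
to calling demo_init_one() on each element, which is the property the
fast path relies on.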

The fourth patch extends the same template-based approach to compound
tails, so pfns_per_compound > 1 can also benefit from the fast path.

The fifth patch introduces memcpy_streaming() and
memcpy_streaming_drain() as a generic interface for write-once
streaming copies, with a memcpy() fallback for architectures that do
not provide a specialized backend.
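
The shape of such an interface might look like the sketch below. This
is an assumption-laden user-space model, not the contents of patch 5:
the demo_ prefix, the DEMO_ARCH_HAS_STREAMING_COPY guard, and the exact
signatures are all made up for illustration. It shows only the fallback
arrangement, where an architecture without a specialized backend gets
plain memcpy() and an empty drain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical guard: an architecture with a specialized backend would
 * define it and supply its own versions of both helpers.
 */
#ifndef DEMO_ARCH_HAS_STREAMING_COPY
/* Generic fallback: a write-once copy is just a normal copy. */
static inline void demo_memcpy_streaming(void *dst, const void *src,
					 size_t len)
{
	memcpy(dst, src, len);
}

/* Generic fallback: normal stores need no extra ordering step. */
static inline void demo_memcpy_streaming_drain(void)
{
}
#endif
```

A caller would issue any number of demo_memcpy_streaming() calls and
then one demo_memcpy_streaming_drain() before touching the destination
with ordinary stores.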

The sixth patch extends x86 memcpy_flushcache() small fixed-size
fastpaths so struct-page-sized streaming copies can stay on the inline
path.

The last patch switches the zone-device template-copy path over to
memcpy_streaming(). It refreshes PFN-dependent fields in the reusable
template before each copy, keeps pageblock-aligned PFNs on regular
memcpy(), and drains streaming stores before later normal stores update
overlapping or dependent metadata.
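
The control flow described for the last patch can be modelled roughly as
follows. Everything here is hypothetical: the demo_* names, the 512-page
pageblock size, and the single pfn_bits field standing in for the
PFN-dependent fields. The copy helpers just call memcpy() and count
which path ran, so the sketch shows the dispatch and the single drain,
not real streaming stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGEBLOCK_NR_PAGES 512UL	/* assumption, config dependent */

/* Hypothetical stand-in for struct page. */
struct demo_stream_page {
	uint64_t flags;
	uint64_t pfn_bits;	/* stands in for PFN-dependent fields */
};

/* Counters so the chosen path is observable in this model. */
struct demo_copy_stats {
	size_t regular;
	size_t streaming;
	size_t drains;
};

/* Stand-in for a streaming copy: memcpy() plus bookkeeping. */
static void demo_stream_copy(void *dst, const void *src, size_t len,
			     struct demo_copy_stats *st)
{
	memcpy(dst, src, len);
	st->streaming++;
}

/* Stand-in for the drain step: only records that it happened. */
static void demo_stream_drain(struct demo_copy_stats *st)
{
	st->drains++;
}

static void demo_copy_range(struct demo_stream_page *pages,
			    unsigned long start_pfn, size_t nr,
			    struct demo_stream_page *tmpl,
			    struct demo_copy_stats *st)
{
	for (size_t i = 0; i < nr; i++) {
		unsigned long pfn = start_pfn + i;

		/* Refresh the PFN-dependent part of the template. */
		tmpl->pfn_bits = pfn;

		if (pfn % DEMO_PAGEBLOCK_NR_PAGES == 0) {
			/* Pageblock-aligned PFNs stay on regular memcpy(). */
			memcpy(&pages[i], tmpl, sizeof(*tmpl));
			st->regular++;
		} else {
			demo_stream_copy(&pages[i], tmpl, sizeof(*tmpl), st);
		}
	}
	/* Drain once before any later normal stores to these pages. */
	demo_stream_drain(st);
}
```

For a range of 1024 PFNs starting at PFN 512, two copies (PFNs 512 and
1024) take the regular path and the remaining 1022 take the streaming
stand-in, followed by a single drain.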

The optimized path is disabled when the page_ref_set tracepoint is
enabled, sanitized builds remain on the slow path so their
instrumented stores are preserved, and the fast path falls back to the
existing slow path if sizeof(struct page) is not an integral number of
u64 words.
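
The three conditions above amount to a simple eligibility predicate. The
sketch below is illustrative only: demo_fast_path_eligible() and its
parameters are invented names, and in the real series the tracepoint and
sanitizer states would presumably come from kernel facilities rather
than function arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true when the template fast path may be used.
 * page_size stands in for sizeof(struct page).
 */
static bool demo_fast_path_eligible(size_t page_size,
				    bool page_ref_set_tracing,
				    bool sanitized_build)
{
	/* The tracepoint must observe each refcount store. */
	if (page_ref_set_tracing)
		return false;

	/* Sanitized builds keep their instrumented stores. */
	if (sanitized_build)
		return false;

	/* The copy works in u64 words, so the size must divide evenly. */
	return page_size % sizeof(uint64_t) == 0;
}
```

With these rules a 64-byte layout is eligible when tracing and
sanitizers are off, while a 60-byte layout falls back to the slow path.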

Testing
=======

Tests were run in a VM on an Intel Ice Lake server.

Two PMEM configurations were used:
  - a 100 GB fsdax namespace configured with map=dev, which exercises
    the nd_pmem rebind path (pfns_per_compound == 1)
  - a 100 GB devdax namespace configured with align=2097152, which
    exercises the dax_pmem rebind path (pfns_per_compound > 1)

For each configuration, the corresponding driver was unbound and
rebound 30 times. Memmap initialization latency was collected from the
pr_debug() output of memmap_init_zone_device().

The first bind is reported separately, and the average of subsequent
rebinds is used as the steady-state result.

Performance
===========

nd_pmem rebind, 100 GB fsdax namespace, map=dev
  Base(v7.1-rc3):
    First binding: 1486 ms
    Average of subsequent rebinds: 273.52 ms
  With patches 1-3 applied:
    First binding: 1422 ms
    Average of subsequent rebinds: 245.73 ms
  Full series:
    First binding: 1389 ms
    Average of subsequent rebinds: 111.08 ms

dax_pmem rebind, 100 GB devdax namespace, align=2097152
  Base(v7.1-rc3):
    First binding: 1515 ms
    Average of subsequent rebinds: 313.45 ms
  With patches 1-4 applied:
    First binding: 1422 ms
    Average of subsequent rebinds: 256.56 ms
  Full series:
    First binding: 1294 ms
    Average of subsequent rebinds: 110.24 ms
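
For readers who want the numbers above as ratios, the arithmetic can be
reproduced with two small helpers. The function names are invented for
this example; the inputs are the steady-state averages reported above.

```c
#include <assert.h>

/* Speedup factor: how many times faster the new time is. */
static double demo_speedup(double base_ms, double new_ms)
{
	return base_ms / new_ms;
}

/* Latency reduction as a percentage of the baseline. */
static double demo_reduction_pct(double base_ms, double new_ms)
{
	return (base_ms - new_ms) / base_ms * 100.0;
}
```

For the fsdax case, demo_speedup(273.52, 111.08) is roughly 2.46 and
demo_reduction_pct(273.52, 111.08) is roughly 59.4. For the devdax case,
the same calls with 313.45 and 110.24 give roughly 2.84 and 64.8.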

Li Zhe (7):
  mm: factor zone-device page init helpers out of
    __init_zone_device_page
  mm: add a set_page_section_from_pfn() helper
  mm: add a template-based fast path for zone-device page init
  mm: extend the template fast path to zone-device compound tails
  string: introduce memcpy_streaming() helpers
  x86/string: extend memcpy_flushcache() fixed-size fastpaths
  mm: use memcpy_streaming() in zone-device template copies

 arch/x86/include/asm/string_64.h | 100 +++++++++++++---
 include/linux/mm.h               |  19 ++-
 include/linux/string.h           |  18 +++
 mm/mm_init.c                     | 198 +++++++++++++++++++++++++++----
 4 files changed, 294 insertions(+), 41 deletions(-)

---
v1: https://lore.kernel.org/all/20260515082045.63029-1-lizhe.67@bytedance.com/

Changelogs:

v1->v2:
- Move the pageblock-helper split into patch 1, and add a dedicated
  set_page_section_from_pfn() helper so generic callers no longer
  open-code SECTION_IN_PAGE_FLAGS handling. Suggested by Mike Rapoport.
- Drop the v1 32-bit gating and document instead that CONFIG_ZONE_DEVICE
  is 64-bit only because it depends on MEMORY_HOTPLUG. Suggested by
  Mike Rapoport.
- Replace the v1 BUILD_BUG_ON() struct page layout checks with a runtime
  fast-path eligibility check that falls back to the slow path when the
  layout is unsuitable. Suggested by Mike Rapoport.
- Rename the template fast-path helpers to zone_device_* names for
  clarity. Suggested by Mike Rapoport.
- Replace the v1 open-coded arch_optimize_store_u64()/drain() approach
  with a generic memcpy_streaming()/memcpy_streaming_drain() interface,
  and move the x86 optimization under memcpy_flushcache(). Suggested by
  Alistair Popple.
- Split the old v1 streaming-copy patch into three patches: introduce
  the generic helper, extend the x86 backend, and then switch mm over
  to the new interface. This is part of the memcpy-based rework
  suggested by Alistair Popple.
- Refresh the performance section and report whole-series results as
  series-level numbers.
- Fix a missing memcpy_streaming_drain() in the compound-page
  initialization path, following review feedback from Andrew Morton.

-- 
2.20.1

Re: [PATCH v2 0/7] mm: speed up ZONE_DEVICE memmap initialization

Posted by Andrew Morton 2 days, 22 hours ago

On Thu, 21 May 2026 12:01:17 +0800 "Li Zhe" <lizhe.67@bytedance.com> wrote:

> memmap_init_zone_device() can spend a substantial amount of time
> initializing large ZONE_DEVICE ranges because it repeats nearly
> identical struct page setup for every PFN.
> 
> This series reduces that overhead in seven steps.

Cool, thanks, we all love speedups.

> The first patch factors the reusable pieces out of
> __init_zone_device_page() so later patches can share the same logic
> without changing the existing slow path.
> 
> The second patch adds set_page_section_from_pfn(), so generic callers
> can update section bits from a PFN without open-coding
> SECTION_IN_PAGE_FLAGS checks.
> 
> The third patch adds a template-based fast path for ZONE_DEVICE head
> pages. Instead of rebuilding the same struct page state for every PFN,
> it prepares a reusable page template once and copies it to each
> destination page.
> 
> The fourth patch extends the same template-based approach to compound
> tails, so pfns_per_compound > 1 can also benefit from the fast path.
> 
> The fifth patch introduces memcpy_streaming() and
> memcpy_streaming_drain() as a generic interface for write-once
> streaming copies, with a memcpy() fallback for architectures that do
> not provide a specialized backend.
> 
> The sixth patch extends x86 memcpy_flushcache() small fixed-size
> fastpaths so struct-page-sized streaming copies can stay on the inline
> path.
> 
> The last patch switches the zone-device template-copy path over to
> memcpy_streaming(). It refreshes PFN-dependent fields in the reusable
> template before each copy, keeps pageblock-aligned PFNs on regular
> memcpy(), and drains streaming stores before later normal stores update
> overlapping or dependent metadata.
> 
> The optimized path is disabled when the page_ref_set tracepoint is
> enabled, sanitized builds remain on the slow path so their
> instrumented stores are preserved, and the fast path falls back to the
> existing slow path if sizeof(struct page) is not an integral number of
> u64 words.
> 
> Testing
> =======
> 
> Tests were run in a VM on an Intel Ice Lake server.
> 
> Two PMEM configurations were used:
>   - a 100 GB fsdax namespace configured with map=dev, which exercises
>     the nd_pmem rebind path (pfns_per_compound == 1)
>   - a 100 GB devdax namespace configured with align=2097152, which
>     exercises the dax_pmem rebind path (pfns_per_compound > 1)
> 
> For each configuration, the corresponding driver was unbound and
> rebound 30 times. Memmap initialization latency was collected from the
> pr_debug() output of memmap_init_zone_device().
> 
> The first bind is reported separately, and the average of subsequent
> rebinds is used as the steady-state result.

How closely does this workload resemble any real-world user workload?

> Performance
> ===========
> 
> nd_pmem rebind, 100 GB fsdax namespace, map=dev
>   Base(v7.1-rc3):
>     First binding: 1486 ms
>     Average of subsequent rebinds: 273.52 ms
>   With patches 1-3 applied:
>     First binding: 1422 ms
>     Average of subsequent rebinds: 245.73 ms
>   Full series:
>     First binding: 1389 ms
>     Average of subsequent rebinds: 111.08 ms
> 
> dax_pmem rebind, 100 GB devdax namespace, align=2097152
>   Base(v7.1-rc3):
>     First binding: 1515 ms
>     Average of subsequent rebinds: 313.45 ms
>   With patches 1-4 applied:
>     First binding: 1422 ms
>     Average of subsequent rebinds: 256.56 ms
>   Full series:
>     First binding: 1294 ms
>     Average of subsequent rebinds: 110.24 ms

The improvements appear to range between "modest" and "large", but what
I'd like to understand is how frequently real-world users are using
these operations in real-world workloads.

IOW, (and this is always the bottom line), how valuable is this
patchset to our users?

>   mm: factor zone-device page init helpers out of
>     __init_zone_device_page
>   mm: add a set_page_section_from_pfn() helper
>   mm: add a template-based fast path for zone-device page init
>   mm: extend the template fast path to zone-device compound tails
>   string: introduce memcpy_streaming() helpers
>   x86/string: extend memcpy_flushcache() fixed-size fastpaths
>   mm: use memcpy_streaming() in zone-device template copies
> 
>  arch/x86/include/asm/string_64.h | 100 +++++++++++++---
>  include/linux/mm.h               |  19 ++-
>  include/linux/string.h           |  18 +++
>  mm/mm_init.c                     | 198 +++++++++++++++++++++++++++----
>  4 files changed, 294 insertions(+), 41 deletions(-)

I won't take any action at this stage - let's await reviewer input.  If
none is forthcoming then please remind me and I'll figure out what to
do.

The ever-present reviewer called "Sashiko" has thoughts to offer:

	https://sashiko.dev/#/patchset/20260521040124.10608-1-lizhe.67@bytedance.com

Please take a look, decide if there's useful material in there.

Re: [PATCH v2 0/7] mm: speed up ZONE_DEVICE memmap initialization

Posted by Li Zhe 2 days, 12 hours ago

On Thu, 21 May 2026 15:20:29 -0700, akpm@linux-foundation.org wrote:

> On Thu, 21 May 2026 12:01:17 +0800 "Li Zhe" <lizhe.67@bytedance.com> wrote:
> 
> > memmap_init_zone_device() can spend a substantial amount of time
> > initializing large ZONE_DEVICE ranges because it repeats nearly
> > identical struct page setup for every PFN.
> >
> > This series reduces that overhead in seven steps.
> 
> Cool, thanks, we all love speedups.
> 
> > The first patch factors the reusable pieces out of
> > __init_zone_device_page() so later patches can share the same logic
> > without changing the existing slow path.
> >
> > The second patch adds set_page_section_from_pfn(), so generic callers
> > can update section bits from a PFN without open-coding
> > SECTION_IN_PAGE_FLAGS checks.
> >
> > The third patch adds a template-based fast path for ZONE_DEVICE head
> > pages. Instead of rebuilding the same struct page state for every PFN,
> > it prepares a reusable page template once and copies it to each
> > destination page.
> >
> > The fourth patch extends the same template-based approach to compound
> > tails, so pfns_per_compound > 1 can also benefit from the fast path.
> >
> > The fifth patch introduces memcpy_streaming() and
> > memcpy_streaming_drain() as a generic interface for write-once
> > streaming copies, with a memcpy() fallback for architectures that do
> > not provide a specialized backend.
> >
> > The sixth patch extends x86 memcpy_flushcache() small fixed-size
> > fastpaths so struct-page-sized streaming copies can stay on the inline
> > path.
> >
> > The last patch switches the zone-device template-copy path over to
> > memcpy_streaming(). It refreshes PFN-dependent fields in the reusable
> > template before each copy, keeps pageblock-aligned PFNs on regular
> > memcpy(), and drains streaming stores before later normal stores update
> > overlapping or dependent metadata.
> >
> > The optimized path is disabled when the page_ref_set tracepoint is
> > enabled, sanitized builds remain on the slow path so their
> > instrumented stores are preserved, and the fast path falls back to the
> > existing slow path if sizeof(struct page) is not an integral number of
> > u64 words.
> >
> > Testing
> > =======
> >
> > Tests were run in a VM on an Intel Ice Lake server.
> >
> > Two PMEM configurations were used:
> >   - a 100 GB fsdax namespace configured with map=dev, which exercises
> >     the nd_pmem rebind path (pfns_per_compound == 1)
> >   - a 100 GB devdax namespace configured with align=2097152, which
> >     exercises the dax_pmem rebind path (pfns_per_compound > 1)
> >
> > For each configuration, the corresponding driver was unbound and
> > rebound 30 times. Memmap initialization latency was collected from the
> > pr_debug() output of memmap_init_zone_device().
> >
> > The first bind is reported separately, and the average of subsequent
> > rebinds is used as the steady-state result.
> 
> How closely does this workload resemble any real-world user workload?

Not directly. The unbind/rebind loop is mainly a controlled and
repeatable way to measure the memmap_init_zone_device() path with minimal
unrelated noise.

> > Performance
> > ===========
> >
> > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> >   Base(v7.1-rc3):
> >     First binding: 1486 ms
> >     Average of subsequent rebinds: 273.52 ms
> >   With patches 1-3 applied:
> >     First binding: 1422 ms
> >     Average of subsequent rebinds: 245.73 ms
> >   Full series:
> >     First binding: 1389 ms
> >     Average of subsequent rebinds: 111.08 ms
> >
> > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> >   Base(v7.1-rc3):
> >     First binding: 1515 ms
> >     Average of subsequent rebinds: 313.45 ms
> >   With patches 1-4 applied:
> >     First binding: 1422 ms
> >     Average of subsequent rebinds: 256.56 ms
> >   Full series:
> >     First binding: 1294 ms
> >     Average of subsequent rebinds: 110.24 ms
> 
> The improvements appear to range between "modest" and "large", but what
> I'd like to understand is how frequently real-world users are using
> these operations in real-world workloads.
> 
> IOW, (and this is always the bottom line), how valuable is this
> patchset to our users?

This is not a steady-state data-path optimization. Its value is in pmem
bring-up paths. In our deployment we do have scenarios where multiple
pmem devices are hotplugged, so reducing this latency is useful in
practice for us.

> >   mm: factor zone-device page init helpers out of
> >     __init_zone_device_page
> >   mm: add a set_page_section_from_pfn() helper
> >   mm: add a template-based fast path for zone-device page init
> >   mm: extend the template fast path to zone-device compound tails
> >   string: introduce memcpy_streaming() helpers
> >   x86/string: extend memcpy_flushcache() fixed-size fastpaths
> >   mm: use memcpy_streaming() in zone-device template copies
> >
> >  arch/x86/include/asm/string_64.h | 100 +++++++++++++---
> >  include/linux/mm.h               |  19 ++-
> >  include/linux/string.h           |  18 +++
> >  mm/mm_init.c                     | 198 +++++++++++++++++++++++++++----
> >  4 files changed, 294 insertions(+), 41 deletions(-)
> 
> I won't take any action at this stage - let's await reviewer input.  If
> none is forthcoming then please remind me and I'll figure out what to
> do.
> 
> The ever-present reviewer called "Sashiko" has thoughts to offer:
> 
> 	https://sashiko.dev/#/patchset/20260521040124.10608-1-lizhe.67@bytedance.com
> 
> Please take a look, decide if there's useful material in there.

There is useful material there, mainly around patches 5 and 6.

The x86 backend for memcpy_streaming() should be narrower, and the
expanded memcpy_flushcache() small-copy fastpath should cover only
naturally aligned cases and preserve forward movnti store order.

I'll address those points in the next revision and rerun the benchmarks.

Thanks,
Zhe