memmap_init_zone_device() can spend a substantial amount of time
initializing large ZONE_DEVICE ranges because it repeats nearly
identical struct page setup for every PFN.
This series reduces that overhead in eight steps.
The first patch fixes a stale comment in __init_zone_device_page() so
the documented refcount policy matches the current ZONE_DEVICE code.
The second patch factors the reusable pieces out of
__init_zone_device_page() so later patches can share the same logic
without changing the existing slow path.
The third patch adds set_page_section_from_pfn(), so callers that want
to refresh section bits from a PFN no longer need to open-code
SECTION_IN_PAGE_FLAGS handling.
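
For readers unfamiliar with section bits, the idea behind such a helper can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: the `sketch_*` names, the field width, and the shift values are hypothetical and chosen for illustration; the real layout depends on the kernel configuration, and this is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout only; real widths depend on the kernel config. */
#define SKETCH_PFN_SECTION_SHIFT 15u	/* assumed: 2^15 pages per section */
#define SKETCH_SECTIONS_WIDTH    19u	/* assumed width of the section field */
#define SKETCH_SECTIONS_PGSHIFT  (64u - SKETCH_SECTIONS_WIDTH)
#define SKETCH_SECTIONS_MASK     ((UINT64_C(1) << SKETCH_SECTIONS_WIDTH) - 1)

/* Hypothetical stand-in for struct page: only the flags word matters here. */
struct sketch_page {
	uint64_t flags;
};

static uint64_t sketch_pfn_to_section_nr(uint64_t pfn)
{
	return pfn >> SKETCH_PFN_SECTION_SHIFT;
}

/* Clear the old section field, then store the section derived from pfn. */
static void sketch_set_page_section_from_pfn(struct sketch_page *page,
					     uint64_t pfn)
{
	uint64_t nr = sketch_pfn_to_section_nr(pfn);

	assert(nr <= SKETCH_SECTIONS_MASK);
	page->flags &= ~(SKETCH_SECTIONS_MASK << SKETCH_SECTIONS_PGSHIFT);
	page->flags |= nr << SKETCH_SECTIONS_PGSHIFT;
}

static uint64_t sketch_page_to_section(const struct sketch_page *page)
{
	return (page->flags >> SKETCH_SECTIONS_PGSHIFT) & SKETCH_SECTIONS_MASK;
}
```

The point of wrapping this in one helper is that a caller refreshing a copied page image only has to pass the new PFN; the low flag bits are left untouched.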
The fourth patch adds a template-based fast path for ZONE_DEVICE head
pages. Instead of rebuilding the same struct page state for every PFN,
it prepares one reusable template through the existing slow path,
refreshes the PFN-dependent fields in that template, and copies it to
each destination page.
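
The template idea can be illustrated with a small userspace model. This is a sketch, not the patch itself: `struct sketch_page`, its fields, and the `sketch_*` functions are hypothetical stand-ins, and the single `pfn_tag` field stands in for whatever PFN-dependent state the real struct page carries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct page. */
struct sketch_page {
	uint64_t flags;		/* PFN-independent state */
	void *pgmap;		/* PFN-independent state */
	uint32_t refcount;	/* PFN-independent state (value is arbitrary) */
	uint32_t pfn_tag;	/* stand-in for PFN-dependent state */
};

/* Slow path: rebuild the complete page state from scratch. */
static void sketch_init_page_slow(struct sketch_page *page, void *pgmap,
				  uint64_t pfn)
{
	memset(page, 0, sizeof(*page));
	page->flags = 0xabc;
	page->pgmap = pgmap;
	page->refcount = 1;
	page->pfn_tag = (uint32_t)pfn;
}

/* Fast path: one slow-path init, then patch-and-copy for the rest. */
static void sketch_init_range(struct sketch_page *pages, size_t nr_pages,
			      void *pgmap, uint64_t start_pfn)
{
	struct sketch_page tmpl;
	size_t i;

	assert(pages != NULL || nr_pages == 0);
	if (!nr_pages)
		return;

	/* Seed the template from the first real page. */
	sketch_init_page_slow(&pages[0], pgmap, start_pfn);
	memcpy(&tmpl, &pages[0], sizeof(tmpl));

	for (i = 1; i < nr_pages; i++) {
		/* Refresh only the PFN-dependent field, then copy. */
		tmpl.pfn_tag = (uint32_t)(start_pfn + i);
		memcpy(&pages[i], &tmpl, sizeof(tmpl));
	}
}
```

In this model the per-page work shrinks to one field update plus one fixed-size copy, which is the property the fast path relies on.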
The fifth patch extends the same template-based approach to compound
tails, so pfns_per_compound > 1 can also benefit from the fast path.
The sixth patch introduces memcpy_streaming() and
memcpy_streaming_drain() as a generic interface for write-once copies.
Architectures that do not provide a specialized backend, or cases that
cannot safely use one, fall back to memcpy().
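
A userspace approximation of such an interface might look like the sketch below. The assumptions are significant: the `sketch_*` names are hypothetical, the SSE2 intrinsics `_mm_stream_si64()` and `_mm_sfence()` stand in for an architecture backend, and the 8-byte alignment and length rule is chosen for illustration rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>
#define SKETCH_HAVE_STREAMING 1
#else
#define SKETCH_HAVE_STREAMING 0
#endif

/*
 * Copy len bytes for a write-once destination.
 * Returns 1 if non-temporal stores were used, 0 on memcpy() fallback.
 */
static int sketch_memcpy_streaming(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
#if SKETCH_HAVE_STREAMING
	/* Assumed rule: stream only if the whole copy is 8-byte granular. */
	if (((uintptr_t)dst % 8) == 0 && (len % 8) == 0) {
		long long *d = dst;
		const unsigned char *s = src;
		size_t i;

		for (i = 0; i < len / 8; i++) {
			long long v;

			memcpy(&v, s + i * 8, sizeof(v));
			_mm_stream_si64(&d[i], v);
		}
		return 1;
	}
#endif
	memcpy(dst, src, len);
	return 0;
}

/* Order earlier streaming stores before any later dependent stores. */
static void sketch_memcpy_streaming_drain(void)
{
#if SKETCH_HAVE_STREAMING
	_mm_sfence();
#endif
}
```

On targets without the backend both functions collapse to plain memcpy() and a no-op, so callers need no architecture checks of their own.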
The seventh patch extends x86 memcpy_flushcache() small fixed-size
fastpaths so struct-page-sized streaming copies can stay on the inline
path when alignment permits.
The last patch switches the ZONE_DEVICE template-copy path over to
memcpy_streaming(). It keeps pageblock-aligned PFNs on regular memcpy(),
uses memcpy_streaming() for the remaining write-once copies, and drains
streaming stores before later metadata updates that may depend on them.
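
The selection policy described above can be modelled as follows. This sketch only shows which copy routine is chosen and where the drain sits. The `sketch_*` names are hypothetical, the streaming copy is a plain memcpy() stand-in, the pageblock size of 512 pages is an assumption, and the per-PFN template refresh is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGEBLOCK_NR_PAGES 512u	/* assumed pageblock size in pages */

struct sketch_page {
	uint64_t words[8];		/* 64-byte stand-in for struct page */
};

struct sketch_stats {
	size_t regular;			/* copies done with memcpy() */
	size_t streaming;		/* copies done with the streaming helper */
	size_t drains;			/* number of drain calls */
};

/* Stand-in for a streaming copy; a real backend would bypass the cache. */
static void sketch_copy_streaming(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

/* Stand-in for a drain; a real backend would issue a store fence. */
static void sketch_drain(struct sketch_stats *st)
{
	st->drains++;
}

static void sketch_copy_template(struct sketch_page *pages, uint64_t start_pfn,
				 size_t nr_pages,
				 const struct sketch_page *tmpl,
				 struct sketch_stats *st)
{
	size_t i;

	assert(SKETCH_PAGEBLOCK_NR_PAGES != 0);
	for (i = 0; i < nr_pages; i++) {
		uint64_t pfn = start_pfn + i;

		if (pfn % SKETCH_PAGEBLOCK_NR_PAGES == 0) {
			/* Pageblock-aligned PFN: stay on regular memcpy(). */
			memcpy(&pages[i], tmpl, sizeof(*tmpl));
			st->regular++;
		} else {
			/* Remaining write-once copies: streaming path. */
			sketch_copy_streaming(&pages[i], tmpl, sizeof(*tmpl));
			st->streaming++;
		}
	}
	/* Drain before any later metadata update that depends on the copies. */
	sketch_drain(st);
}
```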
This is not intended as a steady-state data-path optimization. Its
benefit is in pmem bring-up paths where memmap_init_zone_device()
dominates device online / rebind latency, such as:
- fsdax or devdax namespace creation and reconfiguration
- nd_pmem / dax_pmem driver bind or rebind
In those paths, the kernel initializes a large vmemmap range once and
does not immediately benefit from keeping the copied struct page state
hot in cache. Reducing write-allocate traffic in that one-time setup
path can therefore reduce end-to-end device bring-up latency.
The optimized path is disabled when the page_ref_set tracepoint is
enabled, and sanitized builds remain on the slow path so their
instrumented stores are preserved.
Testing
=======

Tests were run in a VM on an Intel Ice Lake server.

Two PMEM configurations were used:
- a 100 GB fsdax namespace configured with map=dev, which exercises
  the nd_pmem rebind path (pfns_per_compound == 1)
- a 100 GB devdax namespace configured with align=2097152, which
  exercises the dax_pmem rebind path (pfns_per_compound > 1)

For each configuration, the corresponding driver was unbound and
rebound 30 times. Memmap initialization latency was collected from the
pr_debug() output of memmap_init_zone_device().

The first bind is reported separately, and the average of subsequent
rebinds is used as the steady-state result.

Performance
===========

nd_pmem rebind, 100 GB fsdax namespace, map=dev
  Base (v7.1-rc6):
    First binding:                 1466 ms
    Average of subsequent rebinds: 262.12 ms
  Full series:
    First binding:                 1359 ms
    Average of subsequent rebinds: 108.36 ms

dax_pmem rebind, 100 GB devdax namespace, align=2097152
  Base (v7.1-rc6):
    First binding:                 1430 ms
    Average of subsequent rebinds: 229.12 ms
  Full series:
    First binding:                 1273 ms
    Average of subsequent rebinds: 100.17 ms
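
To put the figures above in relative terms: the steady-state rebind time drops by roughly 58.7% (about 2.42x faster) for nd_pmem and roughly 56.3% (about 2.29x faster) for dax_pmem, while the first binding improves by only about 7% and 11% respectively. The arithmetic is sketched below; the `sketch_*` helper names are hypothetical and not part of the series.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of base time to new time: values above 1.0 mean "faster". */
static double sketch_speedup(double base_ms, double new_ms)
{
	assert(base_ms > 0.0 && new_ms > 0.0);
	return base_ms / new_ms;
}

/* Share of the base time that was saved, in percent. */
static double sketch_reduction_pct(double base_ms, double new_ms)
{
	assert(base_ms > 0.0);
	return (base_ms - new_ms) / base_ms * 100.0;
}

/* Print one comparison line for a named measurement. */
static void sketch_report(const char *name, double base_ms, double new_ms)
{
	printf("%-24s %.2fx faster, %.1f%% less time\n", name,
	       sketch_speedup(base_ms, new_ms),
	       sketch_reduction_pct(base_ms, new_ms));
}
```

For example, `sketch_report("nd_pmem rebind", 262.12, 108.36)` reports the steady-state nd_pmem comparison.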
Li Zhe (8):
  mm: fix stale ZONE_DEVICE refcount comment
  mm: factor zone-device page init helpers out of
    __init_zone_device_page
  mm: add a set_page_section_from_pfn() helper
  mm: add a template-based fast path for zone-device page init
  mm: extend the template fast path to zone-device compound tails
  string: introduce memcpy_streaming() helpers
  x86/string: extend memcpy_flushcache() fixed-size fastpaths
  mm: use memcpy_streaming() in zone-device template copies

 arch/x86/include/asm/string_64.h | 140 ++++++++++++++++++--
 include/linux/mm.h               |  19 ++-
 include/linux/string.h           |  20 +++
 mm/mm_init.c                     | 221 +++++++++++++++++++++++++++----
 4 files changed, 360 insertions(+), 40 deletions(-)
---
v3: https://lore.kernel.org/all/20260527033636.28231-1-lizhe.67@bytedance.com/
v2: https://lore.kernel.org/all/20260521040124.10608-1-lizhe.67@bytedance.com/
v1: https://lore.kernel.org/all/20260515082045.63029-1-lizhe.67@bytedance.com/
Changelogs:

v3->v4:
- Rebase the series from v7.1-rc3 to v7.1-rc6.
- Rework patch 4 so the reusable head-page template is seeded from the
  first real struct page, rather than being initialized directly on a
  stack-resident template object. Also add an explicit !nr_pages early
  return. Suggested by Andrew Morton.
- Rework patch 5 similarly for compound tails: seed the reusable
  tail-page template from the first real tail page, thread
  use_template through compound-page initialization, and reuse that
  prepared tail-page image for the remaining tails. Suggested by Andrew
  Morton.
- Tighten patch 6 so memcpy_streaming() maps to memcpy_flushcache() only
  when the destination alignment and size allow the transfer to stay
  entirely on the non-temporal path; other cases fall back to memcpy().
  Suggested by Andrew Morton.
- Rework patch 7 so the existing 4/8/16-byte cases remain handled
  directly in memcpy_flushcache(), while the new aligned fixed-size
  fastpaths cover only the larger 32/48/64/80/96-byte cases. Suggested
  by Andrew Morton.

For changelogs of earlier revisions, please refer to the v3 cover letter:
https://lore.kernel.org/all/20260527033636.28231-1-lizhe.67@bytedance.com/

--
2.20.1
On 2026-06-03 at 18:01 +1000, Li Zhe <lizhe.67@bytedance.com> wrote...
> Performance
> ===========
>
> nd_pmem rebind, 100 GB fsdax namespace, map=dev
> Base(v7.1-rc6):
> First binding: 1466 ms
> Average of subsequent rebinds: 262.12 ms
> Full series:
> First binding: 1359 ms
> Average of subsequent rebinds: 108.36 ms
>
> dax_pmem rebind, 100 GB devdax namespace, align=2097152
> Base(v7.1-rc6):
> First binding: 1430 ms
> Average of subsequent rebinds: 229.12 ms
> Full series:
> First binding: 1273 ms
> Average of subsequent rebinds: 100.17 ms

The results here are impressive, but I've been having trouble replicating
them with hmm_test on my local development machines. Both an older AMD
machine and a newer Arrow Lake-based machine show ~3% worse performance
with this series applied when using MEMORY_DEVICE_PRIVATE.

This is based on measuring the memremap_pages() call when inserting
test_hmm.ko in a VM, using the following hack to time ten 64 GB
memremap_pages() calls. Is there an easy way for me to replicate your
results in a VM? Or is there something in my testing that I'm missing here?

---
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 213504915737..a1d5463dbc86 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -34,7 +34,7 @@

 #define DMIRROR_NDEVICES		4
 #define DMIRROR_RANGE_FAULT_TIMEOUT	1000
-#define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
+#define DEVMEM_CHUNK_SIZE		(64 * 1024 * 1024 * 1024UL)
 #define DEVMEM_CHUNKS_RESERVE		16

 /*
@@ -565,6 +565,8 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	unsigned long pfn_last;
 	void *ptr;
 	int ret = -ENOMEM;
+	int i;
+	u64 t0, total = 0;

 	devmem = kzalloc_obj(*devmem);
 	if (!devmem)
@@ -613,6 +615,22 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 		mdevice->devmem_capacity = new_capacity;
 		mdevice->devmem_chunks = new_chunks;
 	}
+
+	for (i = 0; i < 10; i++) {
+		t0 = ktime_get_ns();
+		ptr = memremap_pages(&devmem->pagemap, numa_node_id());
+		total += ktime_get_ns() - t0;
+		if (IS_ERR_OR_NULL(ptr)) {
+			if (ptr)
+				ret = PTR_ERR(ptr);
+			else
+				ret = -EFAULT;
+			goto err_release;
+		}
+		memunmap_pages(&devmem->pagemap);
+	}
+	pr_info("avg memremap %llu ns\n", total / i);
+
 	ptr = memremap_pages(&devmem->pagemap, numa_node_id());
 	if (IS_ERR_OR_NULL(ptr)) {
 		if (ptr)
@@ -629,7 +647,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,

 	mutex_unlock(&mdevice->devmem_lock);

-	pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
+	pr_info("added new %lu MB chunk (total %u chunks, %lu MB) PFNs [0x%lx 0x%lx)\n",
 		DEVMEM_CHUNK_SIZE / (1024 * 1024),
 		mdevice->devmem_count,
 		mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
On Thu, 4 Jun 2026 18:14:05 +1000, apopple@nvidia.com wrote:
> The results here are impressive, but I've been having trouble replicating them
> with hmm_test on my local development machines. Both an older AMD machine and
> a newer Arrow Lake based machine shows ~3% worse performance with this series
> applied doing ZONE_DEVICE_PRIVATE.
>
> This is based on measuring the memremap_pages() call when inserting test_hmm.ko
> in a VM using the following hack to measure 10 64GB memremaps. Is there an easy
> way for me to replicate your results in a VM? Or is there something in my
> testing that I'm missing here?
>
> ---
>
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 213504915737..a1d5463dbc86 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -34,7 +34,7 @@
>
> #define DMIRROR_NDEVICES 4
> #define DMIRROR_RANGE_FAULT_TIMEOUT 1000
> -#define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U)
> +#define DEVMEM_CHUNK_SIZE (64 * 1024 * 1024 * 1024UL)
> #define DEVMEM_CHUNKS_RESERVE 16
>
> /*
> @@ -565,6 +565,8 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
> unsigned long pfn_last;
> void *ptr;
> int ret = -ENOMEM;
> + int i;
> + u64 t0, total = 0;
>
> devmem = kzalloc_obj(*devmem);
> if (!devmem)
> @@ -613,6 +615,22 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
> mdevice->devmem_capacity = new_capacity;
> mdevice->devmem_chunks = new_chunks;
> }
> +
> + for (i = 0; i < 10; i++) {
> + t0 = ktime_get_ns();
> + ptr = memremap_pages(&devmem->pagemap, numa_node_id());
> + total += ktime_get_ns() - t0;
> + if (IS_ERR_OR_NULL(ptr)) {
> + if (ptr)
> + ret = PTR_ERR(ptr);
> + else
> + ret = -EFAULT;
> + goto err_release;
> + }
> + memunmap_pages(&devmem->pagemap);
> + }
> + pr_info("avg memremap %llu ns\n", total / i);
> +
> ptr = memremap_pages(&devmem->pagemap, numa_node_id());
> if (IS_ERR_OR_NULL(ptr)) {
> if (ptr)
> @@ -629,7 +647,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
>
> mutex_unlock(&mdevice->devmem_lock);
>
> - pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
> + pr_info("added new %lu MB chunk (total %u chunks, %lu MB) PFNs [0x%lx 0x%lx)\n",
> DEVMEM_CHUNK_SIZE / (1024 * 1024),
> mdevice->devmem_count,
> mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
Thanks for the feedback and for sharing your test results.
I reran the measurements using two setups. I do not currently have
access to physical PMEM hardware, and the target use case for this work
is a virtualized environment, so the measurements were taken in a guest
using a 100G emulated pmem device backed by a regular file on the host
filesystem.
First, I followed your modified test_hmm.c approach, i.e. looping
over memremap_pages() / memunmap_pages() and measuring the average
memremap time in the MEMORY_DEVICE_PRIVATE case, where the vmemmap
backing comes from normal system RAM. On this setup, I got:
- base kernel: avg memremap 222.0 ms
- patches 1-5 only: avg memremap 206.9 ms
- full 8-patch series: avg memremap 264.1 ms
I also enabled the pr_debug() timing inside memmap_init_zone_device()
for the same setup, and the numbers tracked that closely:
- base kernel: 221.0 ms
- patches 1-5 only: 206.0 ms
- full 8-patch series: 260.1 ms
So on this path, patches 1-5 seem to help, but the full 8-patch series
does not.
Second, I also tested a benchmark-only setup corresponding to the
FS_DAX map=dev case, where the memmap itself is allocated from the dax
altmap range rather than normal DRAM. On that setup, I got:
- base kernel: avg memremap 200.8 ms, pr_debug 196.4 ms
- full 8-patch series: avg memremap 117.2 ms, pr_debug 113.5 ms
So on my side, the full series shows a clear gain in the
FS_DAX + altmap case, but not in the MEMORY_DEVICE_PRIVATE / DRAM-backed
vmemmap case.
If convenient, could you also try the same kind of measurement from my
cover letter, or at least enable the pr_debug() in
memmap_init_zone_device(), to check whether the delta is visible there
on your setup as well?
Also, if you have time, could you please try your modified test_hmm.c
setup with patches 1-5 only? On my side that configuration still shows
a measurable improvement.
Given these results, I would also appreciate your advice on how best
to evolve the series. My current understanding is that patches 1-5 are
a more generic optimization, while patches 6-8 are only beneficial in
some cases. Do you think patches 1-5 alone would already be a
reasonable candidate for upstreaming?
For patches 6-8, I am not yet sure what the right direction is. Would
it make more sense to expose an explicit opt-in mechanism so that the
movnti-based path is selected only when desired, or to use that path
unconditionally for the FS_DAX map=dev case?
I would also be interested in your view on why the FS_DAX + altmap
case shows a large gain while the DRAM-backed vmemmap case shows a
regression with the full series. I do not think I fully understand
that difference yet.
Thanks,
Zhe