This is the fourth RFC of the patchset to enhance page migration by
batching folio-copy operations and enabling acceleration via DMA offload.

Single-threaded, folio-by-folio copying bottlenecks page migration in
modern systems with deep memory hierarchies, especially for large folios
where copy overhead dominates, leaving significant hardware potential
untapped.

By batching the copy phase, we create an opportunity for hardware
acceleration. This series builds the framework and provides a DMA
offload driver (dcbm) as a reference implementation, targeting bulk
migration workloads where offloading the copy improves throughput
and latency while freeing CPU cycles.

See the RFC V3 cover letter [1] for motivation.
Changelog since V3:
-------------------
1. Redesigned batch migration flow: pre-copy the batch before the move
phase instead of interleaving copy with metadata updates.
The design is simpler and avoids redundancy with the existing
migrate_folios_move() path.
2. Rewrote offload registration infrastructure: Simplified the migrate
copy offload infrastructure design, fixed the srcu_read_lock()
placement and other minor bugs.
3. Added should_batch() callback to struct migrator so offload drivers can
filter which migration reasons are eligible for offload.
4. Renamed for clarity:
- CONFIG_OFFC_MIGRATION -> CONFIG_MIGRATION_COPY_OFFLOAD
- migrate_offc.[ch] -> migrate_copy_offload.[ch]
- drivers/migoffcopy/ -> drivers/migrate_offload/
- start_offloading/stop_offloading -> migrate_offload_start/stop
5. Dropped mtcopy driver to keep focus on core infrastructure and DMA
offload (for testing and reference). Multi-threaded CPU copy can
follow separately.
6. Rebased on v7.0-rc2.
DESIGN:
-------
New Migration Flow:
[ migrate_pages_batch() ]
|
|--> do_batch = should_batch(reason) // driver filters by migration reason (e.g. allow
| // NUMA balancing, skip others); evaluated once per batch
|
|--> for each folio:
| migrate_folio_unmap() // unmap the folio
| |
| +--> (success):
| if migrate_offload_enabled && do_batch && folio_supports_batch_copy():
| -> src_batch / dst_batch // batch list for copy offloading
| else:
| -> src_std / dst_std // standard lists for per-folio CPU copy
|
|--> try_to_unmap_flush() // single batched TLB flush
|
|--> Batch copy (if src_batch not empty):
| - Migrator is configurable at runtime via sysfs.
|
| static_call(migrate_offload_copy) // Pluggable Migrators
| / | \
| v v v
| [ Default ] [ DMA Offload ] [ ... ]
|
| On failure, folios fall back to per-folio CPU copy.
|
+--> migrate_folios_move() // metadata, update PTEs, finalize
(batch list with already_copied=true, std list with false)
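
The list-selection step in the flow above can be modelled in userspace
roughly as follows. This is only a sketch: struct folio_model, enum
copy_path, classify_folios() and finish_batch_copy() are hypothetical
names for illustration and do not exist in the series. The only thing
taken from the flow is the condition itself and the already_copied flag
handed to the move phase.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types exist in the kernel. */
enum copy_path { COPY_STD, COPY_BATCH };

struct folio_model {
	bool supports_batch_copy;	/* stands in for folio_supports_batch_copy() */
	enum copy_path path;		/* filled in by classify_folios() */
	bool already_copied;		/* what the move phase would be told */
};

/*
 * Put each successfully unmapped folio on the batch list or the standard
 * list. Returns the number of folios that landed on the batch list.
 */
size_t classify_folios(struct folio_model *folios, size_t n,
		       bool offload_enabled, bool do_batch)
{
	size_t nr_batch = 0;

	for (size_t i = 0; i < n; i++) {
		if (offload_enabled && do_batch &&
		    folios[i].supports_batch_copy) {
			folios[i].path = COPY_BATCH;
			nr_batch++;
		} else {
			folios[i].path = COPY_STD;
		}
		folios[i].already_copied = false;
	}
	return nr_batch;
}

/*
 * Record the outcome of the batch copy. If the copy failed, batch folios
 * keep already_copied == false and take the per-folio CPU copy instead.
 */
void finish_batch_copy(struct folio_model *folios, size_t n, bool copy_ok)
{
	for (size_t i = 0; i < n; i++)
		if (folios[i].path == COPY_BATCH)
			folios[i].already_copied = copy_ok;
}
```

Calling classify_folios() with offload_enabled or do_batch set to false
sends every folio to the standard list, which matches the unpatched
per-folio behaviour.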
Offload Registration:
Driver fills struct migrator { .name, .offload_copy, .should_batch, .owner }
and calls migrate_offload_start(). This:
- Pins the module via try_module_get()
- Patches static_call targets for offload_copy and should_batch
- Enables the migrate_offload_enabled static branch
migrate_offload_stop() disables the static branch and reverts both
static_calls, then synchronize_srcu() waits for in-flight
migrations before module_put().
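
The ordering of the start/stop steps above can be illustrated with a
small userspace model. Everything here is an assumption made for
illustration: the callback signatures, the *_model() helper names, the
one-migrator-at-a-time rule and the error codes are not taken from the
patches. Plain function pointers stand in for the static_call targets, a
bool for the static branch, and a counter for the module reference.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Field names follow the cover letter; the signatures are guesses. */
struct migrator {
	const char *name;
	int (*offload_copy)(void *dst, const void *src, size_t len);
	bool (*should_batch)(int reason);
};

/* Defaults used while no driver is registered. */
int default_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

bool default_should_batch(int reason)
{
	(void)reason;
	return false;
}

static int (*copy_target)(void *, const void *, size_t) = default_copy;
static bool (*batch_target)(int) = default_should_batch;
static bool offload_enabled;	/* models the static branch */
static int module_refs;		/* models try_module_get()/module_put() */

int migrate_offload_start_model(const struct migrator *m)
{
	if (!m || !m->offload_copy || !m->should_batch)
		return -EINVAL;
	if (offload_enabled)
		return -EBUSY;		/* assumption: one migrator at a time */
	module_refs++;			/* 1. pin the module */
	copy_target = m->offload_copy;	/* 2. patch both call targets */
	batch_target = m->should_batch;
	offload_enabled = true;		/* 3. enable the branch last */
	return 0;
}

void migrate_offload_stop_model(void)
{
	if (!offload_enabled)
		return;
	offload_enabled = false;	/* 1. disable the branch first */
	copy_target = default_copy;	/* 2. revert both call targets */
	batch_target = default_should_batch;
	/* 3. the kernel would wait here for in-flight migrations */
	module_refs--;			/* 4. drop the module reference */
}

/* Entry points a migration path would use in this model. */
bool model_should_batch(int reason)
{
	return offload_enabled && batch_target(reason);
}

int model_copy(void *dst, const void *src, size_t len)
{
	return copy_target(dst, src, len);
}

int model_module_refs(void)
{
	return module_refs;
}

/* Hypothetical driver callbacks: batch only for reason == 1. */
int demo_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

bool demo_should_batch(int reason)
{
	return reason == 1;
}
```

Enabling the branch last on start and disabling it first on stop means
the batch path is only reachable while both call targets point at the
driver.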
PERFORMANCE RESULTS:
--------------------
System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
1 NUMA node per socket, v7.0-rc2, DVFS set to Performance, PTDMA hardware.
Benchmark: move_pages() syscall to move pages between two NUMA nodes.
1. Moving folios of different sizes with a constant total transfer size
(1GB) and a varying number of DMA channels. Throughput in GB/s.
a. Baseline (vanilla kernel: v7.0-rc2, single-threaded, serial folio_copy):
============================================================================================
| 4K | 16K | 64K | 256K | 1M | 2M |
============================================================================================
|3.55±0.19 | 5.66±0.30 | 6.16±0.09 | 7.12±0.83 | 6.93±0.09 | 10.88±0.19 |
b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
============================================================================================
Channel Cnt| 4K | 16K | 64K | 256K | 1M | 2M |
============================================================================================
1 | 2.63±0.26 | 2.92±0.09 | 3.16±0.13 | 4.75±0.70 | 7.38±0.18 | 12.64±0.07 |
2 | 3.20±0.12 | 4.68±0.17 | 5.16±0.36 | 7.42±1.00 | 8.05±0.05 | 14.40±0.10 |
4 | 3.78±0.16 | 6.45±0.06 | 7.36±0.18 | 9.70±0.11 | 11.68±2.37 | 27.16±0.20 |
8 | 4.32±0.24 | 8.20±0.45 | 9.45±0.26 | 12.99±2.87 | 13.18±0.08 | 46.17±0.67 |
12 | 4.35±0.16 | 8.80±0.09 | 11.65±2.71 | 15.46±4.95 | 14.69±4.10 | 60.89±0.68 |
16 | 4.40±0.19 | 9.25±0.13 | 11.02±0.26 | 13.56±0.15 | 18.04±7.11 | 66.86±0.81 |
- DMA offload with 16 channels achieves ~6x speedup for 2MB folios.
- Larger folios benefit more; small folios are DMA-setup bound.
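
For reference, the speedup figures follow directly from the mean
throughputs in the two tables. The helper below is a small sketch that
divides one DMA row by the baseline row; the array and enum names are
made up for this example, and the ± deviations are ignored.

```c
#include <stddef.h>

/* Column indices for the folio sizes in the tables above. */
enum { F_4K, F_16K, F_64K, F_256K, F_1M, F_2M, F_NR };

/* Mean throughput in GB/s, copied from the tables above. */
const double baseline_gbps[F_NR] = { 3.55, 5.66, 6.16, 7.12, 6.93, 10.88 };
const double dma1_gbps[F_NR]     = { 2.63, 2.92, 3.16, 4.75, 7.38, 12.64 };
const double dma16_gbps[F_NR]    = { 4.40, 9.25, 11.02, 13.56, 18.04, 66.86 };

/* Speedup of one DMA configuration over the baseline for one folio size. */
double speedup(const double *dma_row, size_t idx)
{
	return dma_row[idx] / baseline_gbps[idx];
}
```

speedup(dma16_gbps, F_2M) gives the ~6x figure quoted above, while
speedup(dma1_gbps, F_4K) is below 1.0, which is the setup-bound case for
small folios.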
2. Varying total move size (folio count) for fixed 2MB folio size,
single DMA channel. Throughput (GB/s):
2MB Folios | Baseline | DMA
=================================
1 | 7.34 | 6.17
8 | 8.27 | 8.85
16 | 7.56 | 9.12
32 | 8.39 | 11.73
64 | 9.37 | 12.18
256 | 10.58 | 12.50
512 | 10.78 | 12.68
1024 | 10.77 | 12.76
2048 | 10.87 | 12.81
8192 | 10.84 | 12.82
- Throughput increases with batch size but plateaus after ~64 folios.
- Even a single DMA channel outperforms baseline for batch sizes >= 8 folios.
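
The two observations above can be read off the table mechanically. The
sketch below does that; struct sample and both helper names are
hypothetical, and the 95%-of-best threshold used for "plateau" is an
arbitrary choice for this example, not something defined by the series.

```c
#include <stddef.h>

struct sample {
	unsigned int folios;	/* number of 2MB folios moved */
	double baseline;	/* GB/s, vanilla kernel */
	double dma;		/* GB/s, single DMA channel */
};

/* Values copied from the table above. */
const struct sample batch_results[] = {
	{ 1, 7.34, 6.17 },	{ 8, 8.27, 8.85 },
	{ 16, 7.56, 9.12 },	{ 32, 8.39, 11.73 },
	{ 64, 9.37, 12.18 },	{ 256, 10.58, 12.50 },
	{ 512, 10.78, 12.68 },	{ 1024, 10.77, 12.76 },
	{ 2048, 10.87, 12.81 },	{ 8192, 10.84, 12.82 },
};
const size_t nr_batch_results =
	sizeof(batch_results) / sizeof(batch_results[0]);

/* Smallest measured batch at which DMA beats the baseline, or 0 if none. */
unsigned int dma_break_even(const struct sample *s, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (s[i].dma > s[i].baseline)
			return s[i].folios;
	return 0;
}

/* Smallest measured batch whose DMA throughput reaches frac * best. */
unsigned int dma_plateau(const struct sample *s, size_t n, double frac)
{
	double best = 0.0;

	for (size_t i = 0; i < n; i++)
		if (s[i].dma > best)
			best = s[i].dma;
	for (size_t i = 0; i < n; i++)
		if (s[i].dma >= frac * best)
			return s[i].folios;
	return 0;
}
```

With the data above, dma_break_even() returns 8 and dma_plateau() with
frac = 0.95 returns 64. Both operate only on the measured points, so the
real break-even lies somewhere between 1 and 8 folios.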
EARLIER POSTINGS:
-----------------
[1] RFC V3: https://lore.kernel.org/all/20250923174752.35701-1-shivankg@amd.com
[2] RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
[3] RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
[4] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
RELATED DISCUSSIONS:
-------------------
[5] MM-alignment Session [Nov 12, 2025]:
https://lore.kernel.org/linux-mm/bd6a3c75-b9f0-cbcf-f7c4-1ef5dff06d24@google.com/
[6] Linux Memory Hotness and Promotion call [Nov 6, 2025]:
https://lore.kernel.org/linux-mm/8ff2fd10-c9ac-4912-cf56-7ecd4afd2770@google.com/
[7] LSFMM 2025:
https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
[8] OSS India:
https://ossindia2025.sched.com/event/23Jk1
Git Tree: https://github.com/shivankgarg98/linux/commits/shivank/V4_migrate_pages_optimization_precopy
Thanks to everyone who reviewed, tested or participated in discussions
around this series. Your feedback helped me throughout the development
process.
Best Regards,
Shivank
Shivank Garg (5):
mm: introduce folios_mc_copy() for batch folio copying
mm/migrate: skip data copy for already-copied folios
mm/migrate: add batch-copy path in migrate_pages_batch
mm/migrate: add copy offload registration infrastructure
drivers/migrate_offload: add DMA batch copy driver (dcbm)
Zi Yan (1):
mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing
drivers/Kconfig | 2 +
drivers/Makefile | 2 +
drivers/migrate_offload/Kconfig | 8 +
drivers/migrate_offload/Makefile | 1 +
drivers/migrate_offload/dcbm/Makefile | 1 +
drivers/migrate_offload/dcbm/dcbm.c | 457 ++++++++++++++++++++++++++
include/linux/migrate_copy_offload.h | 34 ++
include/linux/mm.h | 2 +
mm/Kconfig | 9 +
mm/Makefile | 1 +
mm/migrate.c | 133 ++++++--
mm/migrate_copy_offload.c | 99 ++++++
mm/util.c | 31 ++
13 files changed, 748 insertions(+), 32 deletions(-)
create mode 100644 drivers/migrate_offload/Kconfig
create mode 100644 drivers/migrate_offload/Makefile
create mode 100644 drivers/migrate_offload/dcbm/Makefile
create mode 100644 drivers/migrate_offload/dcbm/dcbm.c
create mode 100644 include/linux/migrate_copy_offload.h
create mode 100644 mm/migrate_copy_offload.c
--
2.43.0
On 3/9/2026 5:37 PM, Shivank Garg wrote:
> This is the fourth RFC of the patchset to enhance page migration by
> batching folio-copy operations and enabling acceleration via DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration in
> modern systems with deep memory hierarchies, especially for large folios
> where copy overhead dominates, leaving significant hardware potential
> untapped.
>
> By batching the copy phase, we create an opportunity for hardware
> acceleration. This series builds the framework and provides a DMA
> offload driver (dcbm) as a reference implementation, targeting bulk
> migration workloads where offloading the copy improves throughput
> and latency while freeing the CPU cycles.
>
[snip]
> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
> 1 NUMA node per socket, v7.0-rc2, DVFS set to Performance, PTDMA hardware.
>
> a. Baseline (vanilla kernel: v7.0-rc2, single-threaded, serial folio_copy):
>
> ============================================================================================
> | 4K | 16K | 64K | 256K | 1M | 2M |
> ============================================================================================
> |3.55±0.19 | 5.66±0.30 | 6.16±0.09 | 7.12±0.83 | 6.93±0.09 | 10.88±0.19 |
>
> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>
> ============================================================================================
> Channel Cnt| 4K | 16K | 64K | 256K | 1M | 2M |
> ============================================================================================
> 1 | 2.63±0.26 | 2.92±0.09 | 3.16±0.13 | 4.75±0.70 | 7.38±0.18 | 12.64±0.07 |
> 2 | 3.20±0.12 | 4.68±0.17 | 5.16±0.36 | 7.42±1.00 | 8.05±0.05 | 14.40±0.10 |
> 4 | 3.78±0.16 | 6.45±0.06 | 7.36±0.18 | 9.70±0.11 | 11.68±2.37 | 27.16±0.20 |
> 8 | 4.32±0.24 | 8.20±0.45 | 9.45±0.26 | 12.99±2.87 | 13.18±0.08 | 46.17±0.67 |
> 12 | 4.35±0.16 | 8.80±0.09 | 11.65±2.71 | 15.46±4.95 | 14.69±4.10 | 60.89±0.68 |
> 16 | 4.40±0.19 | 9.25±0.13 | 11.02±0.26 | 13.56±0.15 | 18.04±7.11 | 66.86±0.81 |
I ran experiments to evaluate DMA offload for memory compaction page migration on the system described above.
Each NUMA node has ~250GB of memory. I bind everything to Node 1 (CPU 32) and keep background MM daemons disabled.
The experiment has two phases: fragmentation and compaction (migration).
1. Memory Fragmentation
I allocate ~248GB of anonymous memory on Node 1 and touch every page to
ensure physical backing. Then, for each 2MB-aligned region (512
contiguous 4KB pages), I free 50% of pages at evenly-spaced offsets using
MADV_DONTNEED. The freed pages return to the buddy allocator, but the
remaining 256 occupied pages in each region prevent merging into higher
order blocks.
After this, Node 1 is fully fragmented with 50% of its memory free, which
means every hugepage allocation requires compaction.
[ ] [X] [ ] [X] [ ] [X] [ ] [X] [ ] [X] [ ] [X] ...
The fragmenter process stays alive throughout the measurement, with
oom_score_adj=-1000 to prevent the OOM killer from targeting it.
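
A minimal userspace sketch of the hole-punching step is shown below. It
is not the actual fragmenter: NUMA binding, the oom_score_adj setting and
the ~248GB sizing are left out, the helper names are made up, and it
assumes "evenly-spaced offsets" means every other page, as in the
diagram above.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Base page size of the running system. */
size_t sys_page_size(void)
{
	return (size_t)sysconf(_SC_PAGESIZE);
}

/* Write one byte per page so every page gets physical backing. */
void touch_pages(unsigned char *base, size_t len, size_t page_size)
{
	for (size_t off = 0; off < len; off += page_size)
		base[off] = 0xAA;
}

/*
 * Free every other page of [base, base + len) with MADV_DONTNEED,
 * starting with the first one. The pages in between stay populated,
 * so the freed pages cannot merge into higher-order blocks.
 * Returns the number of pages freed, or -1 if madvise() fails.
 */
long punch_holes(unsigned char *base, size_t len, size_t page_size)
{
	long freed = 0;

	for (size_t off = 0; off + page_size <= len; off += 2 * page_size) {
		if (madvise(base + off, page_size, MADV_DONTNEED) != 0)
			return -1;
		freed++;
	}
	return freed;
}
```

For a private anonymous mapping of 512 pages, touch_pages() followed by
punch_holes() frees 256 pages. A freed page reads back as zero on the
next access, while its neighbours keep the 0xAA marker.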
2. Compaction Trigger
To benchmark compaction in a reproducible way, I use a kernel module that
calls alloc_pages_node() in a tight loop for the target node. Each
allocation enters the slow path:
__alloc_pages_slowpath() -> try_to_compact_pages() -> compact_zone() -> migrate_pages(),
performing page migration under MR_COMPACTION. The allocation is pinned
to CPU 32 on Node 1.
Target: Allocate **16384** order-9 pages (32GB), producing ~4.5 million
4KB page migrations per run.
3. CPU Contention (Busy System)
To emulate a busy system, I run a CPU-hogging process on the same CPU
as compaction:
while (run) { counter++; __asm__ volatile("" : "+r"(counter)); }
Both compaction and the hog are pinned to CPU 32, so they compete for the
same core, emulating a real-world scenario where compaction shares CPU
time with application workloads.
I measure the following metrics:
1. Wall time: elapsed time for all hugepage allocations
2. Pages migrated: delta of /proc/vmstat counters (pgmigrate_success)
3. DMA copies: DCBM sysfs counter (folios_migrated)
4. CPU utilization: /proc/stat for the pinned CPU (user%, sys%, idle%) during the run
5. Hog iterations (busy modes): total loop count of the CPU-hog process
Experiment Results:
I run four configurations, each after a fresh reboot to avoid buddy
allocator state degradation between runs: baseline (vanilla kernel) and
DMA (migration offload enabled), each on an idle and a busy system.
Mode Wall time(ms) Migrated DMA_Copy Hog_Iters User% Sys% Idle%
--------------------------------------------------------------------------------------------
1 baseline 16708 4563506 - - 0.00% 99.40% 0.29%
2 dma 18887 4622952 4623181 - 0.00% 76.65% 22.55%
3 busy-baseline 33256 4599846 - 62300165085 49.90% 49.75% 0.06%
4 busy-dma 32475 4602750 4604672 66022189744 56.32% 42.97% 0.06%
Inference:
1. On an idle system, wall time increases with DMA (~13%) because the
current compaction batch size (COMPACT_CLUSTER_MAX = 32 pages) is
too small for DMA to amortize its setup cost. However, kernel sys%
drops from 99.4% to 76.7%, freeing 22.5% of CPU time.
2. On a busy system, wall time decreases slightly (~2.3%) and the hog
process accumulates 6% more iterations with DMA offload. The CPU
time freed during DMA transfers goes directly to the competing
userspace workload.
This suggests that DMA offload for compaction benefits busy systems with
high fragmentation.
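
The percentages quoted in the inference are plain relative changes
between rows of the results table. A small helper (hypothetical name)
makes the arithmetic explicit:

```c
/* Relative change from `before` to `after`, in percent. */
double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

pct_change(16708, 18887) is about +13.0 (idle wall time),
pct_change(33256, 32475) is about -2.3 (busy wall time), and
pct_change(62300165085.0, 66022189744.0) is about +6.0 (hog iterations).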
Note:
Tuning the compaction algorithm for larger DMA batches and using DMA
hardware optimized for small-size transfers should improve the results
further.
Thanks,
Shivank