[RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
Posted by Shivank Garg 1 week, 1 day ago
This is the third RFC of the patchset to enhance page migration by batching
folio-copy operations and enabling acceleration via multi-threaded CPU or
DMA offload.

Single-threaded, folio-by-folio copying bottlenecks page migration
in modern systems with deep memory hierarchies, especially for large
folios where copy overhead dominates, leaving significant hardware
potential untapped. 

By batching the copy phase, we create an opportunity for significant
hardware acceleration. This series builds a framework for this acceleration
and provides two initial offload driver implementations: one using multiple
CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).

This version incorporates significant feedback to improve correctness,
robustness, and the efficiency of the DMA offload path.

Changelog since V2:

1. DMA Engine Rewrite:
   - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
   - Single completion interrupt per batch (reduced overhead)
   - Order of magnitude improvement in setup time for large batches
2. Code cleanups and refactoring
3. Rebased on latest mainline (6.17-rc6+)

MOTIVATION:
-----------

Current Migration Flow:
[ move_pages(), Compaction, Tiering, etc. ]
              |
              v
     [ migrate_pages() ] // Common entry point
              |
              v
    [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
      |
      |--> [ migrate_folio_unmap() ]
      |
      |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
      |
      |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
           - For each folio:
             - Metadata prep: Copy flags, mappings, etc.
             - folio_copy()  <-- Single-threaded, serial data copy.
             - Update PTEs & finalize for that single folio.
             
Understanding overheads in page migration (move_pages() syscall):

Total move_pages() overheads = folio_copy() + Other overheads
1. folio_copy() is the core copy operation that interests us.
2. The remaining overheads come from user/kernel transitions, page table walks,
locking, folio unmapping, destination folio allocation, TLB flushes, copying
flags, and updating mappings and PTEs.

Percentage of move_pages(N pages) syscall time spent in folio_copy(),
by number of pages migrated and folio size:

             4KB     2MB
1 page       <1%     ~66%
512 pages    ~35%    ~97%

Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
substantial performance opportunity.

move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
Where F is the fraction of time spent in folio_copy() and S is the speedup of
folio_copy().

For 4KB folios, the folio copy overhead in single-page migrations is too small
to impact overall speedup; even for 512 pages, the maximum theoretical speedup
is limited to ~1.54x with infinite folio_copy() speedup.

For 2MB THPs, the folio copy overhead is significant even in single-page
migrations, with a theoretical speedup of ~3x with infinite folio_copy()
speedup and up to ~33x for 512 pages.

A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload,
based on my measurements for copying 512 2MB pages. This gives move_pages()
a practical speedup of 6.3x for 512 2MB pages (also observed in the
experiments below).
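
As a sanity check, plugging the measured values into the formula above
(F ≈ 0.97 from the table, S = 7.5) gives
1 / ((1 - 0.97) + (0.97 / 7.5)) ≈ 1 / 0.159 ≈ 6.3x.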

DESIGN: A Pluggable Migrator Framework
---------------------------------------

Introduce migrate_folios_batch_move():

[ migrate_pages_batch() ]
    |
    |--> migrate_folio_unmap()
    |      
    |--> try_to_unmap_flush()
    |
    +--> [ migrate_folios_batch_move() ] // new batched design
            |
            |--> Metadata migration
            |    - Metadata prep: Copy flags, mappings, etc.
            |    - Use MIGRATE_NO_COPY to skip the actual data copy.
            |
            |--> Batch copy folio data
            |    - Migrator is configurable at runtime via sysfs.
            |
            |          static_call(_folios_copy) // Pluggable migrators
            |          /          |            \
            |         v           v             v
            | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
            |
            +--> Update PTEs to point to dst folios and complete migration.


User Control of Migrator:

# echo 1 > /sys/kernel/dcbm/offloading
   |
   +--> Driver's sysfs handler
        |
        +--> calls start_offloading(&cpu_migrator)
              |
              +--> calls offc_update_migrator()
                    |
                    +--> static_call_update(_folios_copy, mig->migrate_offc)

Later, During Migration ...
migrate_folios_batch_move()
    |
    +--> static_call(_folios_copy) // Now dispatches to the selected migrator
          |
          +-> [ mtcopy | dcbm | kernel_default ]
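
For reference, a minimal sketch of what the static_call plumbing above might
look like. The struct layout and signatures are assumptions inferred from the
flow, not the exact contents of include/linux/migrate_offc.h:

#include <linux/list.h>
#include <linux/static_call.h>

struct migrator {
        const char *name;
        /* batch copy: copy each folio on src_list to its pair on dst_list */
        int (*migrate_offc)(struct list_head *dst_list,
                            struct list_head *src_list);
};

/* kernel-default batch copy (folios_mc_copy, patch 2) backs the call initially */
int folios_mc_copy(struct list_head *dst_list, struct list_head *src_list);
DEFINE_STATIC_CALL(_folios_copy, folios_mc_copy);

/* reached via start_offloading() from a driver's sysfs store handler */
void offc_update_migrator(struct migrator *mig)
{
        static_call_update(_folios_copy, mig->migrate_offc);
}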


PERFORMANCE RESULTS:
--------------------

System Info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
1 NUMA node per socket, Linux Kernel 6.16.0-rc6, DVFS set to Performance,
PTDMA hardware.

Benchmark: Use move_pages() syscall to move pages between two NUMA nodes.
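
For reference, a stripped-down sketch of this kind of microbenchmark (node
numbers, buffer size, and error handling are simplified; this is not the
exact test program used for the numbers below):

/* gcc -O2 -o movebench movebench.c -lnuma */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define NR_PAGES 262144UL               /* 1GB of 4KB pages */

int main(void)
{
        size_t page_sz = 4096;
        char *buf = mmap(NULL, NR_PAGES * page_sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        void **pages = malloc(NR_PAGES * sizeof(*pages));
        int *nodes = malloc(NR_PAGES * sizeof(*nodes));
        int *status = malloc(NR_PAGES * sizeof(*status));
        unsigned long i;

        memset(buf, 1, NR_PAGES * page_sz);     /* fault all pages in */
        for (i = 0; i < NR_PAGES; i++) {
                pages[i] = buf + i * page_sz;
                nodes[i] = 1;                   /* destination NUMA node */
        }

        /* time this call; throughput = bytes moved / elapsed time */
        if (move_pages(0, NR_PAGES, pages, nodes, status, MPOL_MF_MOVE))
                perror("move_pages");
        return 0;
}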

1. Moving different-sized folios (4KB, 16KB, ..., 2MB) such that the total transfer
size is constant (1GB), with different numbers of parallel threads/channels.
Metric: Throughput is reported in GB/s.

a. Baseline (Vanilla kernel, single-threaded, folio-by-folio migration):

Folio size|4K       | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
===============================================================================================================
Tput(GB/s)|3.73±0.33| 5.53±0.36 | 5.90±0.56  | 6.34±0.08  | 6.50±0.05  | 6.86±0.61  | 6.92±0.71  | 10.67±0.36 |

b. Multi-threaded CPU copy offload (mtcopy driver, using N parallel threads = 1, 2, 4, 8, 12, 16):

Thread | 4K         | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
===============================================================================================================
1      | 3.84±0.10  | 5.23±0.31 | 6.01±0.55  | 6.34±0.60  | 7.16±1.00  | 7.12±0.78  | 7.10±0.86  | 10.94±0.13 |
2      | 4.04±0.19  | 6.72±0.38 | 7.68±0.12  | 8.15±0.06  | 8.45±0.09  | 9.29±0.17  | 9.87±1.01  | 17.80±0.12 |
4      | 4.72±0.21  | 8.41±0.70 | 10.08±1.67 | 11.44±2.42 | 10.45±0.17 | 12.60±1.97 | 12.38±1.73 | 31.41±1.14 |
8      | 4.91±0.28  | 8.62±0.13 | 10.40±0.20 | 13.94±3.75 | 11.03±0.61 | 14.96±3.29 | 12.84±0.63 | 33.50±3.29 |
12     | 4.84±0.24  | 8.75±0.08 | 10.16±0.26 | 10.92±0.22 | 11.72±0.14 | 14.02±2.51 | 14.09±2.65 | 34.70±2.38 |
16     | 4.77±0.22  | 8.95±0.69 | 10.36±0.26 | 11.03±0.22 | 11.58±0.30 | 13.88±2.71 | 13.00±0.75 | 35.89±2.07 |

c. DMA offload (dcbm driver, using N DMA channels = 1, 2, 4, 8, 12, 16):

Chan Cnt| 4K        | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
===============================================================================================================
1      | 2.75±0.19  | 2.86±0.13 | 3.28±0.20  | 4.57±0.72  | 5.03±0.62  | 4.69±0.25  | 4.78±0.34  | 12.50±0.24 |
2      | 3.35±0.19  | 4.57±0.19 | 5.35±0.55  | 6.71±0.71  | 7.40±1.07  | 7.38±0.61  | 7.21±0.73  | 14.23±0.34 |
4      | 4.01±0.17  | 6.36±0.26 | 7.71±0.89  | 9.40±1.35  | 10.27±1.96 | 10.60±1.42 | 12.35±2.64 | 26.84±0.91 |
8      | 4.46±0.16  | 7.74±0.13 | 9.72±1.29  | 10.88±0.16 | 12.12±2.54 | 15.62±3.96 | 13.29±2.65 | 45.27±2.60 |
12     | 4.60±0.22  | 8.90±0.84 | 11.26±2.19 | 16.00±4.41 | 14.90±4.38 | 14.57±2.84 | 13.79±3.18 | 59.94±4.19 |
16     | 4.61±0.25  | 9.08±0.79 | 11.14±1.75 | 13.95±3.85 | 13.69±3.39 | 15.47±3.44 | 15.44±4.65 | 63.69±5.01 |

- Throughput increases with folio size. Larger folios benefit more from DMA.
- Scaling shows diminishing returns beyond 8-12 threads/channels.
- Multi-threading and DMA offloading both provide significant gains: up to 3.4x and 6x, respectively.

2. Varying the total move size (folio count = 1, 8, ..., 8192) for a fixed folio size
   of 2MB, using only a single thread/channel. Throughput is reported in GB/s.

folio_cnt | Baseline    | MTCPU      | DMA 
====================================================
1         | 7.96±2.22   | 6.43±0.66  | 6.52±0.45   |
8         | 8.20±0.75   | 8.82±1.10  | 8.88±0.54   |
16        | 7.54±0.61   | 9.06±0.95  | 9.03±0.62   |
32        | 8.68±0.77   | 10.11±0.42 | 10.17±0.50  |
64        | 9.08±1.03   | 10.12±0.44 | 11.21±0.24  |
256       | 10.53±0.39  | 10.77±0.28 | 12.43±0.12  |
512       | 10.59±0.29  | 10.81±0.19 | 12.61±0.07  |
2048      | 10.86±0.26  | 11.05±0.05 | 12.75±0.03  |
8192      | 10.84±0.18  | 11.12±0.05 | 12.81±0.02  |

- Throughput increases with folio count but plateaus after a threshold.
  (migrate_pages() batches at most NR_MAX_BATCHED_MIGRATION (512) folios at a time.)

Performance Analysis (V2 vs V3):

The new SG-based DMA driver dramatically reduces software overhead. By
switching from per-folio dma_map_page() to batch dma_map_sgtable(), setup
time improves by an order of magnitude for large batches.
This is most visible with 4KB folios, making DMA viable even for smaller
page sizes. For 2MB THP migrations, where hardware transfer time is more
dominant, the gains are more modest.
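
For illustration, a rough sketch of the batched mapping idea (a hypothetical
helper, not the actual dcbm.c code; it only shows the source side and skips
error unwinding):

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

static int dcbm_map_src_folios(struct device *dev, struct folio **folios,
                               unsigned int nr, struct sg_table *sgt)
{
        struct scatterlist *sg;
        unsigned int i;
        int ret;

        ret = sg_alloc_table(sgt, nr, GFP_KERNEL);
        if (ret)
                return ret;

        sg = sgt->sgl;
        for (i = 0; i < nr; i++) {
                sg_set_page(sg, folio_page(folios[i], 0),
                            folio_size(folios[i]), 0);
                sg = sg_next(sg);
        }

        /* one map (and later one unmap) for the whole batch */
        return dma_map_sgtable(dev, sgt, DMA_TO_DEVICE, 0);
}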

OPEN QUESTIONS:
---------------

User-Interface:

1. Control Interface Design:
The current interface creates separate sysfs files for each driver, which can
be confusing. Should we instead implement a unified interface
(/sys/kernel/mm/migration/offload_migrator) that accepts the name of the
desired migrator ("kernel", "mtcopy", "dcbm")? This would ensure only one
migrator is active at a time. Is this the right approach?

2. Dynamic Migrator Selection:
Currently, the active migrator is global state, and only one can be active at
a time. A more flexible approach might be for the caller of migrate_pages()
to specify or hint which offload mechanism to use, if any. This would allow a
CXL driver to explicitly request DMA while a GPU driver might prefer
multi-threaded CPU copy.

3. Tuning Parameters:
Should we expose parameters such as the number of threads/channels, batch
size, and thresholds for engaging a migrator? Who should own these parameters?

4. Resource Accounting [3]:
a. CPU cgroups accounting and fairness
b. Migration cost attribution

FUTURE WORK:
------------

1. Enhance DMA drivers for bulk copying (e.g., SDXi Engine).
2. Enhance multi-threaded CPU copying with platform-aware scheduling of worker
   threads to optimize bandwidth utilization. Explore sched-ext for this. [2]
3. Enable kpromoted [4] to use the migration offload infrastructure.

EARLIER POSTINGS:
-----------------

- RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
- RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com

REFERENCES:
-----------

[1] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
[2] LSFMM: https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
[3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
[4] https://lore.kernel.org/all/20250910144653.212066-1-bharata@amd.com

Mike Day (1):
  mm: add support for copy offload for folio Migration

Shivank Garg (4):
  mm: Introduce folios_mc_copy() for batch copying folios
  mm/migrate: add migrate_folios_batch_move to  batch the folio move
    operations
  dcbm: add dma core batch migrator for batch page offloading
  mtcopy: spread threads across die for testing

Zi Yan (4):
  mm/migrate: factor out code in move_to_new_folio() and
    migrate_folio_move()
  mm/migrate: revive MIGRATE_NO_COPY in migrate_mode
  mtcopy: introduce multi-threaded page copy routine
  adjust NR_MAX_BATCHED_MIGRATION for testing

 drivers/Kconfig                        |   2 +
 drivers/Makefile                       |   3 +
 drivers/migoffcopy/Kconfig             |  17 +
 drivers/migoffcopy/Makefile            |   2 +
 drivers/migoffcopy/dcbm/Makefile       |   1 +
 drivers/migoffcopy/dcbm/dcbm.c         | 415 +++++++++++++++++++++++++
 drivers/migoffcopy/mtcopy/Makefile     |   1 +
 drivers/migoffcopy/mtcopy/copy_pages.c | 397 +++++++++++++++++++++++
 include/linux/migrate_mode.h           |   2 +
 include/linux/migrate_offc.h           |  34 ++
 include/linux/mm.h                     |   2 +
 mm/Kconfig                             |   8 +
 mm/Makefile                            |   1 +
 mm/migrate.c                           | 358 ++++++++++++++++++---
 mm/migrate_offc.c                      |  58 ++++
 mm/util.c                              |  29 ++
 16 files changed, 1284 insertions(+), 46 deletions(-)
 create mode 100644 drivers/migoffcopy/Kconfig
 create mode 100644 drivers/migoffcopy/Makefile
 create mode 100644 drivers/migoffcopy/dcbm/Makefile
 create mode 100644 drivers/migoffcopy/dcbm/dcbm.c
 create mode 100644 drivers/migoffcopy/mtcopy/Makefile
 create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
 create mode 100644 include/linux/migrate_offc.h
 create mode 100644 mm/migrate_offc.c

-- 
2.43.0

Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
Posted by Huang, Ying 1 week ago
Hi, Shivank,

Thanks for working on this!

Shivank Garg <shivankg@amd.com> writes:

> This is the third RFC of the patchset to enhance page migration by batching
> folio-copy operations and enabling acceleration via multi-threaded CPU or
> DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration
> in modern systems with deep memory hierarchies, especially for large
> folios where copy overhead dominates, leaving significant hardware
> potential untapped. 
>
> By batching the copy phase, we create an opportunity for significant
> hardware acceleration. This series builds a framework for this acceleration
> and provides two initial offload driver implementations: one using multiple
> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>
> This version incorporates significant feedback to improve correctness,
> robustness, and the efficiency of the DMA offload path.
>
> Changelog since V2:
>
> 1. DMA Engine Rewrite:
>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>    - Single completion interrupt per batch (reduced overhead)
>    - Order of magnitude improvement in setup time for large batches
> 2. Code cleanups and refactoring
> 3. Rebased on latest mainline (6.17-rc6+)
>
> MOTIVATION:
> -----------
>
> Current Migration Flow:
> [ move_pages(), Compaction, Tiering, etc. ]
>               |
>               v
>      [ migrate_pages() ] // Common entry point
>               |
>               v
>     [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>       |
>       |--> [ migrate_folio_unmap() ]
>       |
>       |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>       |
>       |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>            - For each folio:
>              - Metadata prep: Copy flags, mappings, etc.
>              - folio_copy()  <-- Single-threaded, serial data copy.
>              - Update PTEs & finalize for that single folio.
>              
> Understanding overheads in page migration (move_pages() syscall):
>
> Total move_pages() overheads = folio_copy() + Other overheads
> 1. folio_copy() is the core copy operation that interests us.
> 2. The remaining operations are user/kernel transitions, page table walks,
> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
> mappings and PTEs etc. that contribute to the remaining overheads.
>
> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
> Number of pages being migrated and folio size:
>             4KB     2MB
> 1 page     <1%     ~66%
> 512 page   ~35%    ~97%
>
> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
> substantial performance opportunity.
>
> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
> Where F is the fraction of time spent in folio_copy() and S is the speedup of
> folio_copy().
>
> For 4KB folios, folio copy overheads are significantly small in single-page
> migrations to impact overall speedup, even for 512 pages, maximum theoretical
> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>
> For 2MB THPs, folio copy overheads are significant even in single page
> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
> speedup and up to ~33x for 512 pages.
>
> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
> based on my measurements for copying 512 2MB pages.
> This gives move_pages(), a practical speedup of 6.3x for 512 2MB page (also
> observed in the experiments below).
>
> DESIGN: A Pluggable Migrator Framework
> ---------------------------------------
>
> Introduce migrate_folios_batch_move():
>
> [ migrate_pages_batch() ]
>     |
>     |--> migrate_folio_unmap()
>     |      
>     |--> try_to_unmap_flush()
>     |
>     +--> [ migrate_folios_batch_move() ] // new batched design
>             |
>             |--> Metadata migration
>             |    - Metadata prep: Copy flags, mappings, etc.
>             |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>             |
>             |--> Batch copy folio data
>             |    - Migrator is configurable at runtime via sysfs.
>             |
>             |          static_call(_folios_copy) // Pluggable migrators
>             |          /          |            \
>             |         v           v             v
>             | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
>             |
>             +--> Update PTEs to point to dst folios and complete migration.
>

I just jumped into the discussion, so this may have been discussed before.
Sorry if so.  Why not

migrate_folios_unmap()
try_to_unmap_flush()
copy folios in parallel if possible
migrate_folios_move(): with MIGRATE_NO_COPY?

> User Control of Migrator:
>
> # echo 1 > /sys/kernel/dcbm/offloading
>    |
>    +--> Driver's sysfs handler
>         |
>         +--> calls start_offloading(&cpu_migrator)
>               |
>               +--> calls offc_update_migrator()
>                     |
>                     +--> static_call_update(_folios_copy, mig->migrate_offc)
>
> Later, During Migration ...
> migrate_folios_batch_move()
>     |
>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>           |
>           +-> [ mtcopy | dcbm | kernel_default ]
>

[snip]

---
Best Regards,
Huang, Ying
Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
Posted by Zi Yan 1 week ago
On 23 Sep 2025, at 21:49, Huang, Ying wrote:

> Hi, Shivank,
>
> Thanks for working on this!
>
> Shivank Garg <shivankg@amd.com> writes:
>
>> This is the third RFC of the patchset to enhance page migration by batching
>> folio-copy operations and enabling acceleration via multi-threaded CPU or
>> DMA offload.
>>
>> Single-threaded, folio-by-folio copying bottlenecks page migration
>> in modern systems with deep memory hierarchies, especially for large
>> folios where copy overhead dominates, leaving significant hardware
>> potential untapped.
>>
>> By batching the copy phase, we create an opportunity for significant
>> hardware acceleration. This series builds a framework for this acceleration
>> and provides two initial offload driver implementations: one using multiple
>> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>>
>> This version incorporates significant feedback to improve correctness,
>> robustness, and the efficiency of the DMA offload path.
>>
>> Changelog since V2:
>>
>> 1. DMA Engine Rewrite:
>>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>>    - Single completion interrupt per batch (reduced overhead)
>>    - Order of magnitude improvement in setup time for large batches
>> 2. Code cleanups and refactoring
>> 3. Rebased on latest mainline (6.17-rc6+)
>>
>> MOTIVATION:
>> -----------
>>
>> Current Migration Flow:
>> [ move_pages(), Compaction, Tiering, etc. ]
>>               |
>>               v
>>      [ migrate_pages() ] // Common entry point
>>               |
>>               v
>>     [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>>       |
>>       |--> [ migrate_folio_unmap() ]
>>       |
>>       |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>>       |
>>       |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>>            - For each folio:
>>              - Metadata prep: Copy flags, mappings, etc.
>>              - folio_copy()  <-- Single-threaded, serial data copy.
>>              - Update PTEs & finalize for that single folio.
>>
>> Understanding overheads in page migration (move_pages() syscall):
>>
>> Total move_pages() overheads = folio_copy() + Other overheads
>> 1. folio_copy() is the core copy operation that interests us.
>> 2. The remaining operations are user/kernel transitions, page table walks,
>> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
>> mappings and PTEs etc. that contribute to the remaining overheads.
>>
>> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
>> Number of pages being migrated and folio size:
>>             4KB     2MB
>> 1 page     <1%     ~66%
>> 512 page   ~35%    ~97%
>>
>> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
>> substantial performance opportunity.
>>
>> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
>> Where F is the fraction of time spent in folio_copy() and S is the speedup of
>> folio_copy().
>>
>> For 4KB folios, folio copy overheads are significantly small in single-page
>> migrations to impact overall speedup, even for 512 pages, maximum theoretical
>> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>>
>> For 2MB THPs, folio copy overheads are significant even in single page
>> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
>> speedup and up to ~33x for 512 pages.
>>
>> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
>> based on my measurements for copying 512 2MB pages.
>> This gives move_pages(), a practical speedup of 6.3x for 512 2MB page (also
>> observed in the experiments below).
>>
>> DESIGN: A Pluggable Migrator Framework
>> ---------------------------------------
>>
>> Introduce migrate_folios_batch_move():
>>
>> [ migrate_pages_batch() ]
>>     |
>>     |--> migrate_folio_unmap()
>>     |
>>     |--> try_to_unmap_flush()
>>     |
>>     +--> [ migrate_folios_batch_move() ] // new batched design
>>             |
>>             |--> Metadata migration
>>             |    - Metadata prep: Copy flags, mappings, etc.
>>             |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>>             |
>>             |--> Batch copy folio data
>>             |    - Migrator is configurable at runtime via sysfs.
>>             |
>>             |          static_call(_folios_copy) // Pluggable migrators
>>             |          /          |            \
>>             |         v           v             v
>>             | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
>>             |
>>             +--> Update PTEs to point to dst folios and complete migration.
>>
>
> I just jump in the discussion, so this may be discussed before already.
> Sorry if so.  Why not
>
> migrate_folios_unmap()
> try_to_unmap_flush()
> copy folios in parallel if possible
> migrate_folios_move(): with MIGRATE_NO_COPY?

In move_to_new_folio() there is various migration preparation work, which
can fail. Copying folios regardless might lead to some unnecessary work.
What is your take on this?

>
>> User Control of Migrator:
>>
>> # echo 1 > /sys/kernel/dcbm/offloading
>>    |
>>    +--> Driver's sysfs handler
>>         |
>>         +--> calls start_offloading(&cpu_migrator)
>>               |
>>               +--> calls offc_update_migrator()
>>                     |
>>                     +--> static_call_update(_folios_copy, mig->migrate_offc)
>>
>> Later, During Migration ...
>> migrate_folios_batch_move()
>>     |
>>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>>           |
>>           +-> [ mtcopy | dcbm | kernel_default ]
>>
>
> [snip]
>
> ---
> Best Regards,
> Huang, Ying


Best Regards,
Yan, Zi
Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
Posted by Huang, Ying 1 week ago
Zi Yan <ziy@nvidia.com> writes:

> On 23 Sep 2025, at 21:49, Huang, Ying wrote:
>
>> Hi, Shivank,
>>
>> Thanks for working on this!
>>
>> Shivank Garg <shivankg@amd.com> writes:
>>
>>> This is the third RFC of the patchset to enhance page migration by batching
>>> folio-copy operations and enabling acceleration via multi-threaded CPU or
>>> DMA offload.
>>>
>>> Single-threaded, folio-by-folio copying bottlenecks page migration
>>> in modern systems with deep memory hierarchies, especially for large
>>> folios where copy overhead dominates, leaving significant hardware
>>> potential untapped.
>>>
>>> By batching the copy phase, we create an opportunity for significant
>>> hardware acceleration. This series builds a framework for this acceleration
>>> and provides two initial offload driver implementations: one using multiple
>>> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>>>
>>> This version incorporates significant feedback to improve correctness,
>>> robustness, and the efficiency of the DMA offload path.
>>>
>>> Changelog since V2:
>>>
>>> 1. DMA Engine Rewrite:
>>>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>>>    - Single completion interrupt per batch (reduced overhead)
>>>    - Order of magnitude improvement in setup time for large batches
>>> 2. Code cleanups and refactoring
>>> 3. Rebased on latest mainline (6.17-rc6+)
>>>
>>> MOTIVATION:
>>> -----------
>>>
>>> Current Migration Flow:
>>> [ move_pages(), Compaction, Tiering, etc. ]
>>>               |
>>>               v
>>>      [ migrate_pages() ] // Common entry point
>>>               |
>>>               v
>>>     [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>>>       |
>>>       |--> [ migrate_folio_unmap() ]
>>>       |
>>>       |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>>>       |
>>>       |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>>>            - For each folio:
>>>              - Metadata prep: Copy flags, mappings, etc.
>>>              - folio_copy()  <-- Single-threaded, serial data copy.
>>>              - Update PTEs & finalize for that single folio.
>>>
>>> Understanding overheads in page migration (move_pages() syscall):
>>>
>>> Total move_pages() overheads = folio_copy() + Other overheads
>>> 1. folio_copy() is the core copy operation that interests us.
>>> 2. The remaining operations are user/kernel transitions, page table walks,
>>> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
>>> mappings and PTEs etc. that contribute to the remaining overheads.
>>>
>>> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
>>> Number of pages being migrated and folio size:
>>>             4KB     2MB
>>> 1 page     <1%     ~66%
>>> 512 page   ~35%    ~97%
>>>
>>> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
>>> substantial performance opportunity.
>>>
>>> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
>>> Where F is the fraction of time spent in folio_copy() and S is the speedup of
>>> folio_copy().
>>>
>>> For 4KB folios, folio copy overheads are significantly small in single-page
>>> migrations to impact overall speedup, even for 512 pages, maximum theoretical
>>> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>>>
>>> For 2MB THPs, folio copy overheads are significant even in single page
>>> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
>>> speedup and up to ~33x for 512 pages.
>>>
>>> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
>>> based on my measurements for copying 512 2MB pages.
>>> This gives move_pages(), a practical speedup of 6.3x for 512 2MB page (also
>>> observed in the experiments below).
>>>
>>> DESIGN: A Pluggable Migrator Framework
>>> ---------------------------------------
>>>
>>> Introduce migrate_folios_batch_move():
>>>
>>> [ migrate_pages_batch() ]
>>>     |
>>>     |--> migrate_folio_unmap()
>>>     |
>>>     |--> try_to_unmap_flush()
>>>     |
>>>     +--> [ migrate_folios_batch_move() ] // new batched design
>>>             |
>>>             |--> Metadata migration
>>>             |    - Metadata prep: Copy flags, mappings, etc.
>>>             |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>>>             |
>>>             |--> Batch copy folio data
>>>             |    - Migrator is configurable at runtime via sysfs.
>>>             |
>>>             |          static_call(_folios_copy) // Pluggable migrators
>>>             |          /          |            \
>>>             |         v           v             v
>>>             | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
>>>             |
>>>             +--> Update PTEs to point to dst folios and complete migration.
>>>
>>
>> I just jump in the discussion, so this may be discussed before already.
>> Sorry if so.  Why not
>>
>> migrate_folios_unmap()
>> try_to_unmap_flush()
>> copy folios in parallel if possible
>> migrate_folios_move(): with MIGRATE_NO_COPY?
>
> Since in move_to_new_folio(), there are various migration preparation
> works, which can fail. Copying folios regardless might lead to some
> unnecessary work. What is your take on this?

Good point, we should skip copying folios that fail the checks.

>>
>>> User Control of Migrator:
>>>
>>> # echo 1 > /sys/kernel/dcbm/offloading
>>>    |
>>>    +--> Driver's sysfs handler
>>>         |
>>>         +--> calls start_offloading(&cpu_migrator)
>>>               |
>>>               +--> calls offc_update_migrator()
>>>                     |
>>>                     +--> static_call_update(_folios_copy, mig->migrate_offc)
>>>
>>> Later, During Migration ...
>>> migrate_folios_batch_move()
>>>     |
>>>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>>>           |
>>>           +-> [ mtcopy | dcbm | kernel_default ]
>>>
>>
>> [snip]

---
Best Regards,
Huang, Ying
Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
Posted by Zi Yan 1 week ago
On 23 Sep 2025, at 13:47, Shivank Garg wrote:

> This is the third RFC of the patchset to enhance page migration by batching
> folio-copy operations and enabling acceleration via multi-threaded CPU or
> DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration
> in modern systems with deep memory hierarchies, especially for large
> folios where copy overhead dominates, leaving significant hardware
> potential untapped.
>
> By batching the copy phase, we create an opportunity for significant
> hardware acceleration. This series builds a framework for this acceleration
> and provides two initial offload driver implementations: one using multiple
> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>
> This version incorporates significant feedback to improve correctness,
> robustness, and the efficiency of the DMA offload path.
>
> Changelog since V2:
>
> 1. DMA Engine Rewrite:
>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>    - Single completion interrupt per batch (reduced overhead)
>    - Order of magnitude improvement in setup time for large batches
> 2. Code cleanups and refactoring
> 3. Rebased on latest mainline (6.17-rc6+)

Thanks for working on this.

It is better to rebase on top of Andrew’s mm-new tree.

I have a version at: https://github.com/x-y-z/linux-dev/tree/batched_page_migration_copy_amd_v3-mm-everything-2025-09-23-00-13.

The difference is that I changed Patch 6 to use padata_do_multithreaded()
instead of my own implementation, since padata is a nice framework
for doing multithreaded jobs. The downside is that your patch 9
no longer applies and you will need to hack kernel/padata.c to
achieve the same thing.

I also tried to attribute the page copy kthread time back to the initiating
thread, so that page copy time does not disappear when it is parallelized
using CPU threads. It is currently a hack in the last patch from the above
repo. With the patch, I can see that the system time of a page migration
process with multithreaded page copy looks almost the same as without it,
while the wall clock time is smaller. But I have not found time to ask
scheduler people about a proper implementation yet.


>
> MOTIVATION:
> -----------
>
> Current Migration Flow:
> [ move_pages(), Compaction, Tiering, etc. ]
>               |
>               v
>      [ migrate_pages() ] // Common entry point
>               |
>               v
>     [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>       |
>       |--> [ migrate_folio_unmap() ]
>       |
>       |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>       |
>       |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>            - For each folio:
>              - Metadata prep: Copy flags, mappings, etc.
>              - folio_copy()  <-- Single-threaded, serial data copy.
>              - Update PTEs & finalize for that single folio.
>
> Understanding overheads in page migration (move_pages() syscall):
>
> Total move_pages() overheads = folio_copy() + Other overheads
> 1. folio_copy() is the core copy operation that interests us.
> 2. The remaining operations are user/kernel transitions, page table walks,
> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
> mappings and PTEs etc. that contribute to the remaining overheads.
>
> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
> Number of pages being migrated and folio size:
>             4KB     2MB
> 1 page     <1%     ~66%
> 512 page   ~35%    ~97%
>
> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
> substantial performance opportunity.
>
> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
> Where F is the fraction of time spent in folio_copy() and S is the speedup of
> folio_copy().
>
> For 4KB folios, folio copy overheads are significantly small in single-page
> migrations to impact overall speedup, even for 512 pages, maximum theoretical
> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>
> For 2MB THPs, folio copy overheads are significant even in single page
> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
> speedup and up to ~33x for 512 pages.
>
> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
> based on my measurements for copying 512 2MB pages.
> This gives move_pages(), a practical speedup of 6.3x for 512 2MB page (also
> observed in the experiments below).
>
> DESIGN: A Pluggable Migrator Framework
> ---------------------------------------
>
> Introduce migrate_folios_batch_move():
>
> [ migrate_pages_batch() ]
>     |
>     |--> migrate_folio_unmap()
>     |
>     |--> try_to_unmap_flush()
>     |
>     +--> [ migrate_folios_batch_move() ] // new batched design
>             |
>             |--> Metadata migration
>             |    - Metadata prep: Copy flags, mappings, etc.
>             |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>             |
>             |--> Batch copy folio data
>             |    - Migrator is configurable at runtime via sysfs.
>             |
>             |          static_call(_folios_copy) // Pluggable migrators
>             |          /          |            \
>             |         v           v             v
>             | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
>             |
>             +--> Update PTEs to point to dst folios and complete migration.
>
>
> User Control of Migrator:
>
> # echo 1 > /sys/kernel/dcbm/offloading
>    |
>    +--> Driver's sysfs handler
>         |
>         +--> calls start_offloading(&cpu_migrator)
>               |
>               +--> calls offc_update_migrator()
>                     |
>                     +--> static_call_update(_folios_copy, mig->migrate_offc)
>
> Later, During Migration ...
> migrate_folios_batch_move()
>     |
>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>           |
>           +-> [ mtcopy | dcbm | kernel_default ]
>
>
> PERFORMANCE RESULTS:
> --------------------
>
> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
> 1 NUMA node per socket, Linux Kernel 6.16.0-rc6, DVFS set to Performance,
> PTDMA hardware.
>
> Benchmark: Use move_pages() syscall to move pages between two NUMA nodes.
>
> 1. Moving different sized folios (4KB, 16KB,..., 2MB) such that total transfer size is constant
> (1GB), with different number of parallel threads/channels.
> Metric: Throughput is reported in GB/s.
>
> a. Baseline (Vanilla kernel, single-threaded, folio-by-folio migration):
>
> Folio size|4K       | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
> ===============================================================================================================
> Tput(GB/s)|3.73±0.33| 5.53±0.36 | 5.90±0.56  | 6.34±0.08  | 6.50±0.05  | 6.86±0.61  | 6.92±0.71  | 10.67±0.36 |
>
> b. Multi-threaded CPU copy offload (mtcopy driver, use N Parallel Threads=1,2,4,8,12,16):
>
> Thread | 4K         | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
> ===============================================================================================================
> 1      | 3.84±0.10  | 5.23±0.31 | 6.01±0.55  | 6.34±0.60  | 7.16±1.00  | 7.12±0.78  | 7.10±0.86  | 10.94±0.13 |
> 2      | 4.04±0.19  | 6.72±0.38 | 7.68±0.12  | 8.15±0.06  | 8.45±0.09  | 9.29±0.17  | 9.87±1.01  | 17.80±0.12 |
> 4      | 4.72±0.21  | 8.41±0.70 | 10.08±1.67 | 11.44±2.42 | 10.45±0.17 | 12.60±1.97 | 12.38±1.73 | 31.41±1.14 |
> 8      | 4.91±0.28  | 8.62±0.13 | 10.40±0.20 | 13.94±3.75 | 11.03±0.61 | 14.96±3.29 | 12.84±0.63 | 33.50±3.29 |
> 12     | 4.84±0.24  | 8.75±0.08 | 10.16±0.26 | 10.92±0.22 | 11.72±0.14 | 14.02±2.51 | 14.09±2.65 | 34.70±2.38 |
> 16     | 4.77±0.22  | 8.95±0.69 | 10.36±0.26 | 11.03±0.22 | 11.58±0.30 | 13.88±2.71 | 13.00±0.75 | 35.89±2.07 |
>
> c. DMA offload (dcbm driver, use N DMA Channels=1,2,4,8,12,16):
>
> Chan Cnt| 4K        | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
> ===============================================================================================================
> 1      | 2.75±0.19  | 2.86±0.13 | 3.28±0.20  | 4.57±0.72  | 5.03±0.62  | 4.69±0.25  | 4.78±0.34  | 12.50±0.24 |
> 2      | 3.35±0.19  | 4.57±0.19 | 5.35±0.55  | 6.71±0.71  | 7.40±1.07  | 7.38±0.61  | 7.21±0.73  | 14.23±0.34 |
> 4      | 4.01±0.17  | 6.36±0.26 | 7.71±0.89  | 9.40±1.35  | 10.27±1.96 | 10.60±1.42 | 12.35±2.64 | 26.84±0.91 |
> 8      | 4.46±0.16  | 7.74±0.13 | 9.72±1.29  | 10.88±0.16 | 12.12±2.54 | 15.62±3.96 | 13.29±2.65 | 45.27±2.60 |
> 12     | 4.60±0.22  | 8.90±0.84 | 11.26±2.19 | 16.00±4.41 | 14.90±4.38 | 14.57±2.84 | 13.79±3.18 | 59.94±4.19 |
> 16     | 4.61±0.25  | 9.08±0.79 | 11.14±1.75 | 13.95±3.85 | 13.69±3.39 | 15.47±3.44 | 15.44±4.65 | 63.69±5.01 |
>
> - Throughput increases with folio size. Larger folios benefit more from DMA.
> - Scaling shows diminishing returns beyond 8-12 threads/channels.
> - Multi-threading and DMA offloading both provide significant gains - up to 3.4x and 6x respectively.
>
> 2. Varying total move size: (folio count = 1,8,..8192) for a fixed folio size of 2MB
>    using only single thread/channel
>
> folio_cnt | Baseline    | MTCPU      | DMA
> ====================================================
> 1         | 7.96±2.22   | 6.43±0.66  | 6.52±0.45   |
> 8         | 8.20±0.75   | 8.82±1.10  | 8.88±0.54   |
> 16        | 7.54±0.61   | 9.06±0.95  | 9.03±0.62   |
> 32        | 8.68±0.77   | 10.11±0.42 | 10.17±0.50  |
> 64        | 9.08±1.03   | 10.12±0.44 | 11.21±0.24  |
> 256       | 10.53±0.39  | 10.77±0.28 | 12.43±0.12  |
> 512       | 10.59±0.29  | 10.81±0.19 | 12.61±0.07  |
> 2048      | 10.86±0.26  | 11.05±0.05 | 12.75±0.03  |
> 8192      | 10.84±0.18  | 11.12±0.05 | 12.81±0.02  |
>
> - Throughput increases with folios count but plateaus after a threshold.
>   (The migrate_pages function uses a folio batch size of 512)
>
> Performance Analysis (V2 vs V3):
>
> The new SG-based DMA driver dramatically reduces software overhead. By
> switching from per-folio dma_map_page() to batch dma_map_sgtable(), setup
> time improves by an order of magnitude for large batches.
> This is most visible with 4KB folios, making DMA viable even for smaller
> page sizes. For 2MB THP migrations, where hardware transfer time is more
> dominant, the gains are more modest.
>
> OPEN QUESTIONS:
> ---------------
>
> User-Interface:
>
> 1. Control Interface Design:
> The current interface creates separate sysfs files
> for each driver, which can be confusing. Should we implement a unified interface
> (/sys/kernel/mm/migration/offload_migrator), which accepts the name of the desired migrator
> ("kernel", "mtcopy", "dcbm"). This would ensure only one migrator is active at a time.
> Is this the right approach?
>
> 2. Dynamic Migrator Selection:
> Currently, active migrator is a global state, and only one can be active a time.
> A more flexible approach might be for the caller of migrate_pages() to specify/hint which
> offload mechanism to use, if any. This would allow a CXL driver to explicitly request DMA while a GPU driver might prefer
> multi-threaded CPU copy.
>
> 3. Tuning Parameters: Expose parameters like number of threads/channels, batch size,
> and thresholds for using migrators. Who should own these parameters?
>
> 4. Resources Accounting[3]:
> a. CPU cgroups accounting and fairness
> b. Migration cost attribution
>
> FUTURE WORK:
> ------------
>
> 1. Enhance DMA drivers for bulk copying (e.g., SDXi Engine).
> 2. Enhance multi-threaded CPU copying for platform-specific scheduling of worker threads to optimize bandwidth utilization. Explore sched-ext for this. [2]
> 3. Enable kpromoted [4] to use the migration offload infrastructure.
>
> EARLIER POSTINGS:
> -----------------
>
> - RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
> - RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
>
> REFERENCES:
> -----------
>
> [1] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
> [2] LSFMM: https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
> [4] https://lore.kernel.org/all/20250910144653.212066-1-bharata@amd.com
>
> Mike Day (1):
>   mm: add support for copy offload for folio Migration
>
> Shivank Garg (4):
>   mm: Introduce folios_mc_copy() for batch copying folios
>   mm/migrate: add migrate_folios_batch_move to  batch the folio move
>     operations
>   dcbm: add dma core batch migrator for batch page offloading
>   mtcopy: spread threads across die for testing
>
> Zi Yan (4):
>   mm/migrate: factor out code in move_to_new_folio() and
>     migrate_folio_move()
>   mm/migrate: revive MIGRATE_NO_COPY in migrate_mode
>   mtcopy: introduce multi-threaded page copy routine
>   adjust NR_MAX_BATCHED_MIGRATION for testing
>
>  drivers/Kconfig                        |   2 +
>  drivers/Makefile                       |   3 +
>  drivers/migoffcopy/Kconfig             |  17 +
>  drivers/migoffcopy/Makefile            |   2 +
>  drivers/migoffcopy/dcbm/Makefile       |   1 +
>  drivers/migoffcopy/dcbm/dcbm.c         | 415 +++++++++++++++++++++++++
>  drivers/migoffcopy/mtcopy/Makefile     |   1 +
>  drivers/migoffcopy/mtcopy/copy_pages.c | 397 +++++++++++++++++++++++
>  include/linux/migrate_mode.h           |   2 +
>  include/linux/migrate_offc.h           |  34 ++
>  include/linux/mm.h                     |   2 +
>  mm/Kconfig                             |   8 +
>  mm/Makefile                            |   1 +
>  mm/migrate.c                           | 358 ++++++++++++++++++---
>  mm/migrate_offc.c                      |  58 ++++
>  mm/util.c                              |  29 ++
>  16 files changed, 1284 insertions(+), 46 deletions(-)
>  create mode 100644 drivers/migoffcopy/Kconfig
>  create mode 100644 drivers/migoffcopy/Makefile
>  create mode 100644 drivers/migoffcopy/dcbm/Makefile
>  create mode 100644 drivers/migoffcopy/dcbm/dcbm.c
>  create mode 100644 drivers/migoffcopy/mtcopy/Makefile
>  create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
>  create mode 100644 include/linux/migrate_offc.h
>  create mode 100644 mm/migrate_offc.c
>
> -- 
> 2.43.0


Best Regards,
Yan, Zi