[v1] Accelerate page migration with batch copying and hardware offload

[PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Shivank Garg 1 month, 2 weeks ago

This is the fifth RFC of the patchset to enhance page migration by
batching folio-copy operations and enabling acceleration via DMA offload.

Single-threaded, folio-by-folio copying bottlenecks page migration in
modern systems with deep memory hierarchies, especially for large folios
where copy overhead dominates, leaving significant hardware potential
untapped.

By batching the copy phase, we create an opportunity for hardware
acceleration. This series builds the framework and provides a DMA
offload driver (dcbm) as a reference implementation, targeting bulk
migration workloads where offloading the copy improves throughput
and latency while freeing CPU cycles.

See the RFC V3 cover letter [2] for motivation.

Changelog since V4:
-------------------

1. Renamed PAGE_* migration state flags to FOLIO_*. (David)
2. Use the new folio->migrate_info field instead of folio->private
   for migration state. (David)
3. Folded the folios_mc_copy patch into the batch-copy implementation
   patch. (David)
4. Renamed migrate_offload_start()/stop() to register()/unregister().
   (Huang, Ying)
5. Dropped should_batch() callback from struct migrator. Reason-based
   policy now lives in migrate_pages_batch(). Migrators can still skip
   a batch they don't want (size-based policy). (Huang, Ying)
6. CONFIG_MIGRATION_COPY_OFFLOAD is now hidden and selected by the
   migrator driver. CONFIG_DCBM_DMA is tristate. (Huang, Ying; Gregory Price)
7. Wrapped the SRCU + static_call dispatch in a small helper. (Huang, Ying)
8. Require m->owner in migrate_offload_register(); the SRCU sync at
   unregister relies on it. Counters are atomic_long_t to avoid a
   lock-order issue.
9. Moved DCBM sysfs from /sys/kernel/dcbm to /sys/module/dcbm. (Huang, Ying)
10. Rebased on v7.1-rc1.


DESIGN:
-------

New Migration Flow:

[ migrate_pages_batch() ]
    |
    |--> do_batch = migrate_offload_do_batch(reason)  // core filters by migration reason
    |
    |--> for each folio:
    |      migrate_folio_unmap()        // unmap the folio
    |      |
    |      +--> (success):
    |           if do_batch && folio_supports_batch_copy():
    |               -> unmap_batch / dst_batch  // batch list for copy offloading
    |           else:
    |               -> unmap_single / dst_single // single lists for per-folio CPU copy
    |
    |--> try_to_unmap_flush()                   // single batched TLB flush
    |
    |--> Batch copy (if unmap_batch not empty):
    |    - Migrator is configurable at runtime via sysfs.
    |
    |      static_call(migrate_offload_copy)    // Pluggable Migrators
    |              /          |            \
    |             v           v             v
    |     [ Default ]  [ DMA Offload ]  [ ... ]
    |
    |      On -EOPNOTSUPP or other error, batch falls back to per-folio CPU copy.
    |
    +--> migrate_folios_move()      // metadata, update PTEs, finalize
         (batch list with already_copied=true, single list with false)
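
For illustration only (this is not code from the series): a minimal
userspace sketch of the flow above, assuming toy stand-ins for folios and
for the offload hook. All toy_* names, TOY_FOLIO_BYTES and TOY_MAX_BATCH are
hypothetical. It shows the split into a batch list and a single list, the
offload attempt, and the fallback to per-folio CPU copy when the hook
returns an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_FOLIO_BYTES 64   /* toy folio size, not a kernel constant */
#define TOY_MAX_BATCH   16   /* toy cap on folios per batch */

struct toy_folio {
	unsigned char data[TOY_FOLIO_BYTES];
	bool batchable;       /* stands in for folio_supports_batch_copy() */
	bool already_copied;  /* set on a destination once the hook copied it */
};

/* Offload hook: copies n folios, returns 0 or a negative errno. */
typedef int (*toy_offload_fn)(struct toy_folio *dst[],
			      struct toy_folio *src[], size_t n);

static void toy_cpu_copy(struct toy_folio *dst, const struct toy_folio *src)
{
	memcpy(dst->data, src->data, TOY_FOLIO_BYTES);
}

/* Migrator that accepts every batch (here it simply copies on the CPU). */
static int toy_accepting_migrator(struct toy_folio *dst[],
				  struct toy_folio *src[], size_t n)
{
	for (size_t i = 0; i < n; i++)
		toy_cpu_copy(dst[i], src[i]);
	return 0;
}

/* Migrator that declines every batch. */
static int toy_declining_migrator(struct toy_folio *dst[],
				  struct toy_folio *src[], size_t n)
{
	(void)dst; (void)src; (void)n;
	return -EOPNOTSUPP;
}

/*
 * Build the batch list, try the offload hook on it, then copy every folio
 * that is not marked already_copied on the CPU. Returns how many folios
 * the hook copied.
 */
static size_t toy_migrate_batch(struct toy_folio dst[], struct toy_folio src[],
				size_t n, bool do_batch, toy_offload_fn offload)
{
	struct toy_folio *bsrc[TOY_MAX_BATCH], *bdst[TOY_MAX_BATCH];
	size_t nb = 0, offloaded = 0, i;

	for (i = 0; i < n; i++) {
		dst[i].already_copied = false;
		if (do_batch && src[i].batchable && nb < TOY_MAX_BATCH) {
			bsrc[nb] = &src[i];
			bdst[nb] = &dst[i];
			nb++;
		}
	}

	if (nb && offload && offload(bdst, bsrc, nb) == 0) {
		for (i = 0; i < nb; i++)
			bdst[i]->already_copied = true;
		offloaded = nb;
	}

	/* "Move" phase: CPU copy for single-list folios and failed batches. */
	for (i = 0; i < n; i++)
		if (!dst[i].already_copied)
			toy_cpu_copy(&dst[i], &src[i]);

	return offloaded;
}
```

With toy_declining_migrator the return value is 0 and every folio is still
copied, which is the fallback behaviour the diagram describes.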

Offload Registration:

    Driver fills struct migrator { .name, .offload_copy, .owner } and calls
    migrate_offload_register().  This:
      - Pins the module via try_module_get()
      - Patches the migrate_offload_copy() static_call target
      - Enables the migrate_offload_enabled static branch

    migrate_offload_unregister() disables the static branch and reverts
    the static_call, then synchronize_srcu() waits for in-flight migrations
    before module_put().
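
For illustration only (not code from the series): a single-threaded
userspace model of the register/unregister ordering described above. A plain
function pointer stands in for the static_call target, a bool for the static
branch, and a refcount for the module pin. All toy_* names are hypothetical,
and the -EINVAL/-EBUSY/-ENODEV return values are assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_module {
	int refcount;
	bool live;            /* false models a module that is going away */
};

struct toy_migrator {
	const char *name;
	int (*offload_copy)(size_t nr_folios);
	struct toy_module *owner;
};

/* Default target: no migrator, so the caller falls back to CPU copy. */
static int toy_default_copy(size_t nr_folios)
{
	(void)nr_folios;
	return -EOPNOTSUPP;
}

/* Example hook for a migrator that accepts every batch. */
static int toy_accepting_copy(size_t nr_folios)
{
	(void)nr_folios;
	return 0;
}

static int (*toy_copy_target)(size_t) = toy_default_copy;
static bool toy_offload_enabled;
static struct toy_migrator *toy_active;

static bool toy_module_get(struct toy_module *m)
{
	if (!m->live)
		return false;
	m->refcount++;
	return true;
}

static int toy_offload_register(struct toy_migrator *m)
{
	if (!m || !m->offload_copy || !m->owner)
		return -EINVAL;
	if (toy_active)
		return -EBUSY;
	if (!toy_module_get(m->owner))       /* pin the module first */
		return -ENODEV;
	toy_copy_target = m->offload_copy;   /* patch the call target */
	toy_active = m;
	toy_offload_enabled = true;          /* then open the gate */
	return 0;
}

static void toy_offload_unregister(struct toy_migrator *m)
{
	if (toy_active != m)
		return;
	toy_offload_enabled = false;         /* close the gate */
	toy_copy_target = toy_default_copy;  /* revert the call target */
	toy_active = NULL;
	/*
	 * The series waits for in-flight migrations with synchronize_srcu()
	 * at this point; this single-threaded model has no readers to wait for.
	 */
	m->owner->refcount--;                /* finally drop the module pin */
}

static int toy_offload_dispatch(size_t nr_folios)
{
	return toy_offload_enabled ? toy_copy_target(nr_folios) : -EOPNOTSUPP;
}
```

The point of the ordering is that the module pin is taken before the hook
becomes reachable and dropped only after it is unreachable again.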

PERFORMANCE RESULTS:
--------------------

Re-ran the V4 workload on v7.1-rc1 with this series; relative
speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
change in V5 alters this picture; please refer to the V4 cover letter
for the throughput tables [1].


PLAN:
-----

Patches 1-4 (the batching infrastructure) don't depend on the migrator
interface, so if it helps I can split them off and post them ahead of
the migrator and DCBM bits, which still have a few open questions to
work through.

I would appreciate guidance on whether maintainers would prefer the
infrastructure portion to be split off ahead of the migrator interface.

OPEN QUESTIONS:
---------------

1. Should the batch path run without a registered migrator? Patches 1-4
   are self-contained and use folios_mc_copy() (CPU). The options are to
   make the batch path always-on for eligible folios, to give the admin a
   way to flip the static branch, or to keep the current gate.
   I'm leaning toward always-on.

2. Carrying already_copied via folio->migrate_info vs changing the
   migrate_folio() callback signature (Huang, Ying). I went with the
   field for now to avoid touching every fs callback before the design
   settles. Happy to revisit.

3. Per-caller offload selection: Today eligibility is decided by
   migrate_reason only. Some callers are latency-tolerant, others may not
   be. Is reason the right granularity, or do we want a per-caller hint?

4. Cgroup integration: How should per-cgroup usage be accounted across
   different migrators (e.g., any accounting for DMA-busy time)?

5. Tuning migrate_pages callers for offloading. For instance, in
   compaction COMPACT_CLUSTER_MAX = 32 caps DMA's payoff for compaction
   (V4 experiment).

6. Where do batch-size thresholds live, and how are they tuned? Per
   Huang Ying's split, that policy lives in the migrator. DCBM has no
   threshold today. Open whether it should later be a per-migrator
   sysfs knob or hard-coded; probably clearer once a second migrator
   (SDXI, mtcopy) shows the trade-off.
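
For illustration only: one way a migrator-side size policy (question 6, and
the "skip a batch" case from the changelog) could look. DCBM has no
threshold today, so struct toy_policy, its fields and toy_offload_copy()
are hypothetical. The sketch only shows the shape: decline a batch that is
too small by returning -EOPNOTSUPP so the core falls back to CPU copy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-migrator thresholds (e.g. set by a sysfs knob). */
struct toy_policy {
	size_t min_folios;   /* decline batches with fewer folios */
	size_t min_bytes;    /* decline batches moving fewer bytes */
};

/*
 * Returns 0 if the migrator would take the batch, -EOPNOTSUPP if the
 * batch is below either threshold. All folios are assumed to have the
 * same size to keep the sketch short.
 */
static int toy_offload_copy(const struct toy_policy *p,
			    size_t nr_folios, size_t folio_bytes)
{
	size_t need;

	if (folio_bytes == 0 || nr_folios < p->min_folios)
		return -EOPNOTSUPP;

	/* ceil(min_bytes / folio_bytes), avoiding nr_folios * folio_bytes overflow */
	need = p->min_bytes / folio_bytes + (p->min_bytes % folio_bytes != 0);
	if (nr_folios < need)
		return -EOPNOTSUPP;

	return 0;   /* a real migrator would program the copy engine here */
}
```

For example, with min_folios = 8 and min_bytes = 1 MiB, sixteen 4 KiB folios
(64 KiB) are declined while sixteen 2 MiB folios are accepted.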


FOLLOW-UPS:
-----------

1. dmaengine_prep_dma_memcpy_sg() in DCBM (Vinod Koul). The SG-prep
   variant cuts per-batch prep/submit cost (=CPU savings), but ptdma does
   not implement the SG hook yet [10]. The end-to-end migration throughput
   delta is small because per-descriptor execute time dominates.
   I'll post the ptdma SG hook + DCBM switch as a follow-up.

2. SDXI as a second migrator. The SDXI series [11] is in review. SDXI is
   a generic memcpy engine without DMA_PRIVATE, so channel acquisition
   goes through dma_find_channel() or async_tx rather than
   dma_request_chan_by_mask(). I have a local DCBM variant working on top
   of the SDXI driver. I'm planning to send it as a follow-up once the
   SDXI series settles.

3. IOMMU SG merging in DCBM (Gregory). dma_map_sgtable() may merge
   contiguous PFNs unevenly, so src.nents != dst.nents. DCBM falls back
   to CPU for safety, though I haven't seen this happen on Zen3 + PTDMA.
   I'll look into this and address it in a follow-up.
 
4. Revisit Multi-threaded CPU copy migrator once the infra is settled.
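
For illustration only: a toy model of the nents mismatch in follow-up 3.
It assumes that merging simply coalesces runs of address-contiguous 4 KiB
pages into one segment; what dma_map_sgtable() actually does depends on the
IOMMU and DMA-mapping implementation. toy_merged_nents() and
toy_can_offload() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Count segments after merging runs of address-contiguous pages. */
static size_t toy_merged_nents(const uint64_t addr[], size_t n)
{
	size_t nents = 0;

	for (size_t i = 0; i < n; i++)
		if (i == 0 || addr[i] != addr[i - 1] + TOY_PAGE_SIZE)
			nents++;
	return nents;
}

/*
 * Mirror the safety check described above: offload only if source and
 * destination end up with the same number of segments, else use the CPU.
 */
static bool toy_can_offload(const uint64_t src[], const uint64_t dst[],
			    size_t n)
{
	return toy_merged_nents(src, n) == toy_merged_nents(dst, n);
}
```

Four source pages at 0x1000, 0x2000, 0x3000, 0x8000 merge into two segments,
while four fully contiguous destination pages merge into one, so this pair
would take the CPU fallback.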

EARLIER POSTINGS:
-----------------
[1] RFC V4: https://lore.kernel.org/all/20260309120725.308854-3-shivankg@amd.com
[2] RFC V3: https://lore.kernel.org/all/20250923174752.35701-1-shivankg@amd.com
[3] RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
[4] RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
[5] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com

RELATED DISCUSSIONS:
--------------------
[6] MM-alignment Session [Nov 12, 2025]:
    https://lore.kernel.org/linux-mm/bd6a3c75-b9f0-cbcf-f7c4-1ef5dff06d24@google.com
[7] Linux Memory Hotness and Promotion call [Nov 6, 2025]:
    https://lore.kernel.org/linux-mm/8ff2fd10-c9ac-4912-cf56-7ecd4afd2770@google.com
[8] LSFMM 2025:
    https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
[9] OSS India:
    https://ossindia2025.sched.com/event/23Jk1
[10] DMA_MEMCPY_SG comparison:
     https://lore.kernel.org/linux-mm/3e73addb-ac01-4a05-bc75-c6c1c56072df@amd.com
[11] SDXI V1:
     https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com

Thanks to everyone who reviewed, tested or participated in discussions
around this series. Your feedback helped me throughout the development
process.

Best Regards,
Shivank


Shivank Garg (6):
  mm/migrate: rename PAGE_ migration flags to FOLIO_
  mm/migrate: use migrate_info field instead of private
  mm/migrate: skip data copy for already-copied folios
  mm/migrate: add batch-copy path in migrate_pages_batch
  mm/migrate: add copy offload registration infrastructure
  drivers/migrate_offload: add DMA batch copy driver (dcbm)

Zi Yan (1):
  mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing

 drivers/Kconfig                       |   2 +
 drivers/Makefile                      |   2 +
 drivers/migrate_offload/Kconfig       |   9 +
 drivers/migrate_offload/Makefile      |   1 +
 drivers/migrate_offload/dcbm/Makefile |   1 +
 drivers/migrate_offload/dcbm/dcbm.c   | 440 ++++++++++++++++++++++++++
 include/linux/migrate_copy_offload.h  |  44 +++
 include/linux/mm.h                    |   2 +
 include/linux/mm_types.h              |   1 +
 mm/Kconfig                            |   6 +
 mm/Makefile                           |   1 +
 mm/migrate.c                          | 211 ++++++++----
 mm/migrate_copy_offload.c             |  94 ++++++
 mm/util.c                             |  30 ++
 14 files changed, 784 insertions(+), 60 deletions(-)
 create mode 100644 drivers/migrate_offload/Kconfig
 create mode 100644 drivers/migrate_offload/Makefile
 create mode 100644 drivers/migrate_offload/dcbm/Makefile
 create mode 100644 drivers/migrate_offload/dcbm/dcbm.c
 create mode 100644 include/linux/migrate_copy_offload.h
 create mode 100644 mm/migrate_copy_offload.c


base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
-- 
2.43.0

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Garg, Shivank 1 month, 2 weeks ago

Hi all,

Apologies. The subject prefix should have been [RFC PATCH v5 0/7].

This is the fifth RFC, as mentioned in the cover letter, but I
missed the prefix while formatting the patches. Please treat this
round as RFC v5.

Thanks,
Shivank

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by David Hildenbrand (Arm) 1 month, 2 weeks ago

On 4/28/26 19:11, Garg, Shivank wrote:
> Hi all,
> 
> Apologies. The subject prefix should have been [RFC PATCH v5 0/7].
> 
> This is the fifth RFC, as mentioned in the cover letter, but I
> missed the prefix while formatting the patches. Please treat this
> round as RFC v5.

Ever since I switched to b4 for patch management, the quality of my life improved :)

$ b4 prep -n SERIES -f mm/mm-unstable
$ b4 prep --set-prefixes RFC
... add patches
$ b4 prep --auto-to-cc
$ b4 prep --edit-cover
$ b4 send  --no-sign

-- 
Cheers,

David

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Garg, Shivank 1 month, 2 weeks ago


On 4/29/2026 1:03 AM, David Hildenbrand (Arm) wrote:
> On 4/28/26 19:11, Garg, Shivank wrote:
>> Hi all,
>>
>> Apologies. The subject prefix should have been [RFC PATCH v5 0/7].
>>
>> This is the fifth RFC, as mentioned in the cover letter, but I
>> missed the prefix while formatting the patches. Please treat this
>> round as RFC v5.
> 
> Ever since I switched to b4 for patch management, the quality of my life improved :)
> 
> $ b4 prep -n SERIES -f mm/mm-unstable
> $ b4 prep --set-prefixes RFC
> ... add patches
> $ b4 prep --auto-to-cc
> $ b4 prep --edit-cover
> $ b4 send  --no-sign
> 

Thanks, appreciate the pointers. :)
I'll switch to b4.

Best regards,
Shivank

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by David Hildenbrand (Arm) 1 month ago

On 4/28/26 17:50, Shivank Garg wrote:
> This is the fifth RFC of the patchset to enhance page migration by

Ah, this is an RFC ...

... I suggest b4 for patch series management :P

That also explains why patch #7 is still in there.

> batching folio-copy operations and enabling acceleration via DMA offload.
> 
> Single-threaded, folio-by-folio copying bottlenecks page migration in
> modern systems with deep memory hierarchies, especially for large folios
> where copy overhead dominates, leaving significant hardware potential
> untapped.
> 
> By batching the copy phase, we create an opportunity for hardware
> acceleration. This series builds the framework and provides a DMA
> offload driver (dcbm) as a reference implementation, targeting bulk
> migration workloads where offloading the copy improves throughput
> and latency while freeing the CPU cycles.
> 
> See the RFC V3 cover letter [2] for motivation.
> 
> Changelog since V4:
> -------------------
> 
> 1. Renamed PAGE_* migration state flags to FOLIO_*. (David)
> 2. Use the new folio->migrate_info field instead of folio->private
>    for migration state. (David)
> 3. Fold folios_mc_copy patch in batch-copy implementation patch. (David)
> 3. Renamed migrate_offload_start()/stop() to register()/unregister().
>    (Huang, Ying)
> 4. Dropped should_batch() callback from struct migrator. Reason-based
>    policy now lives in migrate_pages_batch(). Migrators can still skip
>    a batch they don't want (size based policy). (Huang, Ying)
> 5. CONFIG_MIGRATION_COPY_OFFLOAD is now hidden and selected by the
>    migrator driver. CONFIG_DCBM_DMA is tristate. (Huang Ying, Gregory Price). 
> 6. Wrapped the SRCU + static_call dispatch in a small helper. (Huang, Ying)
> 7. Requir m->owner in migrate_offload_register(), SRCU sync at
>    unregister relies on it. Counters are atomic_long_t to avoid lock-order
>    issue.
> 9. Moved DCBM sysfs from /sys/kernel/dcbm to /sys/module/dcbm (Huang, Ying) 
> 10. Rebased on v7.1-rc1.
> 

[...]

> 
> OPEN QUESTIONS:
> ---------------
> 
> 1. Should the batch path run without a registered migrator? Patches 1-4
>    are self-contained and use folios_mc_copy() (CPU). I have several
>    options like making batch path always-on for eligible folios, or
>    giving admin an option to flip the static branch, or keep the gate.
>    I'm leaning toward always-on.

Hiding that detail from migrate.c sounds interesting.

> 
> 2. Carrying already_copied via folio->migrate_info vs changing the
>    migrate_folio() callback signature (Huang, Ying). I went with the
>    field for now to avoid touching every fs callback before the design
>    settles. Happy to revisit.
> 
> 3. Per-caller offload selection: Today eligibility is by migrate_reason
>    only. Some are latency-tolerant, others may be not. Is reason the
>    right granularity, or do we want a per-caller hint?

Isn't it sufficient to just do it based on the #folios or sth like that?

If someone migrates a handful of folios, latency is likely more important (and
batching less beneficial).

I'd assume when migrating many folios, batching could just always be done. Or
what's the concern?

> 
> 4. Cgroup integration: How should per-cgroup be accounted for different
>    migrators (e.g.: any accounting for DMA-busy time)?

Oh. Do we even have to mess with that?

> 
> 5. Tuning migrate_pages callers for offloading. For instance, in
>    compaction COMPACT_CLUSTER_MAX = 32 caps DMA's payoff for compaction
>    (V4 experiment).

Is that HW dependent?

> 
> 6. Where do batch-size thresholds live, and how are they tuned? Per
>    Huang Ying's split, that policy lives in the migrator. DCBM has no
>    threshold today. Open whether it should later be a per-migrator
>    sysfs knob or hard-coded; probably clearer once a second migrator
>    (SDXI, mtcopy) shows the trade-off.

Again, sounds like being HW dependent, no?


-- 
Cheers,

David

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Garg, Shivank 3 weeks, 6 days ago


On 5/11/2026 9:23 PM, David Hildenbrand (Arm) wrote:
> On 4/28/26 17:50, Shivank Garg wrote:
>> This is the fifth RFC of the patchset to enhance page migration by
> 
> Ah, this is an RFC ...
> 
> ... I suggest b4 for patch series management :P
> 
> That also explains why patch #7 is still in there.
> 

Yes, I've started using it :)

Patch 7 is for testing only, but I still need to work out the optimum
batch size for offload, which depends on the HW, or add a callback as per
Huang Ying's suggestion.


>> batching folio-copy operations and enabling acceleration via DMA offload.
>>
>> Single-threaded, folio-by-folio copying bottlenecks page migration in
>> modern systems with deep memory hierarchies, especially for large folios
>> where copy overhead dominates, leaving significant hardware potential
>> untapped.
>>
>> By batching the copy phase, we create an opportunity for hardware
>> acceleration. This series builds the framework and provides a DMA
>> offload driver (dcbm) as a reference implementation, targeting bulk
>> migration workloads where offloading the copy improves throughput
>> and latency while freeing the CPU cycles.
>>
>> See the RFC V3 cover letter [2] for motivation.
>>
>> Changelog since V4:
>> -------------------
>>
>> 1. Renamed PAGE_* migration state flags to FOLIO_*. (David)
>> 2. Use the new folio->migrate_info field instead of folio->private
>>    for migration state. (David)
>> 3. Fold folios_mc_copy patch in batch-copy implementation patch. (David)
>> 3. Renamed migrate_offload_start()/stop() to register()/unregister().
>>    (Huang, Ying)
>> 4. Dropped should_batch() callback from struct migrator. Reason-based
>>    policy now lives in migrate_pages_batch(). Migrators can still skip
>>    a batch they don't want (size based policy). (Huang, Ying)
>> 5. CONFIG_MIGRATION_COPY_OFFLOAD is now hidden and selected by the
>>    migrator driver. CONFIG_DCBM_DMA is tristate. (Huang Ying, Gregory Price). 
>> 6. Wrapped the SRCU + static_call dispatch in a small helper. (Huang, Ying)
>> 7. Requir m->owner in migrate_offload_register(), SRCU sync at
>>    unregister relies on it. Counters are atomic_long_t to avoid lock-order
>>    issue.
>> 9. Moved DCBM sysfs from /sys/kernel/dcbm to /sys/module/dcbm (Huang, Ying) 
>> 10. Rebased on v7.1-rc1.
>>
> 
> [...]
> 
>>
>> OPEN QUESTIONS:
>> ---------------
>>
>> 1. Should the batch path run without a registered migrator? Patches 1-4
>>    are self-contained and use folios_mc_copy() (CPU). I have several
>>    options like making batch path always-on for eligible folios, or
>>    giving admin an option to flip the static branch, or keep the gate.
>>    I'm leaning toward always-on.
> 
> Hiding that detail from migrate.c sounds interesting.
> 

Yes, will do that.


>> 2. Carrying already_copied via folio->migrate_info vs changing the
>>    migrate_folio() callback signature (Huang, Ying). I went with the
>>    field for now to avoid touching every fs callback before the design
>>    settles. Happy to revisit.
>>
>> 3. Per-caller offload selection: Today eligibility is by migrate_reason
>>    only. Some are latency-tolerant, others may be not. Is reason the
>>    right granularity, or do we want a per-caller hint?
> 
> Isn't it sufficient to just do it based on the #folios or sth like that?
> 
> If someone migrates a handful of folios, latency is likely more important (and
> batching less beneficial).
> 
> I'd assume when migrating many folios, batching could just always be done. Or
> what's the concern?
> 

It could be a requirement for some users who want only specific use cases to go
through DMA offload.

I agree with your point, and will discuss it further.

>>
>> 4. Cgroup integration: How should per-cgroup be accounted for different
>>    migrators (e.g.: any accounting for DMA-busy time)?
> 
> Oh. Do we even have to mess with that?

Probably not for the initial series.
Will drop this question.

>>
>> 5. Tuning migrate_pages callers for offloading. For instance, in
>>    compaction COMPACT_CLUSTER_MAX = 32 caps DMA's payoff for compaction
>>    (V4 experiment).
> 
> Is that HW dependent?
> 
>>
>> 6. Where do batch-size thresholds live, and how are they tuned? Per
>>    Huang Ying's split, that policy lives in the migrator. DCBM has no
>>    threshold today. Open whether it should later be a per-migrator
>>    sysfs knob or hard-coded; probably clearer once a second migrator
>>    (SDXI, mtcopy) shows the trade-off.
> 
> Again, sounds like being HW dependent, no?

Yes, both are HW dependent.
Batch-size gating fits naturally in the migrator.
For something like COMPACT_CLUSTER_MAX, would a callback from compaction
to the registered migrator be the right approach? Or do you have something
else in mind? For the initial series, I don't think I need to touch it.

Thanks,
Shivank

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Huang, Ying 1 month ago

"David Hildenbrand (Arm)" <david@kernel.org> writes:

> On 4/28/26 17:50, Shivank Garg wrote:
>> This is the fifth RFC of the patchset to enhance page migration by

[snip]
>
>> 
>> 3. Per-caller offload selection: Today eligibility is by migrate_reason
>>    only. Some are latency-tolerant, others may be not. Is reason the
>>    right granularity, or do we want a per-caller hint?
>
> Isn't it sufficient to just do it based on the #folios or sth like that?
>
> If someone migrates a handful of folios, latency is likely more important (and
> batching less beneficial).
>
> I'd assume when migrating many folios, batching could just always be done. Or
> what's the concern?

IIUC, for callers like migrate_pages syscall, it's possible that almost all
folios of a process are passed to migrate_pages().  However, I think that
we still need to keep the folio inaccessible time reasonable.

[snip]

---
Best Regards,
Huang, Ying

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by David Hildenbrand (Arm) 1 month ago

On 5/12/26 04:35, Huang, Ying wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
> 
>> On 4/28/26 17:50, Shivank Garg wrote:
>>> This is the fifth RFC of the patchset to enhance page migration by
> 
> [snip]
>>
>>>
>>> 3. Per-caller offload selection: Today eligibility is by migrate_reason
>>>    only. Some are latency-tolerant, others may be not. Is reason the
>>>    right granularity, or do we want a per-caller hint?
>>
>> Isn't it sufficient to just do it based on the #folios or sth like that?
>>
>> If someone migrates a handful of folios, latency is likely more important (and
>> batching less beneficial).
>>
>> I'd assume when migrating many folios, batching could just always be done. Or
>> what's the concern?
> 
> IIUC, for callers like migrate_pages syscall, it's possible that almost all
> folios of a process are passed to migrate_pages().  However, I think that
> we still need to keep the folio inaccessible time reasonable.

Wouldn't we still want to process them in batches, only affecting folios in a
batch at one point in time, not the whole address space?

-- 
Cheers,

David

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Huang, Ying 1 month ago

"David Hildenbrand (Arm)" <david@kernel.org> writes:

> On 5/12/26 04:35, Huang, Ying wrote:
>> "David Hildenbrand (Arm)" <david@kernel.org> writes:
>> 
>>> On 4/28/26 17:50, Shivank Garg wrote:
>>>> This is the fifth RFC of the patchset to enhance page migration by
>> 
>> [snip]
>>>
>>>>
>>>> 3. Per-caller offload selection: Today eligibility is by migrate_reason
>>>>    only. Some are latency-tolerant, others may be not. Is reason the
>>>>    right granularity, or do we want a per-caller hint?
>>>
>>> Isn't it sufficient to just do it based on the #folios or sth like that?
>>>
>>> If someone migrates a handful of folios, latency is likely more important (and
>>> batching less beneficial).
>>>
>>> I'd assume when migrating many folios, batching could just always be done. Or
>>> what's the concern?
>> 
>> IIUC, for callers like migrate_pages syscall, it's possible that almost all
>> folios of a process are passed to migrate_pages().  However, I think that
>> we still need to keep the folio inaccessible time reasonable.
>
> Wouldn't we still want to process them in batches, only affecting folios in a
> batch at one point in time, not the whole address space?

Sorry, my previous reply was confusing.  Let me try to be clearer.

I think we need to distinguish between two kinds of latency.  One is the
latency to migrate folios; the other is the latency during which a core
cannot access memory (while unmapped during migration).

IIUC, most users want the second one to be as short as possible.  One
possible user that does not care as much is the folio demoter, which
migrates cold higher-tier folios to lower tier after they have not been
accessed for some time.

---
Best Regards,
Huang, Ying

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Huang, Ying 1 month, 1 week ago

Shivank Garg <shivankg@amd.com> writes:

> This is the fifth RFC of the patchset to enhance page migration by
> batching folio-copy operations and enabling acceleration via DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration in
> modern systems with deep memory hierarchies, especially for large folios
> where copy overhead dominates, leaving significant hardware potential
> untapped.
>
> By batching the copy phase, we create an opportunity for hardware
> acceleration. This series builds the framework and provides a DMA
> offload driver (dcbm) as a reference implementation, targeting bulk
> migration workloads where offloading the copy improves throughput
> and latency while freeing the CPU cycles.
>
> See the RFC V3 cover letter [2] for motivation.
>
> Changelog since V4:
> -------------------
>
> 1. Renamed PAGE_* migration state flags to FOLIO_*. (David)
> 2. Use the new folio->migrate_info field instead of folio->private
>    for migration state. (David)
> 3. Fold folios_mc_copy patch in batch-copy implementation patch. (David)
> 3. Renamed migrate_offload_start()/stop() to register()/unregister().
>    (Huang, Ying)
> 4. Dropped should_batch() callback from struct migrator. Reason-based
>    policy now lives in migrate_pages_batch(). Migrators can still skip
>    a batch they don't want (size based policy). (Huang, Ying)
> 5. CONFIG_MIGRATION_COPY_OFFLOAD is now hidden and selected by the
>    migrator driver. CONFIG_DCBM_DMA is tristate. (Huang Ying, Gregory Price). 
> 6. Wrapped the SRCU + static_call dispatch in a small helper. (Huang, Ying)
> 7. Requir m->owner in migrate_offload_register(), SRCU sync at
>    unregister relies on it. Counters are atomic_long_t to avoid lock-order
>    issue.
> 9. Moved DCBM sysfs from /sys/kernel/dcbm to /sys/module/dcbm (Huang, Ying) 
> 10. Rebased on v7.1-rc1.
>
>
> DESIGN:
> -------
>
> New Migration Flow:
>
> [ migrate_pages_batch() ]
>     |
>     |--> do_batch = migrate_offload_do_batch(reason)  // core filters by migration reason
>     |
>     |--> for each folio:
>     |      migrate_folio_unmap()        // unmap the folio
>     |      |
>     |      +--> (success):
>     |           if do_batch && folio_supports_batch_copy():
>     |               -> unmap_batch / dst_batch  // batch list for copy offloading
>     |           else:
>     |               -> unmap_single / dst_single // single lists for per-folio CPU copy
>     |
>     |--> try_to_unmap_flush()                   // single batched TLB flush
>     |
>     |--> Batch copy (if unmap_batch not empty):
>     |    - Migrator is configurable at runtime via sysfs.
>     |
>     |      static_call(migrate_offload_copy)    // Pluggable Migrators
>     |              /          |            \
>     |             v           v             v
>     |     [ Default ]  [ DMA Offload ]  [ ... ]
>     |
>     |      On -EOPNOTSUPP or other error, batch falls back to per-folio CPU copy.
>     |
>     +--> migrate_folios_move()      // metadata, update PTEs, finalize
>          (batch list with already_copied=true, single list with false)
>
> Offload Registration:
>
>     Driver fills struct migrator { .name, .offload_copy, .owner } and calls
>     migrate_offload_register().  This:
>       - Pins the module via try_module_get()
>       - Patches the migrate_offload_copy() static_call target
>       - Enables the migrate_offload_enabled static branch
>
>     migrate_offload_unregister() disables the static branch and reverts
>     the static_call, then synchronize_srcu() waits for in-flight migrations
>     before module_put().
>
> PERFORMANCE RESULTS:
> --------------------
>
> Re-ran the V4 workload on v7.1-rc1 with this series; relative
> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
> change in V5 alters this picture; please refer to the V4 cover letter
> for the throughput tables [1].
>
>
> PLAN:
> -----
>
> Patches 1-4 (the batching infrastructure) don't depend on the migrator
> interface, so if it helps I can split them off and post them ahead of
> the migrator and DCBM bits, which still have a few open questions to
> work through.
>
> I would appreciate guidance on splitting the infrastructure portion
> ahead of the migrator interface if that matches maintainers' preference.
>
> OPEN QUESTIONS:
> ---------------
>
> 1. Should the batch path run without a registered migrator? Patches 1-4
>    are self-contained and use folios_mc_copy() (CPU). I have several
>    options like making batch path always-on for eligible folios, or
>    giving admin an option to flip the static branch, or keep the gate.
>    I'm leaning toward always-on.
>
> 2. Carrying already_copied via folio->migrate_info vs changing the
>    migrate_folio() callback signature (Huang, Ying). I went with the
>    field for now to avoid touching every fs callback before the design
>    settles. Happy to revisit.

Personally, I still prefer to change migrate_folio() callbacks for
better readability.

> 3. Per-caller offload selection: Today eligibility is by migrate_reason
>    only. Some are latency-tolerant, others may be not. Is reason the
>    right granularity, or do we want a per-caller hint?
>
> 4. Cgroup integration: How should per-cgroup be accounted for different
>    migrators (e.g.: any accounting for DMA-busy time)?
>
> 5. Tuning migrate_pages callers for offloading. For instance, in
>    compaction COMPACT_CLUSTER_MAX = 32 caps DMA's payoff for compaction
>    (V4 experiment).
>
> 6. Where do batch-size thresholds live, and how are they tuned? Per
>    Huang Ying's split, that policy lives in the migrator. DCBM has no
>    threshold today. Open whether it should later be a per-migrator
>    sysfs knob or hard-coded; probably clearer once a second migrator
>    (SDXI, mtcopy) shows the trade-off.
>
>
> FOLLOW-UPS:
> --------------
>
> 1. dmaengine_prep_dma_memcpy_sg() in DCBM (Vinod Koul). The SG-prep
>    variant cuts per-batch prep/submit cost (=CPU savings), but ptdma does
>    not implement the SG hook yet [10]. The end-to-end migration throughput
>    delta is small because per-descriptor execute time dominates.
>    I'll post the ptdma SG hook + DCBM switch as a follow-up.
>   
> 2. SDXI as a second migrator. The SDXI series [11] is in  review. SDXI is
>    a generic memcpy engine without DMA_PRIVATE, so channel acquisition
>    goes through dma_find_channel() or async_tx rather than
>    dma_request_chan_by_mask(). I have a local DCBM variant working on top
>    of the SDXI driver. I'm planning to send it as a follow-up once the
>    SDXI series settles.
>  
> 3. IOMMU SG merging in DCBM (Gregory). dma_map_sgtable() may merge
>    contiguous PFNs unevenly, so src.nents != dst.nents. DCBM falls back
>    to CPU for safety. Though I haven't seen it on Zen3 + PTDMA. I'll
>    understand this and address it a follow-up.
>  
> 4. Revisit Multi-threaded CPU copy migrator once the infra is settled.
>
> EARLIER POSTINGS:
> -----------------
> [1] RFC V4: https://lore.kernel.org/all/20260309120725.308854-3-shivankg@amd.com
> [2] RFC V3: https://lore.kernel.org/all/20250923174752.35701-1-shivankg@amd.com
> [3] RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
> [4] RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
> [5] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
>
> RELATED DISCUSSIONS:
> --------------------
> [6] MM-alignment Session [Nov 12, 2025]:
>     https://lore.kernel.org/linux-mm/bd6a3c75-b9f0-cbcf-f7c4-1ef5dff06d24@google.com
> [7] Linux Memory Hotness and Promotion call [Nov 6, 2025]:
>     https://lore.kernel.org/linux-mm/8ff2fd10-c9ac-4912-cf56-7ecd4afd2770@google.com
> [8] LSFMM 2025:
>     https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
> [9] OSS India:
>     https://ossindia2025.sched.com/event/23Jk1
> [10] DMA_MEMCPY_SG comparison:
>      https://lore.kernel.org/linux-mm/3e73addb-ac01-4a05-bc75-c6c1c56072df@amd.com
> [11] SDXI V1:
>      https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com
>
> Thanks to everyone who reviewed, tested or participated in discussions
> around this series. Your feedback helped me throughout the development
> process.
>
> Best Regards,
> Shivank
>
>
> Shivank Garg (6):
>   mm/migrate: rename PAGE_ migration flags to FOLIO_
>   mm/migrate: use migrate_info field instead of private
>   mm/migrate: skip data copy for already-copied folios
>   mm/migrate: add batch-copy path in migrate_pages_batch
>   mm/migrate: add copy offload registration infrastructure
>   drivers/migrate_offload: add DMA batch copy driver (dcbm)
>
> Zi Yan (1):
>   mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing
>
>  drivers/Kconfig                       |   2 +
>  drivers/Makefile                      |   2 +
>  drivers/migrate_offload/Kconfig       |   9 +
>  drivers/migrate_offload/Makefile      |   1 +
>  drivers/migrate_offload/dcbm/Makefile |   1 +
>  drivers/migrate_offload/dcbm/dcbm.c   | 440 ++++++++++++++++++++++++++
>  include/linux/migrate_copy_offload.h  |  44 +++
>  include/linux/mm.h                    |   2 +
>  include/linux/mm_types.h              |   1 +
>  mm/Kconfig                            |   6 +
>  mm/Makefile                           |   1 +
>  mm/migrate.c                          | 211 ++++++++----
>  mm/migrate_copy_offload.c             |  94 ++++++
>  mm/util.c                             |  30 ++
>  14 files changed, 784 insertions(+), 60 deletions(-)
>  create mode 100644 drivers/migrate_offload/Kconfig
>  create mode 100644 drivers/migrate_offload/Makefile
>  create mode 100644 drivers/migrate_offload/dcbm/Makefile
>  create mode 100644 drivers/migrate_offload/dcbm/dcbm.c
>  create mode 100644 include/linux/migrate_copy_offload.h
>  create mode 100644 mm/migrate_copy_offload.c
>
>
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731

---
Best Regards,
Huang, Ying

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by David Hildenbrand (Arm) 1 month ago

>> ---------------
>>
>> 1. Should the batch path run without a registered migrator? Patches 1-4
>>    are self-contained and use folios_mc_copy() (CPU). I have several
>>    options like making batch path always-on for eligible folios, or
>>    giving admin an option to flip the static branch, or keep the gate.
>>    I'm leaning toward always-on.
>>
>> 2. Carrying already_copied via folio->migrate_info vs changing the
>>    migrate_folio() callback signature (Huang, Ying). I went with the
>>    field for now to avoid touching every fs callback before the design
>>    settles. Happy to revisit.
> 
> Personally, I still prefer to change migrate_folio() callbacks for
> better readability.

Can that be added as a cleanup on top?

-- 
Cheers,

David

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Huang, Ying 1 month ago

"David Hildenbrand (Arm)" <david@kernel.org> writes:

>>> ---------------
>>>
>>> 1. Should the batch path run without a registered migrator? Patches 1-4
>>>    are self-contained and use folios_mc_copy() (CPU). I have several
>>>    options like making batch path always-on for eligible folios, or
>>>    giving admin an option to flip the static branch, or keep the gate.
>>>    I'm leaning toward always-on.
>>>
>>> 2. Carrying already_copied via folio->migrate_info vs changing the
>>>    migrate_folio() callback signature (Huang, Ying). I went with the
>>>    field for now to avoid touching every fs callback before the design
>>>    settles. Happy to revisit.
>> 
>> Personally, I still prefer to change migrate_folio() callbacks for
>> better readability.
>
> Can that be added as a cleanup on top?

It sounds good to me.

---
Best Regards,
Huang, Ying

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Huang, Ying 1 month, 2 weeks ago

Shivank Garg <shivankg@amd.com> writes:

> This is the fifth RFC of the patchset to enhance page migration by
> batching folio-copy operations and enabling acceleration via DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration in
> modern systems with deep memory hierarchies, especially for large folios
> where copy overhead dominates, leaving significant hardware potential
> untapped.
>
> By batching the copy phase, we create an opportunity for hardware
> acceleration. This series builds the framework and provides a DMA
> offload driver (dcbm) as a reference implementation, targeting bulk
> migration workloads where offloading the copy improves throughput
> and latency while freeing the CPU cycles.
>
> See the RFC V3 cover letter [2] for motivation.
>
> Changelog since V4:
> -------------------
>
> 1. Renamed PAGE_* migration state flags to FOLIO_*. (David)
> 2. Use the new folio->migrate_info field instead of folio->private
>    for migration state. (David)
> 3. Fold folios_mc_copy patch in batch-copy implementation patch. (David)
> 4. Renamed migrate_offload_start()/stop() to register()/unregister().
>    (Huang, Ying)
> 5. Dropped should_batch() callback from struct migrator. Reason-based
>    policy now lives in migrate_pages_batch(). Migrators can still skip
>    a batch they don't want (size based policy). (Huang, Ying)
> 6. CONFIG_MIGRATION_COPY_OFFLOAD is now hidden and selected by the
>    migrator driver. CONFIG_DCBM_DMA is tristate. (Huang Ying, Gregory Price).
> 7. Wrapped the SRCU + static_call dispatch in a small helper. (Huang, Ying)
> 8. Require m->owner in migrate_offload_register(), SRCU sync at
>    unregister relies on it. Counters are atomic_long_t to avoid lock-order
>    issue.
> 9. Moved DCBM sysfs from /sys/kernel/dcbm to /sys/module/dcbm (Huang, Ying)
> 10. Rebased on v7.1-rc1.
>
>
> DESIGN:
> -------
>
> New Migration Flow:
>
> [ migrate_pages_batch() ]
>     |
>     |--> do_batch = migrate_offload_do_batch(reason)  // core filters by migration reason
>     |
>     |--> for each folio:
>     |      migrate_folio_unmap()        // unmap the folio
>     |      |
>     |      +--> (success):
>     |           if do_batch && folio_supports_batch_copy():
>     |               -> unmap_batch / dst_batch  // batch list for copy offloading
>     |           else:
>     |               -> unmap_single / dst_single // single lists for per-folio CPU copy
>     |
>     |--> try_to_unmap_flush()                   // single batched TLB flush
>     |
>     |--> Batch copy (if unmap_batch not empty):
>     |    - Migrator is configurable at runtime via sysfs.
>     |
>     |      static_call(migrate_offload_copy)    // Pluggable Migrators
>     |              /          |            \
>     |             v           v             v
>     |     [ Default ]  [ DMA Offload ]  [ ... ]
>     |
>     |      On -EOPNOTSUPP or other error, batch falls back to per-folio CPU copy.
>     |
>     +--> migrate_folios_move()      // metadata, update PTEs, finalize
>          (batch list with already_copied=true, single list with false)
>
> Offload Registration:
>
>     Driver fills struct migrator { .name, .offload_copy, .owner } and calls
>     migrate_offload_register().  This:
>       - Pins the module via try_module_get()
>       - Patches the migrate_offload_copy() static_call target
>       - Enables the migrate_offload_enabled static branch
>
>     migrate_offload_unregister() disables the static branch and reverts
>     the static_call, then synchronize_srcu() waits for in-flight migrations
>     before module_put().
>
> PERFORMANCE RESULTS:
> --------------------
>
> Re-ran the V4 workload on v7.1-rc1 with this series; relative
> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
> change in V5 alters this picture; please refer to the V4 cover letter
> for the throughput tables [1].

IMHO, it's better to copy the performance data here.

In addition to the performance benefit, I want to know the downside as
well.  For example, the migration latency of the first folio may be
longer.  If so, by how much?  Can you measure batch size vs. total
migration time (benefit) and first-folio migration time (downside)?
That can be used to determine the optimal batch size.


---
Best Regards,
Huang, Ying

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Garg, Shivank 1 month, 1 week ago


On 4/30/2026 2:17 PM, Huang, Ying wrote:
> Shivank Garg <shivankg@amd.com> writes:

>> PERFORMANCE RESULTS:
>> --------------------
>>
>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>> change in V5 alters this picture; please refer to the V4 cover letter
>> for the throughput tables [1].
> 
> IMHO, it's better to copy performance data here.
> 
> In addition to the performance benefit, I want to know the downside as
> well.  For example, the migration latency of the first folio may be
> longer.  If so, by how much?  Can you measure the batch number vs. total
> migration time (benefit) and first folio migration time (downside)?
> That can be used to determine the optimal batch number.
> 

System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.

Benchmark: move_pages() syscall to move pages between two NUMA nodes.

1). Moving folios of different sizes with the total transfer size held
constant (1GB), using different numbers of DMA channels. Throughput in GB/s.

a. Baseline (vanilla kernel, single-threaded, serial folio_copy):

================================================================================
4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
================================================================================
3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |


b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):

================================================================================================
N channels | 4K          | 16K         | 64K         | 256K        | 1M          | 2M          |
================================================================================================
1          | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
2          | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
4          | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
8          | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
12         | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
16         | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |


2). First-folio latency: instrumented with custom tracepoints to measure
    latency per migrate_pages_batch() call.
    Result: throughput (GB/s) and first-folio latency (in microseconds),
    median of 10 runs.

A). Vanilla Kernel:

Here, n = number of folios passed to move_pages(), i.e. the workload size.
NR_MAX_BATCHED_MIGRATION is left at the upstream default of 512.

--- Order 0 (4K folios) ---
     n      vanilla/cpu
(folios)    GB/s | first(us)
--------------------------
     1       0.04 |     24
     4       0.16 |     25
     8       0.29 |     31
    16       0.54 |     27
    64       1.15 |     68
   256       1.86 |    162
   512       2.21 |    264
  2048       2.62 |    208
  4096       2.74 |    182
 16384       2.73 |    173
 65536       3.28 |    166
262144       3.20 |    167

--- Order 9 (2M folios) ---
     n      vanilla/cpu
(folios)    GB/s | first(us)
--------------------------
     1       7.05 |    194
     4       8.78 |    186
     8       8.47 |    188
    16       7.20 |    193
    64       8.23 |    191
   256      10.51 |    180
   512      10.88 |    173

Takeaway:
In each migrate_pages_batch() call, folios are first unmapped, then
try_to_unmap_flush() runs, and only then do folios enter move_to_new_folio().
So first-folio latency is bounded by the per-batch unmap+flush cost, and it
plateaus once the workload is large enough to fill a batch.


B). Patched kernel:

Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is fixed at 1 GB.
N is changed through a knob to measure the impact of different maximum batch sizes.

--- ORDER 0 (4K folios) ---
     N         offload/dma1          offload/dma4          offload/dma16
               GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
------------------------------------------------------------------------
   512         2.13 |    639         3.23 |    290         3.27 |    253
  1024         2.17 |   1261         3.44 |    582         3.58 |    536
  2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
  4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
  8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
 16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
 32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
 65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
262144         2.21 | 318424         3.12 | 170174         3.50 | 129413

--- ORDER 9 (2M folios) ---
     N         offload/dma1          offload/dma4          offload/dma16
               GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
-------------------------------------------------------------------------
   512         11.66 |    160        11.68 |    160        11.65 |    160
  1024         12.16 |    310        13.67 |    275        13.64 |    276
  2048         12.30 |    613        25.47 |    290        25.48 |    291
  4096         12.48 |   1215        26.19 |    566        42.59 |    335
  8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
 16384         12.61 |   4839        26.77 |   2218        61.94 |    896
 32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
 65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
262144         12.66 |  77694        26.85 |  35871        65.06 |  14129

In the batch-copy offload approach, the DMA copy phase is inserted between
unmap/flush and move, so a larger N increases first-folio wall-clock latency.
Throughput improves, but with diminishing returns.

For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around
N=8192-16384, because a larger batch allows the driver to distribute more
folios across the available DMA channels. This is where we get the most
throughput while keeping the first-folio latency in check.

This optimal batch value is hardware-specific. Other engines (e.g. SDXI) and
memory tiers (e.g. CXL) will likely have different curves.

Do this approach and experiment look good to you?

Thanks,
Shivank

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Huang, Ying 1 month, 1 week ago

Hi, Shivank,

"Garg, Shivank" <shivankg@amd.com> writes:

> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>> Shivank Garg <shivankg@amd.com> writes:
>
>>> PERFORMANCE RESULTS:
>>> --------------------
>>>
>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>> change in V5 alters this picture; please refer to the V4 cover letter
>>> for the throughput tables [1].
>> 
>> IMHO, it's better to copy performance data here.
>> 
>> In addition to the performance benefit, I want to know the downside as
>> well.  For example, the migration latency of the first folio may be
>> longer.  If so, by how much?  Can you measure the batch number vs. total
>> migration time (benefit) and first folio migration time (downside)?
>> That can be used to determine the optimal batch number.
>> 
>
> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>
> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>
> 1). Moving different sized folios such that total transfer size is constant
> (1GB), with different number of DMA channels. Throughput in GB/s.
>
> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>
> ================================================================================
> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
> ================================================================================
> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>
>
> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>
> ============================================================================================
> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
> ============================================================================================
> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
>
>
> 2).  First-folio latency: Instrumented with custom tracepoints to measure latency per migrate_pages_batch() call.
>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.

Thanks for the detailed data.  Per my understanding, the run time of
migrate_pages_batch() may not be good enough for measuring first-folio
latency.  IIUC, the migration procedure is something like,

  for each folio
        unmap
  flush
  for each folio
        copy
        remap ===> first folio migrated

A tracepoint at that point would be better for measuring it.

> A). Vanilla Kernel:
>
> Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>
> --- Order 0 (4K folios) ---
>      n      vanilla/cpu
> (folios)    GB/s | first(us)
> --------------------------
>      1       0.04 |     24
>      4       0.16 |     25
>      8       0.29 |     31
>     16       0.54 |     27
>     64       1.15 |     68
>    256       1.86 |    162
>    512       2.21 |    264
>   2048       2.62 |    208
>   4096       2.74 |    182
>  16384       2.73 |    173
>  65536       3.28 |    166
> 262144       3.20 |    167
>
> --- Order 9 (2M folios) ---
>      n      vanilla/cpu
> (folios)    GB/s | first(us)
> --------------------------
>      1       7.05 |    194
>      4       8.78 |    186
>      8       8.47 |    188
>     16       7.20 |    193
>     64       8.23 |    191
>    256      10.51 |    180
>    512      10.88 |    173
>
> Takeaway:
> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
> and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
> per-batch unmap+flush cost, and then plateaus once workload is large enough.
>
>
> B). Patched kernel:
>
> Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.

Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
needs to be bounded.  If it is too large, too many pages may stay in an
inaccessible state for too long.  That will hurt workload performance,
even though it is optimal for migration performance.


---
Best Regards,
Huang, Ying

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Garg, Shivank 1 month, 1 week ago


On 5/8/2026 4:58 PM, Huang, Ying wrote:
> Hi, Shivank,
> 
> "Garg, Shivank" <shivankg@amd.com> writes:
> 
>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>> Shivank Garg <shivankg@amd.com> writes:
>>
>>>> PERFORMANCE RESULTS:
>>>> --------------------
>>>>
>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>> for the throughput tables [1].
>>>
>>> IMHO, it's better to copy performance data here.
>>>
>>> In addition to the performance benefit, I want to know the downside as
>>> well.  For example, the migration latency of the first folio may be
>>> longer.  If so, by how much?  Can you measure the batch number vs. total
>>> migration time (benefit) and first folio migration time (downside)?
>>> That can be used to determine the optimal batch number.
>>>
>>
>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>
>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>
>> 1). Moving different sized folios such that total transfer size is constant
>> (1GB), with different number of DMA channels. Throughput in GB/s.
>>
>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>
>> ================================================================================
>> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
>> ================================================================================
>> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>>
>>
>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>
>> ============================================================================================
>> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
>> ============================================================================================
>> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
>> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
>> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
>> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
>> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
>> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
>>
>>
>> 2).  First-folio latency: Instrumented with custom tracepoints to measure latency per migrate_pages_batch() call.
>>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
> 
> Thanks for detailed data.  Per my understanding, the run time of
> migrate_pages_batch() may be not good enough for measuring first folio
> latency.  IIUC, the migration procedure is something like,
> 
>   for each folio
>         unmap
>   flush
>   for each folio
>         copy
>         remap ===> first folio migrated
> 
> Some tracepoint should be better to measure it.

Sorry, my earlier write-up was unclear.
For first-folio latency, I added two tracepoints: one at the start of migrate_pages_batch()
and one in migrate_folio_done().

I agree that the tracepoint for the user-accessible point should be right after remove_migration_ptes().
However, migrate_folio_done() runs only a few operations later and adds a constant
offset, so it's unlikely to change the shape of the trade-off curve.
I'll move the tracepoint to right after remove_migration_ptes() in the next posting.

> 
>> A). Vanilla Kernel:
>>
>> Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
>> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>>
>> --- Order 0 (4K folios) ---
>>      n      vanilla/cpu
>> (folios)    GB/s | first(us)
>> --------------------------
>>      1       0.04 |     24
>>      4       0.16 |     25
>>      8       0.29 |     31
>>     16       0.54 |     27
>>     64       1.15 |     68
>>    256       1.86 |    162
>>    512       2.21 |    264
>>   2048       2.62 |    208
>>   4096       2.74 |    182
>>  16384       2.73 |    173
>>  65536       3.28 |    166
>> 262144       3.20 |    167
>>
>> --- Order 9 (2M folios) ---
>>      n      vanilla/cpu
>> (folios)    GB/s | first(us)
>> --------------------------
>>      1       7.05 |    194
>>      4       8.78 |    186
>>      8       8.47 |    188
>>     16       7.20 |    193
>>     64       8.23 |    191
>>    256      10.51 |    180
>>    512      10.88 |    173
>>
>> Takeaway:
>> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
>> and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
>> per-batch unmap+flush cost, and then plateaus once workload is large enough.
>>
>>
>> B). Patched kernel:
>>
>> Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.
> 
> Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
> needs to be bounded.  If it is too large, too many pages may be in an
> inaccessible state for a longer time.  That will hurt the workload
> performance, although it is optimal for migration performance.
> 

Agreed, it must be bounded.

>> Change N with a knob to measure impact of different max batched size.
>>
>> --- ORDER 0 (4K folios) ---
>>      N         offload/dma1          offload/dma4          offload/dma16
>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>> ------------------------------------------------------------------------
>>    512         2.13 |    639         3.23 |    290         3.27 |    253
>>   1024         2.17 |   1261         3.44 |    582         3.58 |    536
>>   2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
>>   4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
>>   8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
>>  16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
>>  32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
>>  65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
>> 131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
>> 262144         2.21 | 318424         3.12 | 170174         3.50 | 129413
>>
>> --- ORDER 9 (2M folios) ---
>>      N         offload/dma1          offload/dma4          offload/dma16
>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>> -------------------------------------------------------------------------
>>    512         11.66 |    160        11.68 |    160        11.65 |    160
>>   1024         12.16 |    310        13.67 |    275        13.64 |    276
>>   2048         12.30 |    613        25.47 |    290        25.48 |    291
>>   4096         12.48 |   1215        26.19 |    566        42.59 |    335
>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
>>  16384         12.61 |   4839        26.77 |   2218        61.94 |    896
>>  32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
>>  65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
>> 131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
>> 262144         12.66 |  77694        26.85 |  35871        65.06 |  14129
>>
>> In the batch-copy offload approach, DMA copy phase is inserted between unmap/flush and move,
>> So larger N increases first-folio wall clock latency. Throughput improves but with diminishing
>> returns.
>>
>> For DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
>> because a larger batch allows the driver to distribute more folios across available DMA channels.
>> This is where we get most throughput while keeping the first folio latency in check.
>>
>> This optimal batch value is hardware-specific. Other engines (eg. SDXI) and memory tier (eg. CXL)
>> will likely have different curves.
>>
>> Does this approach and experiment look good to you?
> 
> ---
> Best Regards,
> Huang, Ying

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Huang, Ying 1 month, 1 week ago

"Garg, Shivank" <shivankg@amd.com> writes:

> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>> Hi, Shivank,
>> 
>> "Garg, Shivank" <shivankg@amd.com> writes:
>> 
>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>> Shivank Garg <shivankg@amd.com> writes:
>>>
>>>>> PERFORMANCE RESULTS:
>>>>> --------------------
>>>>>
>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>> for the throughput tables [1].
>>>>
>>>> IMHO, it's better to copy performance data here.
>>>>
>>>> In addition to the performance benefit, I want to know the downside as
>>>> well.  For example, the migration latency of the first folio may be
>>>> longer.  If so, by how much?  Can you measure the batch number vs. total
>>>> migration time (benefit) and first folio migration time (downside)?
>>>> That can be used to determine the optimal batch number.
>>>>
>>>
>>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>
>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>
>>> 1). Moving different sized folios such that total transfer size is constant
>>> (1GB), with different number of DMA channels. Throughput in GB/s.
>>>
>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>
>>> ================================================================================
>>> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
>>> ================================================================================
>>> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>>>
>>>
>>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>>
>>> ============================================================================================
>>> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
>>> ============================================================================================
>>> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
>>> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
>>> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
>>> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
>>> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
>>> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
>>>
>>>
>>> 2).  First-folio latency: Instrumented with custom tracepoints to
>>> measure latency per migrate_pages_batch() call.
>>>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>> 
>> Thanks for detailed data.  Per my understanding, the run time of
>> migrate_pages_batch() may be not good enough for measuring first folio
>> latency.  IIUC, the migration procedure is something like,
>> 
>>   for each folio
>>         unmap
>>   flush
>>   for each folio
>>         copy
>>         remap ===> first folio migrated
>> 
>> Some tracepoint should be better to measure it.
>
> Sorry, my earlier write-up was unclear.
> For first folio latency, I add two tracepoints: one at the start of migrate_pages_batch()
> and one in migrate_folio_done(). 
>
> I agree that the user-accessible point tracepoint should be right after remove_migration_ptes().
> Though, migrate_folio_done() runs only a few operations later, and will have a constant
> offset, so it's unlikely to change the shape of the trade-off curve.
> I'll move the tracepoint right after remove_migration_ptes() for new posting.

Thanks for the explanation.  A tracepoint in migrate_folio_done() should be OK.

>> 
>>> A). Vanilla Kernel:
>>>
>>> Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
>>> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>>>
>>> --- Order 0 (4K folios) ---
>>>      n      vanilla/cpu
>>> (folios)    GB/s | first(us)
>>> --------------------------
>>>      1       0.04 |     24
>>>      4       0.16 |     25
>>>      8       0.29 |     31
>>>     16       0.54 |     27
>>>     64       1.15 |     68
>>>    256       1.86 |    162
>>>    512       2.21 |    264
>>>   2048       2.62 |    208
>>>   4096       2.74 |    182
>>>  16384       2.73 |    173
>>>  65536       3.28 |    166
>>> 262144       3.20 |    167
>>>
>>> --- Order 9 (2M folios) ---
>>>      n      vanilla/cpu
>>> (folios)    GB/s | first(us)
>>> --------------------------
>>>      1       7.05 |    194
>>>      4       8.78 |    186
>>>      8       8.47 |    188
>>>     16       7.20 |    193
>>>     64       8.23 |    191
>>>    256      10.51 |    180
>>>    512      10.88 |    173
>>>
>>> Takeaway:
>>> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
>>> and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
>>> per-batch unmap+flush cost, and then plateaus once workload is large enough.
>>>
>>>
>>> B). Patched kernel:
>>>
>>> Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.
>> 
>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
>> needs to be bounded.  If it is too large, too many pages may be in an
>> inaccessible state for a longer time.  That will hurt the workload
>> performance, although it is optimal for migration performance.
>> 
>
> Agreed, it must be bounded.

Thanks!  Could you retest with a bounded NR_MAX_BATCHED_MIGRATION?  If the
upstream default doesn't work well for you, we can find a better one
that balances throughput and latency well.

>>> Change N with a knob to measure impact of different max batched size.
>>>
>>> --- ORDER 0 (4K folios) ---
>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>> ------------------------------------------------------------------------
>>>    512         2.13 |    639         3.23 |    290         3.27 |    253
>>>   1024         2.17 |   1261         3.44 |    582         3.58 |    536
>>>   2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
>>>   4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
>>>   8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
>>>  16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
>>>  32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
>>>  65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
>>> 131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
>>> 262144         2.21 | 318424         3.12 | 170174         3.50 | 129413
>>>
>>> --- ORDER 9 (2M folios) ---
>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>> -------------------------------------------------------------------------
>>>    512         11.66 |    160        11.68 |    160        11.65 |    160
>>>   1024         12.16 |    310        13.67 |    275        13.64 |    276
>>>   2048         12.30 |    613        25.47 |    290        25.48 |    291
>>>   4096         12.48 |   1215        26.19 |    566        42.59 |    335
>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
>>>  16384         12.61 |   4839        26.77 |   2218        61.94 |    896
>>>  32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
>>>  65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
>>> 131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
>>> 262144         12.66 |  77694        26.85 |  35871        65.06 |  14129
>>>
>>> In the batch-copy offload approach, DMA copy phase is inserted between unmap/flush and move,
>>> So larger N increases first-folio wall clock latency. Throughput improves but with diminishing
>>> returns.
>>>
>>> For DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
>>> because a larger batch allows the driver to distribute more folios across available DMA channels.
>>> This is where we get most throughput while keeping the first folio latency in check.
>>>
>>> This optimal batch value is hardware-specific. Other engines (eg. SDXI) and memory tier (eg. CXL)
>>> will likely have different curves.
>>>
>>> Does this approach and experiment look good to you?

---
Best Regards,
Huang, Ying

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Garg, Shivank 1 month, 1 week ago


On 5/9/2026 1:19 PM, Huang, Ying wrote:
> "Garg, Shivank" <shivankg@amd.com> writes:
> 
>> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>>> Hi, Shivank,
>>>
>>> "Garg, Shivank" <shivankg@amd.com> writes:
>>>
>>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>>> Shivank Garg <shivankg@amd.com> writes:
>>>>
>>>>>> PERFORMANCE RESULTS:
>>>>>> --------------------
>>>>>>
>>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>>> for the throughput tables [1].
>>>>>
>>>>> IMHO, it's better to copy performance data here.
>>>>>
>>>>> In addition to the performance benefit, I want to know the downside as
>>>>> well.  For example, the migration latency of the first folio may be
>>>>> longer.  If so, by how much?  Can you measure the batch number vs. total
>>>>> migration time (benefit) and first folio migration time (downside)?
>>>>> That can be used to determine the optimal batch number.
>>>>>
>>>>
>>>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>>
>>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>>
>>>> 1). Moving different sized folios such that total transfer size is constant
>>>> (1GB), with different number of DMA channels. Throughput in GB/s.
>>>>
>>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>>
>>>> ================================================================================
>>>> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
>>>> ================================================================================
>>>> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>>>>
>>>>
>>>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>>>
>>>> ============================================================================================
>>>> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
>>>> ============================================================================================
>>>> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
>>>> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
>>>> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
>>>> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
>>>> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
>>>> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
>>>>
>>>>
>>>> 2).  First-folio latency: Instrumented with custom tracepoints to
>>>> measure latency per migrate_pages_batch() call.
>>>>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>>>
>>> Thanks for detailed data.  Per my understanding, the run time of
>>> migrate_pages_batch() may be not good enough for measuring first folio
>>> latency.  IIUC, the migration procedure is something like,
>>>
>>>   for each folio
>>>         unmap
>>>   flush
>>>   for each folio
>>>         copy
>>>         remap ===> first folio migrated
>>>
>>> Some tracepoint should be better to measure it.
>>
>> Sorry, my earlier write-up was unclear.
>> For first folio latency, I add two tracepoints: one at the start of migrate_pages_batch()
>> and one in migrate_folio_done(). 
>>
>> I agree that the user-accessible point tracepoint should be right after remove_migration_ptes().
>> Though, migrate_folio_done() runs only a few operations later, and will have a constant
>> offset, so it's unlikely to change the shape of the trade-off curve.
>> I'll move the tracepoint right after remove_migration_ptes() for new posting.
> 
> Thanks for explanation.  Trace point in migrate_folio_done() should be OK.
> 
>>>
>>>> A). Vanilla Kernel:
>>>>
>>>> Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
>>>> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>>>>
>>>> --- Order 0 (4K folios) ---
>>>>      n      vanilla/cpu
>>>> (folios)    GB/s | first(us)
>>>> --------------------------
>>>>      1       0.04 |     24
>>>>      4       0.16 |     25
>>>>      8       0.29 |     31
>>>>     16       0.54 |     27
>>>>     64       1.15 |     68
>>>>    256       1.86 |    162
>>>>    512       2.21 |    264
>>>>   2048       2.62 |    208
>>>>   4096       2.74 |    182
>>>>  16384       2.73 |    173
>>>>  65536       3.28 |    166
>>>> 262144       3.20 |    167
>>>>
>>>> --- Order 9 (2M folios) ---
>>>>      n      vanilla/cpu
>>>> (folios)    GB/s | first(us)
>>>> --------------------------
>>>>      1       7.05 |    194
>>>>      4       8.78 |    186
>>>>      8       8.47 |    188
>>>>     16       7.20 |    193
>>>>     64       8.23 |    191
>>>>    256      10.51 |    180
>>>>    512      10.88 |    173
>>>>
>>>> Takeaway:
>>>> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
>>>> and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
>>>> per-batch unmap+flush cost, and then plateaus once workload is large enough.
>>>>
>>>>
>>>> B). Patched kernel:
>>>>
>>>> Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.
>>>
>>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
>>> needs to be bounded.  If it is too large, too many pages may be in an
>>> inaccessible state for a longer time.  That will hurt the workload
>>> performance, although it is optimal for migration performance.
>>>
>>
>> Agreed, it must be bounded.
> 
> Thanks!  Could you retest with bounded NR_MAX_BATCHED_MIGRATION.  If the
> upstream default doesn't work well for you.  We can find a better one
> that balances throughput and latency well.
> 

Thanks. The tables below sweep NR_MAX_BATCHED_MIGRATION from 512 up to 262144. For 2M folios
with 16-channel PTDMA, the knee is at N=8192-16384 (= {16 to 32} * 512).
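
One way to make the "knee" reading reproducible is to look at the relative
throughput gain from each doubling of N. The sketch below does that for the
dma16 column of the ORDER 9 table in this thread. The function name and the
gain thresholds are illustrative assumptions, not something the series defines.

```python
# Order-9 (2M folio) dma16 rows copied from the table in this thread:
# (N in pages, throughput in GB/s, first-folio latency in us)
DMA16_ORDER9 = [
    (512, 11.65, 160),
    (1024, 13.64, 276),
    (2048, 25.48, 291),
    (4096, 42.59, 335),
    (8192, 58.72, 470),
    (16384, 61.94, 896),
    (32768, 63.75, 1748),
    (65536, 60.66, 3543),
    (131072, 61.06, 7178),
    (262144, 65.06, 14129),
]


def find_knee(rows, min_gain=0.10):
    """Return the first N whose next doubling gains less than min_gain.

    min_gain is a hypothetical threshold (relative throughput gain)
    chosen only for illustration.
    """
    for (n, gbps, _), (_, next_gbps, _) in zip(rows, rows[1:]):
        if (next_gbps - gbps) / gbps < min_gain:
            return n
    # Throughput kept improving by at least min_gain up to the last row.
    return rows[-1][0]


print(find_knee(DMA16_ORDER9, min_gain=0.10))
print(find_knee(DMA16_ORDER9, min_gain=0.05))
```

With a 10% threshold the knee lands on N=8192, and with 5% on N=16384, which
brackets the range quoted above.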

>>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *

One thing worth flagging on the "bounded default": at the upstream cap of 512 pages,
migrate_pages_batch() receives at most one 2M folio per call, so PTDMA can use only
one of its 16 channels per batch and the offload effectively reduces to vanilla
(DCBM offloads one 2M folio to each channel).
The larger-N rows are what exercise channel parallelism in the PTDMA case.
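
The arithmetic behind this can be sketched as follows. This is a simplified
model, assuming one 2M folio per channel as described above and a batch cap
of at least one folio; channels_in_use() is a hypothetical helper, not a
function from the series.

```python
PAGES_PER_2M_FOLIO = 512  # 2 MiB folio / 4 KiB base page


def channels_in_use(batch_pages, nr_channels,
                    pages_per_folio=PAGES_PER_2M_FOLIO):
    """Channels one batch can keep busy, assuming one folio per channel.

    Simplified model: assumes batch_pages >= pages_per_folio and ignores
    any per-channel queuing of additional folios.
    """
    folios_per_batch = batch_pages // pages_per_folio
    return min(nr_channels, folios_per_batch)


for n in (512, 2048, 4096, 8192, 16384):
    print(n, channels_in_use(n, nr_channels=16))
```

Under this model N=2048 keeps 4 channels busy, which is consistent with dma4
and dma16 reporting nearly the same throughput in the N=2048 row, and N=8192
is the first value that can feed all 16 channels.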

SDXI-like [1] memory-to-memory data movers should reach good throughput with just 1 channel,
and thus may not require increasing NR_MAX_BATCHED_MIGRATION.

I'm not tying this series to a specific perf default for now. The design review (batch-copy
path, migrator interface, registration, static_call dispatch) is the part I'd like to converge
on first, and then tune the threshold afterwards. Does that ordering work?

[1] https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com

Best regards,
Shivank

>>>> Change N with a knob to measure impact of different max batched size.
>>>>
>>>> --- ORDER 0 (4K folios) ---
>>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>>> ------------------------------------------------------------------------
>>>>    512         2.13 |    639         3.23 |    290         3.27 |    253
>>>>   1024         2.17 |   1261         3.44 |    582         3.58 |    536
>>>>   2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
>>>>   4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
>>>>   8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
>>>>  16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
>>>>  32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
>>>>  65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
>>>> 131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
>>>> 262144         2.21 | 318424         3.12 | 170174         3.50 | 129413
>>>>
>>>> --- ORDER 9 (2M folios) ---
>>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>>> -------------------------------------------------------------------------
>>>>    512         11.66 |    160        11.68 |    160        11.65 |    160
>>>>   1024         12.16 |    310        13.67 |    275        13.64 |    276
>>>>   2048         12.30 |    613        25.47 |    290        25.48 |    291
>>>>   4096         12.48 |   1215        26.19 |    566        42.59 |    335
>>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
>>>>  16384         12.61 |   4839        26.77 |   2218        61.94 |    896
>>>>  32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
>>>>  65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
>>>> 131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
>>>> 262144         12.66 |  77694        26.85 |  35871        65.06 |  14129
>>>>
>>>> In the batch-copy offload approach, DMA copy phase is inserted between unmap/flush and move,
>>>> So larger N increases first-folio wall clock latency. Throughput improves but with diminishing
>>>> returns.
>>>>
>>>> For DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
>>>> because a larger batch allows the driver to distribute more folios across available DMA channels.
>>>> This is where we get most throughput while keeping the first folio latency in check.
>>>>
>>>> This optimal batch value is hardware-specific. Other engines (eg. SDXI) and memory tier (eg. CXL)
>>>> will likely have different curves.
>>>>
>>>> Does this approach and experiment look good to you?
> 
> ---
> Best Regards,
> Huang, Ying

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Huang, Ying 1 month ago

"Garg, Shivank" <shivankg@amd.com> writes:

> On 5/9/2026 1:19 PM, Huang, Ying wrote:
>> "Garg, Shivank" <shivankg@amd.com> writes:
>> 
>>> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>>>> Hi, Shivank,
>>>>
>>>> "Garg, Shivank" <shivankg@amd.com> writes:
>>>>
>>>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>>>> Shivank Garg <shivankg@amd.com> writes:
>>>>>
>>>>>>> PERFORMANCE RESULTS:
>>>>>>> --------------------
>>>>>>>
>>>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>>>> for the throughput tables [1].
>>>>>>
>>>>>> IMHO, it's better to copy performance data here.
>>>>>>
>>>>>> In addition to the performance benefit, I want to know the downside as
>>>>>> well.  For example, the migration latency of the first folio may be
>>>>>> longer.  If so, by how much?  Can you measure the batch number vs. total
>>>>>> migration time (benefit) and first folio migration time (downside)?
>>>>>> That can be used to determine the optimal batch number.
>>>>>>
>>>>>
>>>>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>>>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>>>
>>>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>>>
>>>>> 1). Moving different sized folios such that total transfer size is constant
>>>>> (1GB), with different number of DMA channels. Throughput in GB/s.
>>>>>
>>>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>>>
>>>>> ================================================================================
>>>>> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
>>>>> ================================================================================
>>>>> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>>>>>
>>>>>
>>>>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>>>>
>>>>> ============================================================================================
>>>>> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
>>>>> ============================================================================================
>>>>> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
>>>>> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
>>>>> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
>>>>> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
>>>>> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
>>>>> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
>>>>>
>>>>>
>>>>> 2).  First-folio latency: Instrumented with custom tracepoints to
>>>>> measure latency per migrate_pages_batch() call.
>>>>>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>>>>
>>>> Thanks for detailed data.  Per my understanding, the run time of
>>>> migrate_pages_batch() may be not good enough for measuring first folio
>>>> latency.  IIUC, the migration procedure is something like,
>>>>
>>>>   for each folio
>>>>         unmap
>>>>   flush
>>>>   for each folio
>>>>         copy
>>>>         remap ===> first folio migrated
>>>>
>>>> Some tracepoint should be better to measure it.
>>>
>>> Sorry, my earlier write-up was unclear.
>>> For first folio latency, I add two tracepoints: one at the start of migrate_pages_batch()
>>> and one in migrate_folio_done(). 
>>>
>>> I agree that the user-accessible point tracepoint should be right after remove_migration_ptes().
>>> Though, migrate_folio_done() runs only a few operations later, and will have a constant
>>> offset, so it's unlikely to change the shape of the trade-off curve.
>>> I'll move the tracepoint right after remove_migration_ptes() for new posting.
>> 
>> Thanks for explanation.  Trace point in migrate_folio_done() should be OK.
>> 
>>>>
>>>>> A). Vanilla Kernel:
>>>>>
>>>>> Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
>>>>> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>>>>>
>>>>> --- Order 0 (4K folios) ---
>>>>>      n      vanilla/cpu
>>>>> (folios)    GB/s | first(us)
>>>>> --------------------------
>>>>>      1       0.04 |     24
>>>>>      4       0.16 |     25
>>>>>      8       0.29 |     31
>>>>>     16       0.54 |     27
>>>>>     64       1.15 |     68
>>>>>    256       1.86 |    162
>>>>>    512       2.21 |    264
>>>>>   2048       2.62 |    208
>>>>>   4096       2.74 |    182
>>>>>  16384       2.73 |    173
>>>>>  65536       3.28 |    166
>>>>> 262144       3.20 |    167
>>>>>
>>>>> --- Order 9 (2M folios) ---
>>>>>      n      vanilla/cpu
>>>>> (folios)    GB/s | first(us)
>>>>> --------------------------
>>>>>      1       7.05 |    194
>>>>>      4       8.78 |    186
>>>>>      8       8.47 |    188
>>>>>     16       7.20 |    193
>>>>>     64       8.23 |    191
>>>>>    256      10.51 |    180
>>>>>    512      10.88 |    173
>>>>>
>>>>> Takeaway:
>>>>> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
>>>>> and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
>>>>> per-batch unmap+flush cost, and then plateaus once workload is large enough.
>>>>>
>>>>>
>>>>> B). Patched kernel:
>>>>>
>>>>> Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.
>>>>
>>>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
>>>> needs to be bounded.  If it is too large, too many pages may be in an
>>>> inaccessible state for a longer time.  That will hurt the workload
>>>> performance, although it is optimal for migration performance.
>>>>
>>>
>>> Agreed, it must be bounded.
>> 
>> Thanks!  Could you retest with bounded NR_MAX_BATCHED_MIGRATION.  If the
>> upstream default doesn't work well for you.  We can find a better one
>> that balances throughput and latency well.
>> 
>
> Thanks. Below tables sweep NR_MAX_BATCHED_MIGRATION from 512 up to 262144. On 2M folios,
> 16-channel PTDMA, the knee is at N=8192-16384 (= {16 to 32} * 512 ).
>
>>>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *

        2048         12.30 |    613        25.47 |    290        25.48 |    291

IIUC, N=2048 already helps dma4, and the latency looks OK too.  Is the
good batch size hardware-configuration dependent too?  If so, we may
need to add another migrator callback for that.

> One thing worth flagging on the "bounded default": at the upstream cap of 512 pages,
> migrate_pages_batch() receives at most one 2M folio per call, so PTDMA can only use
> one of its 16 channels per batch and the offload reduces to vanilla. (DCBM offloads
> one 2M folio to each channel).
> The larger-N rows are what exercise the channel parallelism for PTDMA case.
>
> "SDXI"[1] like memory-to-memory data movers should reach good throughput with just 1 channel, 
> and thus may not require increasing the NR_MAX_BATCHED_MIGRATION for good throughput.
>
> I'm not tying this series to a specific perf default for now. The design review (batch-copy
> path, migrator interface, registration, static_call dispatch) is the part I'd like to converge
> on first, then tune the threshold after it. Does that ordering work?

IMHO, we need some performance data to justify the added complexity.
So, threshold tuning isn't the goal; whether we can get better
throughput with some bounded latency is.

> [1] https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com
>
> Best regards,
> Shivank
>
>>>>> Change N with a knob to measure impact of different max batched size.
>>>>>
>>>>> --- ORDER 0 (4K folios) ---
>>>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>>>> ------------------------------------------------------------------------
>>>>>    512         2.13 |    639         3.23 |    290         3.27 |    253
>>>>>   1024         2.17 |   1261         3.44 |    582         3.58 |    536
>>>>>   2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
>>>>>   4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
>>>>>   8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
>>>>>  16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
>>>>>  32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
>>>>>  65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
>>>>> 131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
>>>>> 262144         2.21 | 318424         3.12 | 170174         3.50 | 129413
>>>>>
>>>>> --- ORDER 9 (2M folios) ---
>>>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>>>> -------------------------------------------------------------------------
>>>>>    512         11.66 |    160        11.68 |    160        11.65 |    160
>>>>>   1024         12.16 |    310        13.67 |    275        13.64 |    276
>>>>>   2048         12.30 |    613        25.47 |    290        25.48 |    291
>>>>>   4096         12.48 |   1215        26.19 |    566        42.59 |    335
>>>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
>>>>>  16384         12.61 |   4839        26.77 |   2218        61.94 |    896
>>>>>  32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
>>>>>  65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
>>>>> 131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
>>>>> 262144         12.66 |  77694        26.85 |  35871        65.06 |  14129
>>>>>
>>>>> In the batch-copy offload approach, the DMA copy phase is inserted between unmap/flush and move,
>>>>> so a larger N increases first-folio wall-clock latency. Throughput improves, but with diminishing
>>>>> returns.
>>>>>
>>>>> For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
>>>>> because a larger batch allows the driver to distribute more folios across available DMA channels.
>>>>> This is where we get the most throughput while keeping the first-folio latency in check.
>>>>>
>>>>> This optimal batch value is hardware-specific. Other engines (e.g. SDXI) and memory tiers (e.g. CXL)
>>>>> will likely have different curves.
>>>>>
>>>>> Does this approach and experiment look good to you?
>> 

---
Best Regards,
Huang, Ying

Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

Posted by Garg, Shivank 3 weeks, 6 days ago


On 5/12/2026 7:45 AM, Huang, Ying wrote:
> "Garg, Shivank" <shivankg@amd.com> writes:
> 
>> On 5/9/2026 1:19 PM, Huang, Ying wrote:
>>> "Garg, Shivank" <shivankg@amd.com> writes:
>>>
>>>> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>>>>> Hi, Shivank,
>>>>>
>>>>> "Garg, Shivank" <shivankg@amd.com> writes:
>>>>>
>>>>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>>>>> Shivank Garg <shivankg@amd.com> writes:
>>>>>>


>> Thanks. The tables below sweep NR_MAX_BATCHED_MIGRATION from 512 up to 262144. On 2M folios
>> with 16-channel PTDMA, the knee is at N=8192-16384 (= {16 to 32} * 512).
>>
>>>>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
> 
>         2048         12.30 |    613        25.47 |    290        25.48 |    291
> 
> IIUC, N=2048 already helps dma4.  And, the latency looks OK too.  The
> good batch size is hardware configuration dependent too?  If so, we may
> need to add another migrator callback for that.

Yeah, right. 

>> One thing worth flagging on the "bounded default": at the upstream cap of 512 pages,
>> migrate_pages_batch() receives at most one 2M folio per call, so PTDMA can only use
>> one of its 16 channels per batch and the offload reduces to vanilla. (DCBM offloads
>> one 2M folio to each channel).
>> The larger-N rows are what exercise the channel parallelism for PTDMA case.
>>
>> "SDXI"[1] like memory-to-memory data movers should reach good throughput with just 1 channel, 
>> and thus may not require increasing the NR_MAX_BATCHED_MIGRATION for good throughput.
>>
>> I'm not tying this series to a specific perf default for now. The design review (batch-copy
>> path, migrator interface, registration, static_call dispatch) is the part I'd like to converge
>> on first, then tune the threshold after it. Does that ordering work?
> 
> IMHO, we need some performance data to justify the added complexity.
> So, threshold tuning isn't the goal; whether we can get better
> throughput with some bounded latency is.
> 

Fair point. 

As you pointed out, PTDMA with 4 channels improves both throughput and
inaccessible time. PTDMA with 16 channels could be preferred by those who
value throughput more. And this differs across hardware. So, I think
another callback for this sounds like a good idea.
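
A minimal sketch of how such a callback could feed the batch-size choice,
written as a user-space model rather than kernel code. Everything here is
hypothetical: struct migrator_sketch, preferred_batch_pages(),
pick_batch_pages() and the 16384-page upper bound are placeholders for
illustration, not the interface in this series. The 512-page default is
the upstream cap mentioned above, and the example preferences (2048 for 4
channels, 8192 for 16 channels) are read off the 2M-folio table.

```c
/* Hypothetical model of a migrator with a batch-size hint. */
struct migrator_sketch {
	const char *name;
	/* Hypothetical callback: preferred batch size in pages. */
	unsigned long (*preferred_batch_pages)(void);
};

#define DEFAULT_BATCH_PAGES	512UL	/* upstream cap noted in this thread */
#define MAX_BATCH_PAGES_BOUND	16384UL	/* assumed bound, for illustration only */

/* Pick a batch size: the migrator's hint, clamped to [default, bound]. */
static unsigned long pick_batch_pages(const struct migrator_sketch *m)
{
	unsigned long n;

	if (!m || !m->preferred_batch_pages)
		return DEFAULT_BATCH_PAGES;

	n = m->preferred_batch_pages();
	if (n < DEFAULT_BATCH_PAGES)
		return DEFAULT_BATCH_PAGES;
	if (n > MAX_BATCH_PAGES_BOUND)
		return MAX_BATCH_PAGES_BOUND;
	return n;
}

/* Example hints taken from the 2M-folio table in this thread. */
static unsigned long ptdma4_pref(void)  { return 2048; }
static unsigned long ptdma16_pref(void) { return 8192; }
/* A hint above the bound, to show the clamp. */
static unsigned long greedy_pref(void)  { return 262144; }
```

The point of the clamp is that the core, not the driver, keeps the final
say on how many pages may be inaccessible at once, while the driver only
reports what its hardware can make use of.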

Thanks,
Shivank