[RFC PATCH 0/4] Migrate on fault for device pages

Mika Penttilä posted 4 patches 1 month, 3 weeks ago
include/linux/hmm.h                    |  10 +-
include/linux/migrate.h                |   6 +-
lib/test_hmm.c                         | 105 ++++++-
lib/test_hmm_uapi.h                    |  17 +-
mm/hmm.c                               | 351 +++++++++++++++++++++-
mm/huge_memory.c                       |   6 +-
mm/migrate_device.c                    | 384 +++++--------------------
mm/rmap.c                              |   4 +-
tools/testing/selftests/mm/hmm-tests.c |  53 ++++
9 files changed, 592 insertions(+), 344 deletions(-)
[RFC PATCH 0/4] Migrate on fault for device pages
Posted by Mika Penttilä 1 month, 3 weeks ago
Currently, device page faulting and migration do not work
optimally when you want to do both fault handling and migration
at once.

Migrating non-present pages (or pages mapped with incorrect
permissions, e.g. COW) to the GPU requires one of the following
sequences:

1. hmm_range_fault() - fault in non-present pages with correct
   permissions, etc.
2. migrate_vma_*() - migrate the pages

Or:

1. migrate_vma_*() - migrate present pages
2. If non-present pages detected by migrate_vma_*():
   a) call hmm_range_fault() to fault pages in
   b) call migrate_vma_*() again to migrate now present pages

The problem with the first sequence is that you always have to do two
page walks, even though most of the time the pages are present or zero
page mappings, so the common case takes a performance hit.

The second sequence is better for the common case, but far worse if
pages aren't present because now you have to walk the page tables three
times (once to find the page is not present, once so hmm_range_fault()
can find a non-present page to fault in and once again to set up the
migration). It is also tricky to code correctly.
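
The walk counts described above can be written down as a small
userspace cost model. This is only a sketch and not kernel code: the
enum and function names are made up for illustration, and the model
ignores retries caused by mmu notifier invalidations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy cost model, not kernel code. All names here are illustrative. */
enum strategy {
	FAULT_THEN_MIGRATE,	/* hmm_range_fault(), then migrate_vma_*() */
	MIGRATE_FAULT_MIGRATE,	/* migrate_vma_*(), fault on miss, migrate again */
	UNIFIED_WALK,		/* proposed: fault and collect in one walk */
};

/* Return true if every page in the range is already present. */
static bool all_present(const bool *present, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		if (!present[i])
			return false;
	return true;
}

/* Number of page-table walks the given strategy needs for the range. */
static unsigned int walks_needed(enum strategy s, const bool *present,
				 size_t npages)
{
	assert(present != NULL || npages == 0);

	switch (s) {
	case FAULT_THEN_MIGRATE:
		return 2;	/* always fault walk + migrate walk */
	case MIGRATE_FAULT_MIGRATE:
		/* one walk if nothing is missing, otherwise three */
		return all_present(present, npages) ? 1 : 3;
	case UNIFIED_WALK:
		return 1;	/* the goal of this series */
	}
	return 0;
}
```

In this model the unified walk matches the best case of the second
sequence without its three-walk worst case.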

We should be able to walk the page table once, faulting
pages in as required and replacing them with migration entries if
requested.

These patches add a new flag to the HMM APIs, HMM_PFN_REQ_MIGRATE,
which tells HMM to also prepare pages for migration during fault
handling. Conversely, for the migrate_vma_setup() call paths, a new
flag, MIGRATE_VMA_FAULT, adds fault handling to the migration.
The original idea came from Alistair.
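
To illustrate the intended semantics of combining fault and migrate
requests in one pass, here is a minimal userspace sketch. It is a toy
model under stated assumptions: the TOY_* flags, the toy_pte states and
toy_walk_once() are hypothetical stand-ins and do not reflect the real
HMM_PFN_REQ_* values or the code in mm/hmm.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request flags for this toy; the real flag values differ. */
#define TOY_REQ_FAULT	(1u << 0)	/* fault in non-present entries */
#define TOY_REQ_MIGRATE	(1u << 1)	/* install migration entries */

/* Simplified page table entry states. */
enum toy_pte { TOY_NONE, TOY_PRESENT, TOY_MIGRATION };

struct toy_result {
	size_t faulted;		/* entries faulted in during the walk */
	size_t collected;	/* entries replaced by migration entries */
};

/*
 * Visit every entry exactly once: fault it in if requested and missing,
 * then replace it with a migration entry if requested and present.
 */
static struct toy_result toy_walk_once(enum toy_pte *ptes, size_t n,
				       unsigned int req)
{
	struct toy_result res = { 0, 0 };

	assert(ptes != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (ptes[i] == TOY_NONE && (req & TOY_REQ_FAULT)) {
			ptes[i] = TOY_PRESENT;
			res.faulted++;
		}
		if (ptes[i] == TOY_PRESENT && (req & TOY_REQ_MIGRATE)) {
			ptes[i] = TOY_MIGRATION;
			res.collected++;
		}
	}
	return res;
}
```

With both flags set, a range containing holes ends up fully collected
after a single loop over the entries. With only the migrate flag set,
the holes are left alone, which corresponds to the existing behaviour
of migrating only present pages.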

These patches are based on 6.16 mainline and pass the HMM
selftests. Support for THP pages from Balbir should not be
too hard to integrate into this, and that is something I am
looking at.

Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>

Mika Penttilä (4):
  mm: use current as mmu notifier's owner
  mm: unified fault and migrate device page paths
  mm:/migrate_device.c: remove migrate_vma_collect_*() functions
  mm: add new testcase for the migrate on fault case

-- 
2.50.0

Re: [RFC PATCH 0/4] Migrate on fault for device pages
Posted by Balbir Singh 1 month, 2 weeks ago
On 8/14/25 17:19, Mika Penttilä wrote:
> Currently, device page faulting and migration do not work
> optimally when you want to do both fault handling and migration
> at once.
> 
> Migrating non-present pages (or pages mapped with incorrect
> permissions, e.g. COW) to the GPU requires one of the following
> sequences:
> 
> 1. hmm_range_fault() - fault in non-present pages with correct
>    permissions, etc.
> 2. migrate_vma_*() - migrate the pages
> 
> Or:
> 
> 1. migrate_vma_*() - migrate present pages
> 2. If non-present pages detected by migrate_vma_*():
>    a) call hmm_range_fault() to fault pages in
>    b) call migrate_vma_*() again to migrate now present pages
> 
> The problem with the first sequence is that you always have to do two
> page walks, even though most of the time the pages are present or zero
> page mappings, so the common case takes a performance hit.
> 
> The second sequence is better for the common case, but far worse if
> pages aren't present because now you have to walk the page tables three
> times (once to find the page is not present, once so hmm_range_fault()
> can find a non-present page to fault in and once again to set up the
> migration). It is also tricky to code correctly.
> 
> We should be able to walk the page table once, faulting
> pages in as required and replacing them with migration entries if
> requested.
> 

The use case makes sense to me, but isn't the sequence always going
to be racy? By the time the pages are faulted in, others could have
been marked non-present. Or do you intend to lock all pages during
this operation?

Balbir
Re: [RFC PATCH 0/4] Migrate on fault for device pages
Posted by Mika Penttilä 1 month, 2 weeks ago
On 8/15/25 14:36, Balbir Singh wrote:

> The use case makes sense to me, but isn't the sequence always going
> to be racy? By the time the pages are faulted in, others could have
> been marked non-present. Or do you intend to lock all pages during
> this operation?
>
> Balbir

Yes, the pages are "collected": each page is locked and a reference is
taken as soon as it is faulted in.

--Mika
