[v1 resend 00/12] THP support for zone device page migration
Posted by Balbir Singh 3 months ago
This patch series adds support for THP migration of zone device pages.
To do so, the patches implement folio support for zone device pages by
adding the ability to set up larger-order pages.

These patches build on the earlier posts by Ralph Campbell [1]

Two new flags are added to the migrate_vma interface to select and mark
compound pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed in as an argument.
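
As an illustration, here is a minimal driver-side sketch of how the new
flag might be used (hypothetical snippet: the surrounding declarations,
the allocation of the src/dst arrays and error handling are elided, and
only MIGRATE_VMA_SELECT_COMPOUND and MIGRATE_PFN_COMPOUND come from this
series):

	/*
	 * Migrate one PMD-sized, PMD-aligned range of anonymous memory to
	 * device private memory, requesting compound (THP) migration.
	 * src/dst are arrays of HPAGE_PMD_NR unsigned longs allocated by
	 * the driver.
	 */
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= start + HPAGE_PMD_SIZE,
		.src		= src,
		.dst		= dst,
		.pgmap_owner	= mdevice,	/* driver specific owner */
		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
				  MIGRATE_VMA_SELECT_COMPOUND,
	};

	if (migrate_vma_setup(&args))
		return -EAGAIN;

	/*
	 * If src[0] has MIGRATE_PFN_COMPOUND set, the range can be migrated
	 * as a single large folio; populate dst[0] accordingly, otherwise
	 * fall back to order-0 destination pages.
	 */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);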

The series also adds zone device awareness to the (m)THP code paths,
along with fault handling for large zone device private pages. The page
vma walk and rmap code are made zone device aware as well. Support has
also been added for folios that might need to be split in the middle
of migration (when the src and dst do not agree on
MIGRATE_PFN_COMPOUND); this occurs when the src side of the migration
can migrate large pages, but the destination has not been able to
allocate large pages. The code supports and uses folio_split() when
migrating THP pages; this is used when MIGRATE_VMA_SELECT_COMPOUND is
not passed as an argument to migrate_vma_setup().
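
For context, a rough sketch of the driver-side fallback when the
destination cannot allocate a large page (illustrative only;
driver_alloc_device_thp() and driver_alloc_device_page() are
hypothetical stand-ins for the driver's allocators):

	struct page *dpage = driver_alloc_device_thp(mdevice);
	unsigned long i;

	if (dpage) {
		/* One large destination page: keep the compound migration. */
		dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_COMPOUND;
	} else {
		/*
		 * No large page available: populate dst[] with order-0 pages
		 * and leave MIGRATE_PFN_COMPOUND clear, so the source folio
		 * gets split before the copy.
		 */
		for (i = 0; i < HPAGE_PMD_NR; i++) {
			struct page *p = driver_alloc_device_page(mdevice);

			dst[i] = p ? migrate_pfn(page_to_pfn(p)) : 0;
		}
	}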

The test infrastructure in lib/test_hmm.c has been enhanced to support
THP migration. A new ioctl to emulate failure of large page allocations
has been added to test the folio split code path. hmm-tests.c has new
test cases for huge page migration and for the folio split path. A new
throughput test has been added as well.

The nouveau dmem code has been enhanced to use the new THP migration
capability.

Feedback from the RFC [2]:

It was advised that prep_compound_page() not be exposed just for the
purposes of testing (the lib/test_hmm.c test driver). Workarounds that
copy and split the folios did not work due to a lock ordering dependency
in the folio split callback.

mTHP support:

The patches hard-code HPAGE_PMD_NR in a few places, but the code has
been kept generic to support various orders. With additional
refactoring, support for different order sizes should be possible.

The future plan is to post enhancements to support mTHP with a rough
design as follows:

1. Add the notion of allowable THP orders to the HMM-based test driver
2. For non-PMD-based THP paths in migrate_device.c, check whether
   a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for migration
4. Migrate and finalize

The mTHP patches can be built on top of this series; the key design
elements that need to be worked out are infrastructure and driver
support for multiple page orders and their migration.
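
To illustrate steps 2 and 3 above, a hypothetical helper that picks the
highest order supported by both the remaining range and a
driver-advertised bitmask of orders (the helper name and the bitmask are
illustrative, not part of this series):

/*
 * supported_orders: hypothetical per-device bitmask of folio orders the
 * device can allocate, e.g. BIT(0) | BIT(HPAGE_PMD_ORDER).
 */
static unsigned int pick_migrate_order(unsigned long supported_orders,
				       unsigned long nr_remaining)
{
	unsigned int order = min_t(unsigned int, HPAGE_PMD_ORDER,
				   ilog2(nr_remaining));

	while (order && !(supported_orders & BIT(order)))
		order--;

	return order;	/* 0 means fall back to base pages */
}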

References:
[1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/

These patches are built on top of mm-unstable

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>

Changelog v1:
- Changes from the RFC [2] include support for handling fault_folio and
  using trylock in the fault path
- A new test case has been added to measure the throughput improvement
- General refactoring of code to keep up with the changes in mm
- New split folio callback when the entire split is complete/done. The
  callback is used to know when the head order needs to be reset.

Testing:
- Testing was done with ZONE_DEVICE private pages on an x86 VM
- Throughput showed up to a 5x improvement with THP migration; system-to-device
  migration is slower due to the mirroring of data (see buffer->mirror)


Balbir Singh (12):
  mm/zone_device: support large zone device private folios
  mm/migrate_device: flags for selecting device private THP pages
  mm/thp: zone_device awareness in THP handling code
  mm/migrate_device: THP migration of zone device pages
  mm/memory/fault: add support for zone device THP fault handling
  lib/test_hmm: test cases and support for zone device private THP
  mm/memremap: add folio_split support
  mm/thp: add split during migration support
  lib/test_hmm: add test case for split pages
  selftests/mm/hmm-tests: new tests for zone device THP migration
  gpu/drm/nouveau: add THP migration support
  selftests/mm/hmm-tests: new throughput tests including THP

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
 drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 include/linux/huge_mm.h                |  18 +-
 include/linux/memremap.h               |  29 +-
 include/linux/migrate.h                |   2 +
 include/linux/mm.h                     |   1 +
 lib/test_hmm.c                         | 428 +++++++++++++----
 lib/test_hmm_uapi.h                    |   3 +
 mm/huge_memory.c                       | 261 ++++++++---
 mm/memory.c                            |   6 +-
 mm/memremap.c                          |  50 +-
 mm/migrate.c                           |   2 +
 mm/migrate_device.c                    | 488 +++++++++++++++++---
 mm/page_alloc.c                        |   1 +
 mm/page_vma_mapped.c                   |  10 +
 mm/pgtable-generic.c                   |   6 +
 mm/rmap.c                              |  19 +-
 tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
 19 files changed, 1874 insertions(+), 312 deletions(-)

-- 
2.49.0

Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Zi Yan 3 months ago
On 3 Jul 2025, at 19:34, Balbir Singh wrote:

> This patch series adds support for THP migration of zone device pages.
> To do so, the patches implement support for folio zone device pages
> by adding support for setting up larger order pages.
>
> These patches build on the earlier posts by Ralph Campbell [1]
>
> Two new flags are added in vma_migration to select and mark compound pages.
> migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
> support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
> is passed in as arguments.
>
> The series also adds zone device awareness to (m)THP pages along
> with fault handling of large zone device private pages. page vma walk
> and the rmap code is also zone device aware. Support has also been
> added for folios that might need to be split in the middle
> of migration (when the src and dst do not agree on
> MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can
> migrate large pages, but the destination has not been able to allocate
> large pages. The code supported and used folio_split() when migrating
> THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed
> as an argument to migrate_vma_setup().
>
> The test infrastructure lib/test_hmm.c has been enhanced to support THP
> migration. A new ioctl to emulate failure of large page allocations has
> been added to test the folio split code path. hmm-tests.c has new test
> cases for huge page migration and to test the folio split path. A new
> throughput test has been added as well.
>
> The nouveau dmem code has been enhanced to use the new THP migration
> capability.
>
> Feedback from the RFC [2]:
>
> It was advised that prep_compound_page() not be exposed just for the purposes
> of testing (test driver lib/test_hmm.c). Work arounds of copy and split the
> folios did not work due to lock order dependency in the callback for
> split folio.
>
> mTHP support:
>
> The patches hard code, HPAGE_PMD_NR in a few places, but the code has
> been kept generic to support various order sizes. With additional
> refactoring of the code support of different order sizes should be
> possible.
>
> The future plan is to post enhancements to support mTHP with a rough
> design as follows:
>
> 1. Add the notion of allowable thp orders to the HMM based test driver
> 2. For non PMD based THP paths in migrate_device.c, check to see if
>    a suitable order is found and supported by the driver
> 3. Iterate across orders to check the highest supported order for migration
> 4. Migrate and finalize
>
> The mTHP patches can be built on top of this series, the key design elements
> that need to be worked out are infrastructure and driver support for multiple
> ordered pages and their migration.

To help me better review the patches, can you tell me if my mental model below
for device private folios is correct or not?

1. device private folios represent device memory, but the associated PFNs
do not exist in the system. folio->pgmap contains the meta info about
device memory.

2. when data is migrated from system memory to device private memory, a device
private page table entry is established in place of the original entry.
A device private page table entry is a swap entry with a device private type.
And the swap entry points to a device private folio in which the data resides
in the device private memory.

3. when CPU tries to access an address with device private page table entry,
a fault happens and data is migrated from device private memory to system
memory. The device private folio pointed by the device private page table
entry tells driver where to look for the data on the device.

4. one of the reasons for splitting a large device private folio is that,
when a large device private folio is migrated back to system memory,
there may be no free large folio in system memory, so the driver splits
the large device private folio and only migrates a subpage instead.

Thanks.

--
Best Regards,
Yan, Zi
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Balbir Singh 3 months ago
On 7/5/25 02:16, Zi Yan wrote:
> On 3 Jul 2025, at 19:34, Balbir Singh wrote:
> 
>> This patch series adds support for THP migration of zone device pages.
>> To do so, the patches implement support for folio zone device pages
>> by adding support for setting up larger order pages.
>>
>> These patches build on the earlier posts by Ralph Campbell [1]
>>
>> Two new flags are added in vma_migration to select and mark compound pages.
>> migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
>> support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
>> is passed in as arguments.
>>
>> The series also adds zone device awareness to (m)THP pages along
>> with fault handling of large zone device private pages. page vma walk
>> and the rmap code is also zone device aware. Support has also been
>> added for folios that might need to be split in the middle
>> of migration (when the src and dst do not agree on
>> MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can
>> migrate large pages, but the destination has not been able to allocate
>> large pages. The code supported and used folio_split() when migrating
>> THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed
>> as an argument to migrate_vma_setup().
>>
>> The test infrastructure lib/test_hmm.c has been enhanced to support THP
>> migration. A new ioctl to emulate failure of large page allocations has
>> been added to test the folio split code path. hmm-tests.c has new test
>> cases for huge page migration and to test the folio split path. A new
>> throughput test has been added as well.
>>
>> The nouveau dmem code has been enhanced to use the new THP migration
>> capability.
>>
>> Feedback from the RFC [2]:
>>
>> It was advised that prep_compound_page() not be exposed just for the purposes
>> of testing (test driver lib/test_hmm.c). Work arounds of copy and split the
>> folios did not work due to lock order dependency in the callback for
>> split folio.
>>
>> mTHP support:
>>
>> The patches hard code, HPAGE_PMD_NR in a few places, but the code has
>> been kept generic to support various order sizes. With additional
>> refactoring of the code support of different order sizes should be
>> possible.
>>
>> The future plan is to post enhancements to support mTHP with a rough
>> design as follows:
>>
>> 1. Add the notion of allowable thp orders to the HMM based test driver
>> 2. For non PMD based THP paths in migrate_device.c, check to see if
>>    a suitable order is found and supported by the driver
>> 3. Iterate across orders to check the highest supported order for migration
>> 4. Migrate and finalize
>>
>> The mTHP patches can be built on top of this series, the key design elements
>> that need to be worked out are infrastructure and driver support for multiple
>> ordered pages and their migration.
> 
> To help me better review the patches, can you tell me if my mental model below
> for device private folios is correct or not?
> 
> 1. device private folios represent device memory, but the associated PFNs
> do not exist in the system. folio->pgmap contains the meta info about
> device memory.

Yes that is right

> 
> 2. when data is migrated from system memory to device private memory, a device
> private page table entry is established in place of the original entry.
> A device private page table entry is a swap entry with a device private type.
> And the swap entry points to a device private folio in which the data resides
> in the device private memory.
> 

Yes
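
For reference, this is roughly how core mm recognizes such an entry
today (existing upstream helpers, shown only to illustrate the mental
model, not code from this series):

	if (is_swap_pte(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		if (is_device_private_entry(entry)) {
			/* Data resides in device memory, described by this folio. */
			struct folio *folio = pfn_swap_entry_folio(entry);
			struct dev_pagemap *pgmap = folio->pgmap;

			/*
			 * On CPU access, do_swap_page() ends up calling
			 * pgmap->ops->migrate_to_ram(vmf), which lets the
			 * driver migrate the data back to system memory.
			 */
		}
	}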

> 3. when CPU tries to access an address with device private page table entry,
> a fault happens and data is migrated from device private memory to system
> memory. The device private folio pointed by the device private page table
> entry tells driver where to look for the data on the device.
> 
> 4. one of the reasons causing a large device private folio split is that
> when a large device private folio is migrated back to system memory and
> there is no free large folio in system memory. So that driver splits
> the large device private folio and only migrate a subpage instead.
> 

Both points are correct. To add to point 4, it can also happen that
during migration of memory from the system to the device we might need
a split as well; effectively, the destination is unable to allocate a
large page for migration.


Balbir
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by David Hildenbrand 3 months ago
On 04.07.25 01:34, Balbir Singh wrote:

So, I shared some feedback as reply to "[RFC 00/11] THP support for zone 
device pages".

It popped up in my mail box again after there apparently was a 
discussion a couple of days ago.

... and now I realize that this is apparently the same series with 
renamed subject. Gah.

So please, find my feedback there -- most of that should still apply, 
scanning your changelog ...


> Changelog v1:
> - Changes from RFC [2], include support for handling fault_folio and using
>    trylock in the fault path
> - A new test case has been added to measure the throughput improvement
> - General refactoring of code to keep up with the changes in mm
> - New split folio callback when the entire split is complete/done. The
>    callback is used to know when the head order needs to be reset.
> 
> Testing:
> - Testing was done with ZONE_DEVICE private pages on an x86 VM
> - Throughput showed upto 5x improvement with THP migration, system to device
>    migration is slower due to the mirroring of data (see buffer->mirror)
> 
> 
> Balbir Singh (12):
>    mm/zone_device: support large zone device private folios
>    mm/migrate_device: flags for selecting device private THP pages
>    mm/thp: zone_device awareness in THP handling code
>    mm/migrate_device: THP migration of zone device pages
>    mm/memory/fault: add support for zone device THP fault handling
>    lib/test_hmm: test cases and support for zone device private THP
>    mm/memremap: add folio_split support
>    mm/thp: add split during migration support
>    lib/test_hmm: add test case for split pages
>    selftests/mm/hmm-tests: new tests for zone device THP migration
>    gpu/drm/nouveau: add THP migration support
>    selftests/mm/hmm-tests: new throughput tests including THP
> 
>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
>   drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
>   drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
>   include/linux/huge_mm.h                |  18 +-
>   include/linux/memremap.h               |  29 +-
>   include/linux/migrate.h                |   2 +
>   include/linux/mm.h                     |   1 +
>   lib/test_hmm.c                         | 428 +++++++++++++----
>   lib/test_hmm_uapi.h                    |   3 +
>   mm/huge_memory.c                       | 261 ++++++++---
>   mm/memory.c                            |   6 +-
>   mm/memremap.c                          |  50 +-
>   mm/migrate.c                           |   2 +
>   mm/migrate_device.c                    | 488 +++++++++++++++++---
>   mm/page_alloc.c                        |   1 +
>   mm/page_vma_mapped.c                   |  10 +
>   mm/pgtable-generic.c                   |   6 +
>   mm/rmap.c                              |  19 +-
>   tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
>   19 files changed, 1874 insertions(+), 312 deletions(-)
> 


-- 
Cheers,

David / dhildenb
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Balbir Singh 3 months ago
On 7/9/25 00:53, David Hildenbrand wrote:
> On 04.07.25 01:34, Balbir Singh wrote:
> 
> So, I shared some feedback as reply to "[RFC 00/11] THP support for zone device pages".
> 
> It popped up in my mail box again after there apparently was a discussion a couple of days ago.
> 
> ... and now I realize that this is apparently the same series with renamed subject. Gah.
> 
> So please, find my feedback there -- most of that should still apply, scanning your changelog ...
> 
Will do

Thanks for letting me know

Balbir Singh
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Matthew Brost 2 months, 3 weeks ago
On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> This patch series adds support for THP migration of zone device pages.
> To do so, the patches implement support for folio zone device pages
> by adding support for setting up larger order pages.
> 
> These patches build on the earlier posts by Ralph Campbell [1]
> 
> Two new flags are added in vma_migration to select and mark compound pages.
> migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
> support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
> is passed in as arguments.
> 
> The series also adds zone device awareness to (m)THP pages along
> with fault handling of large zone device private pages. page vma walk
> and the rmap code is also zone device aware. Support has also been
> added for folios that might need to be split in the middle
> of migration (when the src and dst do not agree on
> MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can
> migrate large pages, but the destination has not been able to allocate
> large pages. The code supported and used folio_split() when migrating
> THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed
> as an argument to migrate_vma_setup().
> 
> The test infrastructure lib/test_hmm.c has been enhanced to support THP
> migration. A new ioctl to emulate failure of large page allocations has
> been added to test the folio split code path. hmm-tests.c has new test
> cases for huge page migration and to test the folio split path. A new
> throughput test has been added as well.
> 
> The nouveau dmem code has been enhanced to use the new THP migration
> capability.
> 
> Feedback from the RFC [2]:
> 

Thanks for the patches, results look very promising. I wanted to give
some quick feedback:

- You appear to have missed updating hmm_range_fault, specifically
hmm_vma_handle_pmd, to check for device-private entries and populate the
HMM PFNs accordingly. My colleague François has a fix for this if you're
interested.

- I believe copy_huge_pmd also needs to be updated to avoid installing a
migration entry if the swap entry is device-private. I don't have an
exact fix yet due to my limited experience with core MM. The test case
that triggers this is fairly simple: fault in a 2MB device page on the
GPU, then fork a process that reads the page — the kernel crashes in
this scenario.

Matt

> It was advised that prep_compound_page() not be exposed just for the purposes
> of testing (test driver lib/test_hmm.c). Work arounds of copy and split the
> folios did not work due to lock order dependency in the callback for
> split folio.
> 
> mTHP support:
> 
> The patches hard code, HPAGE_PMD_NR in a few places, but the code has
> been kept generic to support various order sizes. With additional
> refactoring of the code support of different order sizes should be
> possible.
> 
> The future plan is to post enhancements to support mTHP with a rough
> design as follows:
> 
> 1. Add the notion of allowable thp orders to the HMM based test driver
> 2. For non PMD based THP paths in migrate_device.c, check to see if
>    a suitable order is found and supported by the driver
> 3. Iterate across orders to check the highest supported order for migration
> 4. Migrate and finalize
> 
> The mTHP patches can be built on top of this series, the key design elements
> that need to be worked out are infrastructure and driver support for multiple
> ordered pages and their migration.
> 
> References:
> [1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
> [2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
> 
> These patches are built on top of mm-unstable
> 
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
> 
> Changelog v1:
> - Changes from RFC [2], include support for handling fault_folio and using
>   trylock in the fault path
> - A new test case has been added to measure the throughput improvement
> - General refactoring of code to keep up with the changes in mm
> - New split folio callback when the entire split is complete/done. The
>   callback is used to know when the head order needs to be reset.
> 
> Testing:
> - Testing was done with ZONE_DEVICE private pages on an x86 VM
> - Throughput showed upto 5x improvement with THP migration, system to device
>   migration is slower due to the mirroring of data (see buffer->mirror)
> 
> 
> Balbir Singh (12):
>   mm/zone_device: support large zone device private folios
>   mm/migrate_device: flags for selecting device private THP pages
>   mm/thp: zone_device awareness in THP handling code
>   mm/migrate_device: THP migration of zone device pages
>   mm/memory/fault: add support for zone device THP fault handling
>   lib/test_hmm: test cases and support for zone device private THP
>   mm/memremap: add folio_split support
>   mm/thp: add split during migration support
>   lib/test_hmm: add test case for split pages
>   selftests/mm/hmm-tests: new tests for zone device THP migration
>   gpu/drm/nouveau: add THP migration support
>   selftests/mm/hmm-tests: new throughput tests including THP
> 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
>  drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
>  drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
>  include/linux/huge_mm.h                |  18 +-
>  include/linux/memremap.h               |  29 +-
>  include/linux/migrate.h                |   2 +
>  include/linux/mm.h                     |   1 +
>  lib/test_hmm.c                         | 428 +++++++++++++----
>  lib/test_hmm_uapi.h                    |   3 +
>  mm/huge_memory.c                       | 261 ++++++++---
>  mm/memory.c                            |   6 +-
>  mm/memremap.c                          |  50 +-
>  mm/migrate.c                           |   2 +
>  mm/migrate_device.c                    | 488 +++++++++++++++++---
>  mm/page_alloc.c                        |   1 +
>  mm/page_vma_mapped.c                   |  10 +
>  mm/pgtable-generic.c                   |   6 +
>  mm/rmap.c                              |  19 +-
>  tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
>  19 files changed, 1874 insertions(+), 312 deletions(-)
> 
> -- 
> 2.49.0
> 
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Balbir Singh 2 months, 3 weeks ago
On 7/18/25 09:40, Matthew Brost wrote:
> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
...
>>
>> The nouveau dmem code has been enhanced to use the new THP migration
>> capability.
>>
>> Feedback from the RFC [2]:
>>
> 
> Thanks for the patches, results look very promising. I wanted to give
> some quick feedback:
> 

Are you seeing improvements with the patchset?

> - You appear to have missed updating hmm_range_fault, specifically
> hmm_vma_handle_pmd, to check for device-private entries and populate the
> HMM PFNs accordingly. My colleague François has a fix for this if you're
> interested.
> 

Sure, please feel free to post them. 

> - I believe copy_huge_pmd also needs to be updated to avoid installing a
> migration entry if the swap entry is device-private. I don't have an
> exact fix yet due to my limited experience with core MM. The test case
> that triggers this is fairly simple: fault in a 2MB device page on the
> GPU, then fork a process that reads the page — the kernel crashes in
> this scenario.
> 

I'd be happy to look at any traces you have or post any fixes you have

Thanks for the feedback
Balbir Singh
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Francois Dugast 2 months, 2 weeks ago
On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> On 7/18/25 09:40, Matthew Brost wrote:
> > On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> ...
> >>
> >> The nouveau dmem code has been enhanced to use the new THP migration
> >> capability.
> >>
> >> Feedback from the RFC [2]:
> >>
> > 
> > Thanks for the patches, results look very promising. I wanted to give
> > some quick feedback:
> > 
> 
> Are you seeing improvements with the patchset?
> 
> > - You appear to have missed updating hmm_range_fault, specifically
> > hmm_vma_handle_pmd, to check for device-private entries and populate the
> > HMM PFNs accordingly. My colleague François has a fix for this if you're
> > interested.
> > 
> 
> Sure, please feel free to post them. 

Hi Balbir,

It seems we are missing this special handling in hmm_vma_walk_pmd():

diff --git a/mm/hmm.c b/mm/hmm.c
index f2415b4b2cdd..449025f72b2f 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -355,6 +355,27 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
        }
 
        if (!pmd_present(pmd)) {
+               swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+               /*
+                * Don't fault in device private pages owned by the caller,
+                * just report the PFNs.
+                */
+               if (is_device_private_entry(entry) &&
+                   pfn_swap_entry_folio(entry)->pgmap->owner ==
+                   range->dev_private_owner) {
+                       unsigned long cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
+                       unsigned long pfn = swp_offset_pfn(entry);
+                       unsigned long i;
+
+                       for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+                               hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+                               hmm_pfns[i] |= pfn | cpu_flags;
+                       }
+
+                       return 0;
+               }
+
                if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
                        return -EFAULT;
                return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);

Francois

> 
> > - I believe copy_huge_pmd also needs to be updated to avoid installing a
> > migration entry if the swap entry is device-private. I don't have an
> > exact fix yet due to my limited experience with core MM. The test case
> > that triggers this is fairly simple: fault in a 2MB device page on the
> > GPU, then fork a process that reads the page — the kernel crashes in
> > this scenario.
> > 
> 
> I'd be happy to look at any traces you have or post any fixes you have
> 
> Thanks for the feedback
> Balbir Singh
> 
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Balbir Singh 2 months, 2 weeks ago
On 7/21/25 21:42, Francois Dugast wrote:
> On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
>> On 7/18/25 09:40, Matthew Brost wrote:
>>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
>> ...
>>>>
>>>> The nouveau dmem code has been enhanced to use the new THP migration
>>>> capability.
>>>>
>>>> Feedback from the RFC [2]:
>>>>
>>>
>>> Thanks for the patches, results look very promising. I wanted to give
>>> some quick feedback:
>>>
>>
>> Are you seeing improvements with the patchset?
>>
>>> - You appear to have missed updating hmm_range_fault, specifically
>>> hmm_vma_handle_pmd, to check for device-private entries and populate the
>>> HMM PFNs accordingly. My colleague François has a fix for this if you're
>>> interested.
>>>
>>
>> Sure, please feel free to post them. 
> 
> Hi Balbir,
> 
> It seems we are missing this special handling in in hmm_vma_walk_pmd():
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f2415b4b2cdd..449025f72b2f 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -355,6 +355,27 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>         }
>  
>         if (!pmd_present(pmd)) {
> +               swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +               /*
> +                * Don't fault in device private pages owned by the caller,
> +                * just report the PFNs.
> +                */
> +               if (is_device_private_entry(entry) &&
> +                   pfn_swap_entry_folio(entry)->pgmap->owner ==
> +                   range->dev_private_owner) {
> +                       unsigned long cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
> +                       unsigned long pfn = swp_offset_pfn(entry);
> +                       unsigned long i;
> +
> +                       for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> +                               hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> +                               hmm_pfns[i] |= pfn | cpu_flags;

Won't we use hmm_pfn_to_map_order()? Do we still need to populate each entry in hmm_pfns[i]?

> +                       }
> +
> +                       return 0;
> +               }
> +

Thanks for the patch! If you could send it with a full sign-off, I can add it to my series while
posting

Balbir Singh


Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Matthew Brost 2 months, 2 weeks ago
On Tue, Jul 22, 2025 at 09:34:13AM +1000, Balbir Singh wrote:
> On 7/21/25 21:42, Francois Dugast wrote:
> > On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> >> On 7/18/25 09:40, Matthew Brost wrote:
> >>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> >> ...
> >>>>
> >>>> The nouveau dmem code has been enhanced to use the new THP migration
> >>>> capability.
> >>>>
> >>>> Feedback from the RFC [2]:
> >>>>
> >>>
> >>> Thanks for the patches, results look very promising. I wanted to give
> >>> some quick feedback:
> >>>
> >>
> >> Are you seeing improvements with the patchset?
> >>
> >>> - You appear to have missed updating hmm_range_fault, specifically
> >>> hmm_vma_handle_pmd, to check for device-private entries and populate the
> >>> HMM PFNs accordingly. My colleague François has a fix for this if you're
> >>> interested.
> >>>
> >>
> >> Sure, please feel free to post them. 
> > 
> > Hi Balbir,
> > 
> > It seems we are missing this special handling in in hmm_vma_walk_pmd():
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index f2415b4b2cdd..449025f72b2f 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -355,6 +355,27 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >         }
> >  
> >         if (!pmd_present(pmd)) {
> > +               swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > +               /*
> > +                * Don't fault in device private pages owned by the caller,
> > +                * just report the PFNs.
> > +                */
> > +               if (is_device_private_entry(entry) &&
> > +                   pfn_swap_entry_folio(entry)->pgmap->owner ==
> > +                   range->dev_private_owner) {
> > +                       unsigned long cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
> > +                       unsigned long pfn = swp_offset_pfn(entry);
> > +                       unsigned long i;
> > +
> > +                       for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > +                               hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > +                               hmm_pfns[i] |= pfn | cpu_flags;
> 
> Won't we use hmm_pfn_to_map_order(), do we still need to populate each entry in the hmm_pfns[i]?
> 

I had the same question. hmm_vma_handle_pmd populates subsequent PFNs as
well, but this seems unnecessary. I removed both of these cases and my
code still worked. However, I haven't audited all kernel callers to
ensure this doesn't cause issues elsewhere.

Matt

> > +                       }
> > +
> > +                       return 0;
> > +               }
> > +
> 
> Thanks for the patch! If you could send it with a full sign-off, I can add it to my series while
> posting
> 
> Balbir Singh
> 
> 
[PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Francois Dugast 2 months, 2 weeks ago
When the PMD swap entry is device private and owned by the caller,
skip the range faulting and instead just set the correct HMM PFNs.
This is similar to the logic for PTEs in hmm_vma_handle_pte().

For now, each hmm_pfns[i] entry is populated as it is currently done
in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
optimization could be to make use of the order and skip populating
subsequent PFNs.

Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
 mm/hmm.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index f2415b4b2cdd..63ec1b18a656 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	}
 
 	if (!pmd_present(pmd)) {
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		/*
+		 * Don't fault in device private pages owned by the caller,
+		 * just report the PFNs.
+		 */
+		if (is_device_private_entry(entry) &&
+		    pfn_swap_entry_folio(entry)->pgmap->owner ==
+		    range->dev_private_owner) {
+			unsigned long cpu_flags = HMM_PFN_VALID |
+				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
+			unsigned long pfn = swp_offset_pfn(entry);
+			unsigned long i;
+
+			if (is_writable_device_private_entry(entry))
+				cpu_flags |= HMM_PFN_WRITE;
+
+			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+				hmm_pfns[i] |= pfn | cpu_flags;
+			}
+
+			return 0;
+		}
+
 		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
 			return -EFAULT;
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
-- 
2.43.0
Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Matthew Brost 2 months ago
On Tue, Jul 22, 2025 at 09:34:45PM +0200, Francois Dugast wrote:
> When the PMD swap entry is device private and owned by the caller,
> skip the range faulting and instead just set the correct HMM PFNs.
> This is similar to the logic for PTEs in hmm_vma_handle_pte().
> 
> For now, each hmm_pfns[i] entry is populated as it is currently done
> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> optimization could be to make use of the order and skip populating
> subsequent PFNs.
> 
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> ---
>  mm/hmm.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f2415b4b2cdd..63ec1b18a656 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  	}
>  
>  	if (!pmd_present(pmd)) {
> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +		/*
> +		 * Don't fault in device private pages owned by the caller,
> +		 * just report the PFNs.
> +		 */
> +		if (is_device_private_entry(entry) &&
> +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> +		    range->dev_private_owner) {
> +			unsigned long cpu_flags = HMM_PFN_VALID |
> +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> +			unsigned long pfn = swp_offset_pfn(entry);
> +			unsigned long i;
> +
> +			if (is_writable_device_private_entry(entry))
> +				cpu_flags |= HMM_PFN_WRITE;
> +
> +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> +				hmm_pfns[i] |= pfn | cpu_flags;
> +			}
> +
> +			return 0;
> +		}
> +
>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>  			return -EFAULT;

I think here, if this is a device private entry
(is_device_private_entry(entry)), we need to call hmm_vma_fault().
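
Roughly something like this (untested sketch of the suggestion, reusing
the locals already available in hmm_vma_walk_pmd()):

		if (is_device_private_entry(entry)) {
			unsigned int required_fault;

			/* Fault the page back to system memory if requested. */
			required_fault = hmm_range_need_fault(hmm_vma_walk,
							      hmm_pfns,
							      npages, 0);
			if (required_fault)
				return hmm_vma_fault(addr, end,
						     required_fault, walk);
		}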

Matt

>  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> -- 
> 2.43.0
>
Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Francois Dugast 2 months ago
On Thu, Aug 07, 2025 at 05:21:45PM -0700, Matthew Brost wrote:
> On Tue, Jul 22, 2025 at 09:34:45PM +0200, Francois Dugast wrote:
> > When the PMD swap entry is device private and owned by the caller,
> > skip the range faulting and instead just set the correct HMM PFNs.
> > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> > 
> > For now, each hmm_pfns[i] entry is populated as it is currently done
> > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > optimization could be to make use of the order and skip populating
> > subsequent PFNs.
> > 
> > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > ---
> >  mm/hmm.c | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index f2415b4b2cdd..63ec1b18a656 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >  	}
> >  
> >  	if (!pmd_present(pmd)) {
> > +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > +		/*
> > +		 * Don't fault in device private pages owned by the caller,
> > +		 * just report the PFNs.
> > +		 */
> > +		if (is_device_private_entry(entry) &&
> > +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> > +		    range->dev_private_owner) {
> > +			unsigned long cpu_flags = HMM_PFN_VALID |
> > +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > +			unsigned long pfn = swp_offset_pfn(entry);
> > +			unsigned long i;
> > +
> > +			if (is_writable_device_private_entry(entry))
> > +				cpu_flags |= HMM_PFN_WRITE;
> > +
> > +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > +				hmm_pfns[i] |= pfn | cpu_flags;
> > +			}
> > +
> > +			return 0;
> > +		}
> > +
> >  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> >  			return -EFAULT;
> 
> I think here if this is a is_device_private_entry(entry), we need to
> call hmm_vma_fault.

Yes that seems needed, thanks for the catch.

Francois

> 
> Matt
> 
> >  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> > -- 
> > 2.43.0
> >
Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Balbir Singh 2 months, 2 weeks ago
On 7/23/25 05:34, Francois Dugast wrote:
> When the PMD swap entry is device private and owned by the caller,
> skip the range faulting and instead just set the correct HMM PFNs.
> This is similar to the logic for PTEs in hmm_vma_handle_pte().
> 
> For now, each hmm_pfns[i] entry is populated as it is currently done
> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> optimization could be to make use of the order and skip populating
> subsequent PFNs.

I think we should test and remove these now

> 
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> ---
>  mm/hmm.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f2415b4b2cdd..63ec1b18a656 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  	}
>  
>  	if (!pmd_present(pmd)) {
> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +		/*
> +		 * Don't fault in device private pages owned by the caller,
> +		 * just report the PFNs.
> +		 */
> +		if (is_device_private_entry(entry) &&
> +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> +		    range->dev_private_owner) {
> +			unsigned long cpu_flags = HMM_PFN_VALID |
> +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> +			unsigned long pfn = swp_offset_pfn(entry);
> +			unsigned long i;
> +
> +			if (is_writable_device_private_entry(entry))
> +				cpu_flags |= HMM_PFN_WRITE;
> +
> +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> +				hmm_pfns[i] |= pfn | cpu_flags;
> +			}
> +

As discussed, can we remove these.

> +			return 0;
> +		}

All of this should be under CONFIG_ARCH_ENABLE_THP_MIGRATION

> +
>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>  			return -EFAULT;
>  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);



Balbir Singh
Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Matthew Brost 2 months, 2 weeks ago
On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> On 7/23/25 05:34, Francois Dugast wrote:
> > When the PMD swap entry is device private and owned by the caller,
> > skip the range faulting and instead just set the correct HMM PFNs.
> > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> > 
> > For now, each hmm_pfns[i] entry is populated as it is currently done
> > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > optimization could be to make use of the order and skip populating
> > subsequent PFNs.
> 
> I think we should test and remove these now
> 

+Jason, Leon – perhaps either of you can provide insight into why
hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
is found.

If we can be assured that changing this won’t break other parts of the
kernel, I agree it should be removed. A snippet of documentation should
also be added indicating that when higher-order PFNs are found,
subsequent PFNs within the range will remain unpopulated. I can verify
that GPU SVM works just fine without these PFNs being populated.

Matt

> > 
> > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > ---
> >  mm/hmm.c | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index f2415b4b2cdd..63ec1b18a656 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >  	}
> >  
> >  	if (!pmd_present(pmd)) {
> > +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > +		/*
> > +		 * Don't fault in device private pages owned by the caller,
> > +		 * just report the PFNs.
> > +		 */
> > +		if (is_device_private_entry(entry) &&
> > +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> > +		    range->dev_private_owner) {
> > +			unsigned long cpu_flags = HMM_PFN_VALID |
> > +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > +			unsigned long pfn = swp_offset_pfn(entry);
> > +			unsigned long i;
> > +
> > +			if (is_writable_device_private_entry(entry))
> > +				cpu_flags |= HMM_PFN_WRITE;
> > +
> > +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > +				hmm_pfns[i] |= pfn | cpu_flags;
> > +			}
> > +
> 
> As discussed, can we remove these.
> 
> > +			return 0;
> > +		}
> 
> All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION
> 
> > +
> >  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> >  			return -EFAULT;
> >  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> 
> 
> 
> Balbir Singh
Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Jason Gunthorpe 2 months, 1 week ago
On Wed, Jul 23, 2025 at 10:02:58PM -0700, Matthew Brost wrote:
> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> > On 7/23/25 05:34, Francois Dugast wrote:
> > > When the PMD swap entry is device private and owned by the caller,
> > > skip the range faulting and instead just set the correct HMM PFNs.
> > > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> > > 
> > > For now, each hmm_pfns[i] entry is populated as it is currently done
> > > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > > optimization could be to make use of the order and skip populating
> > > subsequent PFNs.
> > 
> > I think we should test and remove these now
> > 
> 
> +Jason, Leon – perhaps either of you can provide insight into why
> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
> is found.

I'm not sure why this is buried in some weird unrelated thread;
please post patches normally and CC the right people.

At least the main patch looks reasonable to me, and probably should do
pgd as well while at it?

> If we can be assured that changing this won’t break other parts of the
> kernel, I agree it should be removed. A snippet of documentation should
> also be added indicating that when higher-order PFNs are found,
> subsequent PFNs within the range will remain unpopulated. I can verify
> that GPU SVM works just fine without these PFNs being populated.

We can only do this if someone audits all the current users to confirm
they are compatible. Do that and then it is OK to propose the change.

I think the current version evolved as an optimization so I would not
be surprised to see that some callers need fixing.

Jason
Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Mika Penttilä 2 months, 2 weeks ago
On 7/24/25 08:02, Matthew Brost wrote:
> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
>> On 7/23/25 05:34, Francois Dugast wrote:
>>> When the PMD swap entry is device private and owned by the caller,
>>> skip the range faulting and instead just set the correct HMM PFNs.
>>> This is similar to the logic for PTEs in hmm_vma_handle_pte().
>>>
>>> For now, each hmm_pfns[i] entry is populated as it is currently done
>>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
>>> optimization could be to make use of the order and skip populating
>>> subsequent PFNs.
>> I think we should test and remove these now
>>
> +Jason, Leon – perhaps either of you can provide insight into why
> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
> is found.
>
> If we can be assured that changing this won’t break other parts of the
> kernel, I agree it should be removed. A snippet of documentation should
> also be added indicating that when higher-order PFNs are found,
> subsequent PFNs within the range will remain unpopulated. I can verify
> that GPU SVM works just fine without these PFNs being populated.

afaics the device can consume the range as smaller pages also, and some
hmm users depend on that.


> Matt


--Mika


>
>>> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
>>> ---
>>>  mm/hmm.c | 25 +++++++++++++++++++++++++
>>>  1 file changed, 25 insertions(+)
>>>
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index f2415b4b2cdd..63ec1b18a656 100644
>>> --- a/mm/hmm.c
>>> +++ b/mm/hmm.c
>>> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>>  	}
>>>  
>>>  	if (!pmd_present(pmd)) {
>>> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
>>> +
>>> +		/*
>>> +		 * Don't fault in device private pages owned by the caller,
>>> +		 * just report the PFNs.
>>> +		 */
>>> +		if (is_device_private_entry(entry) &&
>>> +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
>>> +		    range->dev_private_owner) {
>>> +			unsigned long cpu_flags = HMM_PFN_VALID |
>>> +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
>>> +			unsigned long pfn = swp_offset_pfn(entry);
>>> +			unsigned long i;
>>> +
>>> +			if (is_writable_device_private_entry(entry))
>>> +				cpu_flags |= HMM_PFN_WRITE;
>>> +
>>> +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
>>> +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
>>> +				hmm_pfns[i] |= pfn | cpu_flags;
>>> +			}
>>> +
>> As discussed, can we remove these.
>>
>>> +			return 0;
>>> +		}
>> All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION
>>
>>> +
>>>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>>>  			return -EFAULT;
>>>  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
>>
>>
>> Balbir Singh

Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Matthew Brost 2 months, 2 weeks ago
On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote:
> 
> On 7/24/25 08:02, Matthew Brost wrote:
> > On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> >> On 7/23/25 05:34, Francois Dugast wrote:
> >>> When the PMD swap entry is device private and owned by the caller,
> >>> skip the range faulting and instead just set the correct HMM PFNs.
> >>> This is similar to the logic for PTEs in hmm_vma_handle_pte().
> >>>
> >>> For now, each hmm_pfns[i] entry is populated as it is currently done
> >>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> >>> optimization could be to make use of the order and skip populating
> >>> subsequent PFNs.
> >> I think we should test and remove these now
> >>
> > +Jason, Leon – perhaps either of you can provide insight into why
> > hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
> > is found.
> >
> > If we can be assured that changing this won’t break other parts of the
> > kernel, I agree it should be removed. A snippet of documentation should
> > also be added indicating that when higher-order PFNs are found,
> > subsequent PFNs within the range will remain unpopulated. I can verify
> > that GPU SVM works just fine without these PFNs being populated.
> 
> afaics the device can consume the range as smaller pages also, and some
> hmm users depend on that.
> 

Sure, but I think that should be fixed in the device code. If a
large-order PFN is found, the subsequent PFNs can clearly be inferred.
It's a micro-optimization here, but devices or callers capable of
handling this properly shouldn't force a hacky, less optimal behavior on
core code. If anything relies on the current behavior, we should fix it
and ensure correctness.
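
For example, a caller that only looks at the first PFN of a mapping
could infer the rest from the reported order (sketch using existing
helpers; whether every current caller can be converted this way is
exactly the open question):

	unsigned long hmm_pfn = range->hmm_pfns[i];
	unsigned int order = hmm_pfn_to_map_order(hmm_pfn);
	struct page *head = hmm_pfn_to_page(hmm_pfn);

	/*
	 * The 1 << order pages backing this mapping are physically
	 * contiguous starting at 'head', so the entries at
	 * hmm_pfns[i + 1 .. i + (1 << order) - 1] need not be read (or,
	 * with the proposed change, populated) individually.
	 */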

Matt

> 
> > Matt
> 
> 
> --Mika
> 
> 
> >
> >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> >>> ---
> >>>  mm/hmm.c | 25 +++++++++++++++++++++++++
> >>>  1 file changed, 25 insertions(+)
> >>>
> >>> diff --git a/mm/hmm.c b/mm/hmm.c
> >>> index f2415b4b2cdd..63ec1b18a656 100644
> >>> --- a/mm/hmm.c
> >>> +++ b/mm/hmm.c
> >>> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >>>  	}
> >>>  
> >>>  	if (!pmd_present(pmd)) {
> >>> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> >>> +
> >>> +		/*
> >>> +		 * Don't fault in device private pages owned by the caller,
> >>> +		 * just report the PFNs.
> >>> +		 */
> >>> +		if (is_device_private_entry(entry) &&
> >>> +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> >>> +		    range->dev_private_owner) {
> >>> +			unsigned long cpu_flags = HMM_PFN_VALID |
> >>> +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> >>> +			unsigned long pfn = swp_offset_pfn(entry);
> >>> +			unsigned long i;
> >>> +
> >>> +			if (is_writable_device_private_entry(entry))
> >>> +				cpu_flags |= HMM_PFN_WRITE;
> >>> +
> >>> +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> >>> +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> >>> +				hmm_pfns[i] |= pfn | cpu_flags;
> >>> +			}
> >>> +
> >> As discussed, can we remove these.
> >>
> >>> +			return 0;
> >>> +		}
> >> All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION
> >>
> >>> +
> >>>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> >>>  			return -EFAULT;
> >>>  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> >>
> >>
> >> Balbir Singh
> 
Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Mika Penttilä 2 months, 2 weeks ago
On 7/24/25 08:57, Matthew Brost wrote:
> On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote:
>> On 7/24/25 08:02, Matthew Brost wrote:
>>> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
>>>> On 7/23/25 05:34, Francois Dugast wrote:
>>>>> When the PMD swap entry is device private and owned by the caller,
>>>>> skip the range faulting and instead just set the correct HMM PFNs.
>>>>> This is similar to the logic for PTEs in hmm_vma_handle_pte().
>>>>>
>>>>> For now, each hmm_pfns[i] entry is populated as it is currently done
>>>>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
>>>>> optimization could be to make use of the order and skip populating
>>>>> subsequent PFNs.
>>>> I think we should test and remove these now
>>>>
>>> +Jason, Leon – perhaps either of you can provide insight into why
>>> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
>>> is found.
>>>
>>> If we can be assured that changing this won’t break other parts of the
>>> kernel, I agree it should be removed. A snippet of documentation should
>>> also be added indicating that when higher-order PFNs are found,
>>> subsequent PFNs within the range will remain unpopulated. I can verify
>>> that GPU SVM works just fine without these PFNs being populated.
>> afaics the device can consume the range as smaller pages also, and some
>> hmm users depend on that.
>>
> Sure, but I think that should be fixed in the device code. If a
> large-order PFN is found, the subsequent PFNs can clearly be inferred.
> It's a micro-optimization here, but devices or callers capable of
> handling this properly shouldn't force a hacky, less optimal behavior on
> core code. If anything relies on the current behavior, we should fix it
> and ensure correctness.

Yes, sure, device code can be changed, but I meant to say we can't just
delete those lines without breaking existing users.


>
> Matt


--Mika


>
>>> Matt
>>
>> --Mika
>>
>>
>>>>> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
>>>>> ---
>>>>>  mm/hmm.c | 25 +++++++++++++++++++++++++
>>>>>  1 file changed, 25 insertions(+)
>>>>>
>>>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>>>> index f2415b4b2cdd..63ec1b18a656 100644
>>>>> --- a/mm/hmm.c
>>>>> +++ b/mm/hmm.c
>>>>> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>>>>  	}
>>>>>  
>>>>>  	if (!pmd_present(pmd)) {
>>>>> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
>>>>> +
>>>>> +		/*
>>>>> +		 * Don't fault in device private pages owned by the caller,
>>>>> +		 * just report the PFNs.
>>>>> +		 */
>>>>> +		if (is_device_private_entry(entry) &&
>>>>> +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
>>>>> +		    range->dev_private_owner) {
>>>>> +			unsigned long cpu_flags = HMM_PFN_VALID |
>>>>> +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
>>>>> +			unsigned long pfn = swp_offset_pfn(entry);
>>>>> +			unsigned long i;
>>>>> +
>>>>> +			if (is_writable_device_private_entry(entry))
>>>>> +				cpu_flags |= HMM_PFN_WRITE;
>>>>> +
>>>>> +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
>>>>> +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
>>>>> +				hmm_pfns[i] |= pfn | cpu_flags;
>>>>> +			}
>>>>> +
>>>> As discussed, can we remove these.
>>>>
>>>>> +			return 0;
>>>>> +		}
>>>> All of this should be under CONFIG_ARCH_ENABLE_THP_MIGRATION
>>>>
>>>>> +
>>>>>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>>>>>  			return -EFAULT;
>>>>>  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
>>>>
>>>> Balbir Singh

Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Leon Romanovsky 2 months, 2 weeks ago
On Thu, Jul 24, 2025 at 09:04:36AM +0300, Mika Penttilä wrote:
> 
> On 7/24/25 08:57, Matthew Brost wrote:
> > On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote:
> >> On 7/24/25 08:02, Matthew Brost wrote:
> >>> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> >>>> On 7/23/25 05:34, Francois Dugast wrote:
> >>>>> When the PMD swap entry is device private and owned by the caller,
> >>>>> skip the range faulting and instead just set the correct HMM PFNs.
> >>>>> This is similar to the logic for PTEs in hmm_vma_handle_pte().
> >>>>>
> >>>>> For now, each hmm_pfns[i] entry is populated as it is currently done
> >>>>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> >>>>> optimization could be to make use of the order and skip populating
> >>>>> subsequent PFNs.
> >>>> I think we should test and remove these now
> >>>>
> >>> +Jason, Leon – perhaps either of you can provide insight into why
> >>> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
> >>> is found.
> >>>
> >>> If we can be assured that changing this won’t break other parts of the
> >>> kernel, I agree it should be removed. A snippet of documentation should
> >>> also be added indicating that when higher-order PFNs are found,
> >>> subsequent PFNs within the range will remain unpopulated. I can verify
> >>> that GPU SVM works just fine without these PFNs being populated.
> >> afaics the device can consume the range as smaller pages also, and some
> >> hmm users depend on that.
> >>
> > Sure, but I think that should be fixed in the device code. If a
> > large-order PFN is found, the subsequent PFNs can clearly be inferred.
> > It's a micro-optimization here, but devices or callers capable of
> > handling this properly shouldn't force a hacky, less optimal behavior on
> > core code. If anything relies on the current behavior, we should fix it
> > and ensure correctness.
> 
> Yes, sure, the device code can be changed, but I meant to say we can't
> just delete those lines without breaking existing users.

Mika is right. The RDMA subsystem and the HMM users there would need to
be updated.

We have a special flag (IB_ACCESS_HUGETLB) that prepares the whole RDMA
stack to handle large-order PFNs. If this flag is not provided, we need
to fall back to the basic device page size (4k), and for that we expect
a fully populated PFN list.

Thanks
Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Andrew Morton 2 months, 2 weeks ago
On Tue, 22 Jul 2025 21:34:45 +0200 Francois Dugast <francois.dugast@intel.com> wrote:

> When the PMD swap entry is device private and owned by the caller,
> skip the range faulting and instead just set the correct HMM PFNs.
> This is similar to the logic for PTEs in hmm_vma_handle_pte().

Please always tell us why a patch does something, not only what it does.

> For now, each hmm_pfns[i] entry is populated as it is currently done
> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> optimization could be to make use of the order and skip populating
> subsequent PFNs.

I infer from this paragraph that this patch is a performance
optimization?  Have its effects been measured?

> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  	}
>  
>  	if (!pmd_present(pmd)) {
> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +		/*
> +		 * Don't fault in device private pages owned by the caller,
> +		 * just report the PFNs.
> +		 */

Similarly, this tells us "what" it does, which is fairly obvious from
the code itself.  What is not obvious from the code is the "why".

> +		if (is_device_private_entry(entry) &&
> +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> +		    range->dev_private_owner) {
> +			unsigned long cpu_flags = HMM_PFN_VALID |
> +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> +			unsigned long pfn = swp_offset_pfn(entry);
> +			unsigned long i;
> +
> +			if (is_writable_device_private_entry(entry))
> +				cpu_flags |= HMM_PFN_WRITE;
> +
> +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> +				hmm_pfns[i] |= pfn | cpu_flags;
> +			}
> +
> +			return 0;
> +		}
> +
>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>  			return -EFAULT;
>  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Francois Dugast 2 months, 2 weeks ago
On Tue, Jul 22, 2025 at 01:07:21PM -0700, Andrew Morton wrote:
> On Tue, 22 Jul 2025 21:34:45 +0200 Francois Dugast <francois.dugast@intel.com> wrote:
> 
> > When the PMD swap entry is device private and owned by the caller,
> > skip the range faulting and instead just set the correct HMM PFNs.
> > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> 
> Please always tell us why a patch does something, not only what it does.

Sure, let me improve this in the next version.

> 
> > For now, each hmm_pfns[i] entry is populated as it is currently done
> > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > optimization could be to make use of the order and skip populating
> > subsequent PFNs.
> 
> I infer from this paragraph that this patch is a performance
> optimization?  Have its effects been measured?

Yes, this performance optimization would come from avoiding the loop
over the range but it has neither been properly tested nor measured
yet.

> 
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >  	}
> >  
> >  	if (!pmd_present(pmd)) {
> > +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > +		/*
> > +		 * Don't fault in device private pages owned by the caller,
> > +		 * just report the PFNs.
> > +		 */
> 
> Similarly, this tells us "what" it does, which is fairly obvious from
> the code itself.  What is not obvious from the code is the "why".

Indeed, will fix.

> 
> > +		if (is_device_private_entry(entry) &&
> > +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> > +		    range->dev_private_owner) {
> > +			unsigned long cpu_flags = HMM_PFN_VALID |
> > +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > +			unsigned long pfn = swp_offset_pfn(entry);
> > +			unsigned long i;
> > +
> > +			if (is_writable_device_private_entry(entry))
> > +				cpu_flags |= HMM_PFN_WRITE;
> > +
> > +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > +				hmm_pfns[i] |= pfn | cpu_flags;
> > +			}
> > +
> > +			return 0;
> > +		}
> > +
> >  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> >  			return -EFAULT;
> >  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
>
Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Posted by Matthew Brost 2 months, 2 weeks ago
On Wed, Jul 23, 2025 at 05:34:17PM +0200, Francois Dugast wrote:
> On Tue, Jul 22, 2025 at 01:07:21PM -0700, Andrew Morton wrote:
> > On Tue, 22 Jul 2025 21:34:45 +0200 Francois Dugast <francois.dugast@intel.com> wrote:
> > 
> > > When the PMD swap entry is device private and owned by the caller,
> > > skip the range faulting and instead just set the correct HMM PFNs.
> > > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> > 
> > Please always tell us why a patch does something, not only what it does.
> 
> Sure, let me improve this in the next version.
> 
> > 
> > > For now, each hmm_pfns[i] entry is populated as it is currently done
> > > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > > optimization could be to make use of the order and skip populating
> > > subsequent PFNs.
> > 
> > I infer from this paragraph that this patch is a performance
> > optimization?  Have its effects been measured?
> 
> Yes, this performance optimization would come from avoiding the loop
> over the range but it has neither been properly tested nor measured
> yet.
> 

This is also a functional change. Once THP device pages are enabled (for
performance), we will encounter device-private swap entries in PMDs. At
that point, the correct behavior is to populate HMM PFNs from the swap
entry when dev_private_owner matches; otherwise, trigger a fault if the
HMM range-walk input requests one, or skip it in the non-faulting case.

It’s harmless to merge this patch before THP device pages are enabled,
since with the current code base we never find device-private swap
entries in PMDs.

I'd include something like the above explanation in the patch commit
message, or in code comments if needed.

Matt

> > 
> > > --- a/mm/hmm.c
> > > +++ b/mm/hmm.c
> > > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> > >  	}
> > >  
> > >  	if (!pmd_present(pmd)) {
> > > +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> > > +
> > > +		/*
> > > +		 * Don't fault in device private pages owned by the caller,
> > > +		 * just report the PFNs.
> > > +		 */
> > 
> > Similarly, this tells us "what" it does, which is fairly obvious from
> > the code itself.  What is not obvious from the code is the "why".
> 
> Indeed, will fix.
> 
> > 
> > > +		if (is_device_private_entry(entry) &&
> > > +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> > > +		    range->dev_private_owner) {
> > > +			unsigned long cpu_flags = HMM_PFN_VALID |
> > > +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > > +			unsigned long pfn = swp_offset_pfn(entry);
> > > +			unsigned long i;
> > > +
> > > +			if (is_writable_device_private_entry(entry))
> > > +				cpu_flags |= HMM_PFN_WRITE;
> > > +
> > > +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > > +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > > +				hmm_pfns[i] |= pfn | cpu_flags;
> > > +			}
> > > +
> > > +			return 0;
> > > +		}
> > > +
> > >  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> > >  			return -EFAULT;
> > >  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> > 
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Matthew Brost 2 months, 2 weeks ago
On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> On 7/18/25 09:40, Matthew Brost wrote:
> > On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> ...
> >>
> >> The nouveau dmem code has been enhanced to use the new THP migration
> >> capability.
> >>
> >> Feedback from the RFC [2]:
> >>
> > 
> > Thanks for the patches, results look very promising. I wanted to give
> > some quick feedback:
> > 
> 
> Are you seeing improvements with the patchset?
> 
> > - You appear to have missed updating hmm_range_fault, specifically
> > hmm_vma_handle_pmd, to check for device-private entries and populate the
> > HMM PFNs accordingly. My colleague François has a fix for this if you're
> > interested.
> > 
> 
> Sure, please feel free to post them. 
> 
> > - I believe copy_huge_pmd also needs to be updated to avoid installing a
> > migration entry if the swap entry is device-private. I don't have an
> > exact fix yet due to my limited experience with core MM. The test case
> > that triggers this is fairly simple: fault in a 2MB device page on the
> > GPU, then fork a process that reads the page — the kernel crashes in
> > this scenario.
> > 
> 
> I'd be happy to look at any traces you have or post any fixes you have
> 

Ok, I think I have some code that works after slowly reverse-engineering
the core MM code - my test case passes without any warnings / kernel
crashes.

I've included it below. Feel free to include it in your next revision,
modify it as you see fit, or do whatever you like with it.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b2563f35544..1cd6d9a10657 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1773,17 +1773,46 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                swp_entry_t entry = pmd_to_swp_entry(pmd);

                VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
-                               !is_device_private_entry(entry));
-               if (!is_readable_migration_entry(entry)) {
-                       entry = make_readable_migration_entry(
-                                                       swp_offset(entry));
+                         !is_device_private_entry(entry));
+
+               if (!is_device_private_entry(entry) &&
+                   !is_readable_migration_entry(entry)) {
+                       entry = make_readable_migration_entry(swp_offset(entry));
                        pmd = swp_entry_to_pmd(entry);
                        if (pmd_swp_soft_dirty(*src_pmd))
                                pmd = pmd_swp_mksoft_dirty(pmd);
                        if (pmd_swp_uffd_wp(*src_pmd))
                                pmd = pmd_swp_mkuffd_wp(pmd);
                        set_pmd_at(src_mm, addr, src_pmd, pmd);
+               } else if (is_device_private_entry(entry)) {
+                       if (is_writable_device_private_entry(entry)) {
+                               entry = make_readable_device_private_entry(swp_offset(entry));
+
+                               pmd = swp_entry_to_pmd(entry);
+                               if (pmd_swp_soft_dirty(*src_pmd))
+                                       pmd = pmd_swp_mksoft_dirty(pmd);
+                               if (pmd_swp_uffd_wp(*src_pmd))
+                                       pmd = pmd_swp_mkuffd_wp(pmd);
+                               set_pmd_at(src_mm, addr, src_pmd, pmd);
+                       }
+
+                       src_page = pfn_swap_entry_to_page(entry);
+                       VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+                       src_folio = page_folio(src_page);
+
+                       folio_get(src_folio);
+                       if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page,
+                                                                dst_vma, src_vma))) {
+                               /* Page maybe pinned: split and retry the fault on PTEs. */
+                               folio_put(src_folio);
+                               pte_free(dst_mm, pgtable);
+                               spin_unlock(src_ptl);
+                               spin_unlock(dst_ptl);
+                               __split_huge_pmd(src_vma, src_pmd, addr, false);
+                               return -EAGAIN;
+                       }
                }
+
                add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
                mm_inc_nr_ptes(dst_mm);
                pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
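
For reference, the fork-then-read scenario quoted above can be
reproduced against lib/test_hmm.c with roughly the following (a sketch
only; error handling and THP alignment of the buffer are omitted,
HMM_DMIRROR_MIGRATE_TO_DEV and struct hmm_dmirror_cmd come from the
selftests' test_hmm_uapi.h, and the device node path matches what
tools/testing/selftests/mm/hmm-tests.c uses):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include "test_hmm_uapi.h"

int main(void)
{
	unsigned long size = 2UL << 20;		/* one PMD, assuming 4k base pages */
	int fd = open("/dev/hmm_dmirror0", O_RDWR);
	char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct hmm_dmirror_cmd cmd = {
		.addr = (uintptr_t)buf,
		.npages = size >> 12,
	};

	memset(buf, 1, size);				/* fault the range in as a THP */
	ioctl(fd, HMM_DMIRROR_MIGRATE_TO_DEV, &cmd);	/* leaves a device-private PMD */

	if (fork() == 0)		/* fork copies the PMD via copy_huge_pmd() */
		_exit(buf[0]);		/* child read then faults the copied entry */
	wait(NULL);
	return 0;
}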

Matt

> Thanks for the feedback
> Balbir Singh
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Matthew Brost 2 months, 3 weeks ago
On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> On 7/18/25 09:40, Matthew Brost wrote:
> > On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> ...
> >>
> >> The nouveau dmem code has been enhanced to use the new THP migration
> >> capability.
> >>
> >> Feedback from the RFC [2]:
> >>
> > 
> > Thanks for the patches, results look very promising. I wanted to give
> > some quick feedback:
> > 
> 
> Are you seeing improvements with the patchset?
> 

We're nowhere near stable yet, but basic testing shows that CPU time
from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
a 2MB GPU fault. A lot of this improvement also comes from dma-mapping
at 2M granularity for the CPU<->GPU copy rather than mapping 512 4k
pages.
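
(For a sense of where that comes from: the copy setup becomes one
PMD-sized mapping instead of a loop over base pages, roughly as below.
A sketch only; dma_map_page(), HPAGE_PMD_SIZE and HPAGE_PMD_NR are the
usual kernel definitions, while the wrapper functions are illustrative
and error unwinding is omitted.)

#include <linux/dma-mapping.h>
#include <linux/huge_mm.h>
#include <linux/mm.h>

/* One mapping for the whole 2M folio ... */
static dma_addr_t map_copy_thp(struct device *dev, struct page *head)
{
	return dma_map_page(dev, head, 0, HPAGE_PMD_SIZE, DMA_BIDIRECTIONAL);
}

/* ... versus HPAGE_PMD_NR (512) individual base-page mappings. */
static int map_copy_4k(struct device *dev, struct page *head, dma_addr_t *dma)
{
	unsigned long i;

	for (i = 0; i < HPAGE_PMD_NR; i++) {
		dma[i] = dma_map_page(dev, head + i, 0, PAGE_SIZE,
				      DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma[i]))
			return -ENOMEM;	/* unmap of earlier entries omitted */
	}
	return 0;
}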

> > - You appear to have missed updating hmm_range_fault, specifically
> > hmm_vma_handle_pmd, to check for device-private entries and populate the
> > HMM PFNs accordingly. My colleague François has a fix for this if you're
> > interested.
> > 
> 
> Sure, please feel free to post them. 
> 
> > - I believe copy_huge_pmd also needs to be updated to avoid installing a
> > migration entry if the swap entry is device-private. I don't have an
> > exact fix yet due to my limited experience with core MM. The test case
> > that triggers this is fairly simple: fault in a 2MB device page on the
> > GPU, then fork a process that reads the page — the kernel crashes in
> > this scenario.
> > 
> 
> I'd be happy to look at any traces you have or post any fixes you have
> 

I've got it to the point where the kernel doesn't explode, but I still
get warnings like:

[ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
[ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
[ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504

I basically just skip the is_swap_pmd clause if the entry is device
private and let the rest of the function execute. This avoids
installing a migration entry (which isn't required and causes the
crash) and allows the rmap code to run, which flips the pages to
non-anonymous-exclusive (effectively making them copy-on-write(?),
though that doesn't fully apply to device pages). It's not 100% correct
yet, but it's a step in the right direction.

Matt

> Thanks for the feedback
> Balbir Singh
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Balbir Singh 2 months, 2 weeks ago
On 7/18/25 14:57, Matthew Brost wrote:
> On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
>> On 7/18/25 09:40, Matthew Brost wrote:
>>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
>> ...
>>>>
>>>> The nouveau dmem code has been enhanced to use the new THP migration
>>>> capability.
>>>>
>>>> Feedback from the RFC [2]:
>>>>
>>>
>>> Thanks for the patches, results look very promising. I wanted to give
>>> some quick feedback:
>>>
>>
>> Are you seeing improvements with the patchset?
>>
> 
> We're nowhere near stable yet, but basic testing shows that CPU time
> from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
> a 2MB GPU fault. A lot of this improvement also comes from dma-mapping
> at 2M granularity for the CPU<->GPU copy rather than mapping 512 4k
> pages.
> 
>>> - You appear to have missed updating hmm_range_fault, specifically
>>> hmm_vma_handle_pmd, to check for device-private entries and populate the
>>> HMM PFNs accordingly. My colleague François has a fix for this if you're
>>> interested.
>>>
>>
>> Sure, please feel free to post them. 
>>
>>> - I believe copy_huge_pmd also needs to be updated to avoid installing a
>>> migration entry if the swap entry is device-private. I don't have an
>>> exact fix yet due to my limited experience with core MM. The test case
>>> that triggers this is fairly simple: fault in a 2MB device page on the
>>> GPU, then fork a process that reads the page — the kernel crashes in
>>> this scenario.
>>>
>>
>> I'd be happy to look at any traces you have or post any fixes you have
>>
> 
> I've got it to the point where the kernel doesn't explode, but I still
> get warnings like:
> 
> [ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
> [ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
> [ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504
> 
> I basically just skip the is_swap_pmd clause if the entry is device
> private and let the rest of the function execute. This avoids
> installing a migration entry (which isn't required and causes the
> crash) and allows the rmap code to run, which flips the pages to
> non-anonymous-exclusive (effectively making them copy-on-write(?),
> though that doesn't fully apply to device pages). It's not 100% correct
> yet, but it's a step in the right direction.
> 


Thanks, could you post the stack trace as well. This is usually a symptom of 
not freeing up the page table cleanly.

Do you have my latest patches that have

	if (is_swap_pmd(pmdval)) {
		swp_entry_t entry = pmd_to_swp_entry(pmdval);

		if (is_device_private_entry(entry))
			goto nomap;
	}

in __pte_offset_map()?

Balbir Singh
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Matthew Brost 2 months, 2 weeks ago
On Tue, Jul 22, 2025 at 09:48:18AM +1000, Balbir Singh wrote:
> On 7/18/25 14:57, Matthew Brost wrote:
> > On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> >> On 7/18/25 09:40, Matthew Brost wrote:
> >>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> >> ...
> >>>>
> >>>> The nouveau dmem code has been enhanced to use the new THP migration
> >>>> capability.
> >>>>
> >>>> Feedback from the RFC [2]:
> >>>>
> >>>
> >>> Thanks for the patches, results look very promising. I wanted to give
> >>> some quick feedback:
> >>>
> >>
> >> Are you seeing improvements with the patchset?
> >>
> > 
> > We're nowhere near stable yet, but basic testing shows that CPU time
> > from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
> > a 2MB GPU fault. A lot of this improvement also comes from dma-mapping
> > at 2M granularity for the CPU<->GPU copy rather than mapping 512 4k
> > pages.
> > 
> >>> - You appear to have missed updating hmm_range_fault, specifically
> >>> hmm_vma_handle_pmd, to check for device-private entries and populate the
> >>> HMM PFNs accordingly. My colleague François has a fix for this if you're
> >>> interested.
> >>>
> >>
> >> Sure, please feel free to post them. 
> >>
> >>> - I believe copy_huge_pmd also needs to be updated to avoid installing a
> >>> migration entry if the swap entry is device-private. I don't have an
> >>> exact fix yet due to my limited experience with core MM. The test case
> >>> that triggers this is fairly simple: fault in a 2MB device page on the
> >>> GPU, then fork a process that reads the page — the kernel crashes in
> >>> this scenario.
> >>>
> >>
> >> I'd be happy to look at any traces you have or post any fixes you have
> >>
> > 
> > I've got it to the point where the kernel doesn't explode, but I still
> > get warnings like:
> > 
> > [ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
> > [ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
> > [ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504
> > 
> > I basically just skip the is_swap_pmd clause if the entry is device
> > private and let the rest of the function execute. This avoids
> > installing a migration entry (which isn't required and causes the
> > crash) and allows the rmap code to run, which flips the pages to
> > non-anonymous-exclusive (effectively making them copy-on-write(?),
> > though that doesn't fully apply to device pages). It's not 100% correct
> > yet, but it's a step in the right direction.
> > 
> 
> 
> Thanks, could you post the stack trace as well. This is usually a symptom of 
> not freeing up the page table cleanly.
> 

Did you see my reply here [1]? I've got this working cleanly.

I actually have all my tests passing with a few additional core MM
changes. I'll reply shortly to a few other patches with those details
and will also send over a complete set of the core MM changes we've made
to get things stable.

Matt

[1] https://lore.kernel.org/linux-mm/aHrsdvjjliBBdVQm@lstrano-desk.jf.intel.com/

> Do you have my latest patches that have
> 
> 	if (is_swap_pmd(pmdval)) {
> 		swp_entry_t entry = pmd_to_swp_entry(pmdval);
> 
> 		if (is_device_private_entry(entry))
> 			goto nomap;
> 	}
> 
> in __pte_offset_map()?
> 
> Balbir Singh
Re: [v1 resend 00/12] THP support for zone device page migration
Posted by Balbir Singh 2 months, 2 weeks ago
On 7/22/25 10:07, Matthew Brost wrote:
> On Tue, Jul 22, 2025 at 09:48:18AM +1000, Balbir Singh wrote:
>> On 7/18/25 14:57, Matthew Brost wrote:
>>> On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
>>>> On 7/18/25 09:40, Matthew Brost wrote:
>>>>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
>>>> ...
>>>>>>
>>>>>> The nouveau dmem code has been enhanced to use the new THP migration
>>>>>> capability.
>>>>>>
>>>>>> Feedback from the RFC [2]:
>>>>>>
>>>>>
>>>>> Thanks for the patches, results look very promising. I wanted to give
>>>>> some quick feedback:
>>>>>
>>>>
>>>> Are you seeing improvements with the patchset?
>>>>
>>>
>>> We're nowhere near stable yet, but basic testing shows that CPU time
>>> from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
>>> a 2MB GPU fault. A lot of this improvement also comes from dma-mapping
>>> at 2M granularity for the CPU<->GPU copy rather than mapping 512 4k
>>> pages.
>>>
>>>>> - You appear to have missed updating hmm_range_fault, specifically
>>>>> hmm_vma_handle_pmd, to check for device-private entries and populate the
>>>>> HMM PFNs accordingly. My colleague François has a fix for this if you're
>>>>> interested.
>>>>>
>>>>
>>>> Sure, please feel free to post them. 
>>>>
>>>>> - I believe copy_huge_pmd also needs to be updated to avoid installing a
>>>>> migration entry if the swap entry is device-private. I don't have an
>>>>> exact fix yet due to my limited experience with core MM. The test case
>>>>> that triggers this is fairly simple: fault in a 2MB device page on the
>>>>> GPU, then fork a process that reads the page — the kernel crashes in
>>>>> this scenario.
>>>>>
>>>>
>>>> I'd be happy to look at any traces you have or post any fixes you have
>>>>
>>>
>>> I've got it to the point where the kernel doesn't explode, but I still
>>> get warnings like:
>>>
>>> [ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
>>> [ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
>>> [ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504
>>>
>>> I basically just skip the is_swap_pmd clause if the entry is device
>>> private and let the rest of the function execute. This avoids
>>> installing a migration entry (which isn't required and causes the
>>> crash) and allows the rmap code to run, which flips the pages to
>>> non-anonymous-exclusive (effectively making them copy-on-write(?),
>>> though that doesn't fully apply to device pages). It's not 100% correct
>>> yet, but it's a step in the right direction.
>>>
>>
>>
>> Thanks, could you post the stack trace as well. This is usually a symptom of 
>> not freeing up the page table cleanly.
>>
> 
> Did you see my reply here [1]? I've got this working cleanly.
> 
> I actually have all my tests passing with a few additional core MM
> changes. I'll reply shortly to a few other patches with those details
> and will also send over a complete set of the core MM changes we've made
> to get things stable.
> 
> Matt
> 
> [1] https://lore.kernel.org/linux-mm/aHrsdvjjliBBdVQm@lstrano-desk.jf.intel.com/

Sorry I missed it. Checking now!

I'd be happy to hear any numbers you have. In my throughput tests that rely on
page copying (lib/test_hmm.c), I am seeing a 500% improvement in performance and
an 80% improvement in latency.
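
(For reference, a throughput measurement of this kind boils down to
timing round-trip migrations of a large buffer through the dmirror
device, roughly as sketched below. This is a simplified illustration of
the kind of loop such a test might use, not the actual hmm-tests code;
HMM_DMIRROR_MIGRATE_TO_DEV/_TO_SYS and struct hmm_dmirror_cmd come from
test_hmm_uapi.h.)

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <time.h>
#include "test_hmm_uapi.h"

int main(void)
{
	unsigned long size = 256UL << 20;		/* 256MB round trip */
	int fd = open("/dev/hmm_dmirror0", O_RDWR);
	char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct hmm_dmirror_cmd cmd = {
		.addr = (uintptr_t)buf,
		.npages = size >> 12,
	};
	struct timespec t0, t1;
	double secs;

	memset(buf, 1, size);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	ioctl(fd, HMM_DMIRROR_MIGRATE_TO_DEV, &cmd);	/* sys -> device */
	ioctl(fd, HMM_DMIRROR_MIGRATE_TO_SYS, &cmd);	/* device -> sys */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.1f MB/s\n", 2 * (size >> 20) / secs);
	return 0;
}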

Thanks,
Balbir