This patch series adds support for THP migration of zone device pages.
To do so, the patches implement support for folio zone device pages
by adding support for setting up larger order pages.

These patches build on the earlier posts by Ralph Campbell [1]

Two new flags are added to the migrate_vma API to select and mark compound
pages. migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
is passed in as an argument.

The series also adds zone device awareness to (m)THP pages along
with fault handling of large zone device private pages. The page vma walk
and the rmap code are also made zone device aware. Support has also been
added for folios that might need to be split in the middle
of migration (when the src and dst do not agree on
MIGRATE_PFN_COMPOUND), which occurs when the src side of the migration can
migrate large pages but the destination has not been able to allocate
large pages. The code supports and uses folio_split() when migrating
THP pages; this path is taken when MIGRATE_VMA_SELECT_COMPOUND is not
passed as an argument to migrate_vma_setup().

The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and to test the folio split path. A new
throughput test has been added as well.

The nouveau dmem code has been enhanced to use the new THP migration
capability.

Feedback from the RFC [2]:

It was advised that prep_compound_page() not be exposed just for the
purposes of testing (test driver lib/test_hmm.c). Workarounds of copying
and splitting the folios did not work due to a lock order dependency in
the callback for folio split.

mTHP support:

The patches hard-code HPAGE_PMD_NR in a few places, but the code has
been kept generic to support various order sizes. With additional
refactoring, support for different order sizes should be possible.

The future plan is to post enhancements to support mTHP with a rough
design as follows:

1. Add the notion of allowable thp orders to the HMM based test driver
2. For non PMD based THP paths in migrate_device.c, check to see if
   a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for migration
4. Migrate and finalize

The mTHP patches can be built on top of this series; the key design elements
that need to be worked out are infrastructure and driver support for
multiple page orders and their migration.
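[As a rough illustration of the driver-side flow described above, a minimal
sketch might look like the following. MIGRATE_VMA_SELECT_COMPOUND and
MIGRATE_PFN_COMPOUND are the flags introduced by this series;
demo_migrate_vma_to_device(), alloc_device_thp() and copy_to_device() are
illustrative placeholders, not real driver APIs.]

/*
 * Sketch of a driver migrating a PMD-sized range to device memory using
 * the compound selection flag from this series.
 */
static int demo_migrate_vma_to_device(struct vm_area_struct *vma,
				      unsigned long start, unsigned long end,
				      void *pgmap_owner)
{
	/* Static only to keep the sketch short; real drivers allocate these. */
	static unsigned long src_pfns[HPAGE_PMD_NR], dst_pfns[HPAGE_PMD_NR];
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
				  MIGRATE_VMA_SELECT_COMPOUND,
	};
	int ret;

	ret = migrate_vma_setup(&args);
	if (ret || !args.cpages)
		return ret;

	if (args.src[0] & MIGRATE_PFN_COMPOUND) {
		/* The whole PMD range was collected as one large folio. */
		struct page *dpage = alloc_device_thp();	/* placeholder */

		if (dpage)
			args.dst[0] = migrate_pfn(page_to_pfn(dpage)) |
				      MIGRATE_PFN_COMPOUND;
	}

	copy_to_device(&args);		/* placeholder: mirror the data */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}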
References:
[1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/

These patches are built on top of mm-unstable

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>

Changelog v1:
- Changes from RFC [2], include support for handling fault_folio and using
  trylock in the fault path
- A new test case has been added to measure the throughput improvement
- General refactoring of code to keep up with the changes in mm
- New split folio callback when the entire split is complete/done. The
  callback is used to know when the head order needs to be reset.

Testing:
- Testing was done with ZONE_DEVICE private pages on an x86 VM
- Throughput showed up to 5x improvement with THP migration; system to device
  migration is slower due to the mirroring of data (see buffer->mirror)


Balbir Singh (12):
  mm/zone_device: support large zone device private folios
  mm/migrate_device: flags for selecting device private THP pages
  mm/thp: zone_device awareness in THP handling code
  mm/migrate_device: THP migration of zone device pages
  mm/memory/fault: add support for zone device THP fault handling
  lib/test_hmm: test cases and support for zone device private THP
  mm/memremap: add folio_split support
  mm/thp: add split during migration support
  lib/test_hmm: add test case for split pages
  selftests/mm/hmm-tests: new tests for zone device THP migration
  gpu/drm/nouveau: add THP migration support
  selftests/mm/hmm-tests: new throughput tests including THP

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
 drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 include/linux/huge_mm.h                |  18 +-
 include/linux/memremap.h               |  29 +-
 include/linux/migrate.h                |   2 +
 include/linux/mm.h                     |   1 +
 lib/test_hmm.c                         | 428 +++++++++++++----
 lib/test_hmm_uapi.h                    |   3 +
 mm/huge_memory.c                       | 261 ++++++++---
 mm/memory.c                            |   6 +-
 mm/memremap.c                          |  50 +-
 mm/migrate.c                           |   2 +
 mm/migrate_device.c                    | 488 +++++++++++++++++---
 mm/page_alloc.c                        |   1 +
 mm/page_vma_mapped.c                   |  10 +
 mm/pgtable-generic.c                   |   6 +
 mm/rmap.c                              |  19 +-
 tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
 19 files changed, 1874 insertions(+), 312 deletions(-)

-- 
2.49.0
On 3 Jul 2025, at 19:34, Balbir Singh wrote: > This patch series adds support for THP migration of zone device pages. > To do so, the patches implement support for folio zone device pages > by adding support for setting up larger order pages. > > These patches build on the earlier posts by Ralph Campbell [1] > > Two new flags are added in vma_migration to select and mark compound pages. > migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize() > support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND > is passed in as arguments. > > The series also adds zone device awareness to (m)THP pages along > with fault handling of large zone device private pages. page vma walk > and the rmap code is also zone device aware. Support has also been > added for folios that might need to be split in the middle > of migration (when the src and dst do not agree on > MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can > migrate large pages, but the destination has not been able to allocate > large pages. The code supported and used folio_split() when migrating > THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed > as an argument to migrate_vma_setup(). > > The test infrastructure lib/test_hmm.c has been enhanced to support THP > migration. A new ioctl to emulate failure of large page allocations has > been added to test the folio split code path. hmm-tests.c has new test > cases for huge page migration and to test the folio split path. A new > throughput test has been added as well. > > The nouveau dmem code has been enhanced to use the new THP migration > capability. > > Feedback from the RFC [2]: > > It was advised that prep_compound_page() not be exposed just for the purposes > of testing (test driver lib/test_hmm.c). Work arounds of copy and split the > folios did not work due to lock order dependency in the callback for > split folio. > > mTHP support: > > The patches hard code, HPAGE_PMD_NR in a few places, but the code has > been kept generic to support various order sizes. With additional > refactoring of the code support of different order sizes should be > possible. > > The future plan is to post enhancements to support mTHP with a rough > design as follows: > > 1. Add the notion of allowable thp orders to the HMM based test driver > 2. For non PMD based THP paths in migrate_device.c, check to see if > a suitable order is found and supported by the driver > 3. Iterate across orders to check the highest supported order for migration > 4. Migrate and finalize > > The mTHP patches can be built on top of this series, the key design elements > that need to be worked out are infrastructure and driver support for multiple > ordered pages and their migration. To help me better review the patches, can you tell me if my mental model below for device private folios is correct or not? 1. device private folios represent device memory, but the associated PFNs do not exist in the system. folio->pgmap contains the meta info about device memory. 2. when data is migrated from system memory to device private memory, a device private page table entry is established in place of the original entry. A device private page table entry is a swap entry with a device private type. And the swap entry points to a device private folio in which the data resides in the device private memory. 3. when CPU tries to access an address with device private page table entry, a fault happens and data is migrated from device private memory to system memory. 
The device private folio pointed by the device private page table
entry tells driver where to look for the data on the device.

4. one of the reasons causing a large device private folio split is that
when a large device private folio is migrated back to system memory and
there is no free large folio in system memory. So that driver splits
the large device private folio and only migrate a subpage instead.

Thanks.

--
Best Regards,
Yan, Zi
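[As a concrete anchor for point 3 above, the CPU-side dispatch is roughly
the following. This is a condensed sketch of the do_swap_page() pattern,
not the exact upstream code; folio locking and reference counting are
elided, and demo_device_private_fault() is an illustrative name.]

/*
 * Condensed sketch of how a CPU fault on a device-private swap entry is
 * dispatched back to the owning driver.
 */
static vm_fault_t demo_device_private_fault(struct vm_fault *vmf,
					    swp_entry_t entry)
{
	struct page *page = pfn_swap_entry_to_page(entry);

	/*
	 * The device-private page identifies the owning pgmap, and the
	 * driver's migrate_to_ram() callback migrates the data back to
	 * system memory before the PTE is made present again.
	 */
	vmf->page = page;
	return page_folio(page)->pgmap->ops->migrate_to_ram(vmf);
}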
On 7/5/25 02:16, Zi Yan wrote: > On 3 Jul 2025, at 19:34, Balbir Singh wrote: > >> This patch series adds support for THP migration of zone device pages. >> To do so, the patches implement support for folio zone device pages >> by adding support for setting up larger order pages. >> >> These patches build on the earlier posts by Ralph Campbell [1] >> >> Two new flags are added in vma_migration to select and mark compound pages. >> migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize() >> support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND >> is passed in as arguments. >> >> The series also adds zone device awareness to (m)THP pages along >> with fault handling of large zone device private pages. page vma walk >> and the rmap code is also zone device aware. Support has also been >> added for folios that might need to be split in the middle >> of migration (when the src and dst do not agree on >> MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can >> migrate large pages, but the destination has not been able to allocate >> large pages. The code supported and used folio_split() when migrating >> THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed >> as an argument to migrate_vma_setup(). >> >> The test infrastructure lib/test_hmm.c has been enhanced to support THP >> migration. A new ioctl to emulate failure of large page allocations has >> been added to test the folio split code path. hmm-tests.c has new test >> cases for huge page migration and to test the folio split path. A new >> throughput test has been added as well. >> >> The nouveau dmem code has been enhanced to use the new THP migration >> capability. >> >> Feedback from the RFC [2]: >> >> It was advised that prep_compound_page() not be exposed just for the purposes >> of testing (test driver lib/test_hmm.c). Work arounds of copy and split the >> folios did not work due to lock order dependency in the callback for >> split folio. >> >> mTHP support: >> >> The patches hard code, HPAGE_PMD_NR in a few places, but the code has >> been kept generic to support various order sizes. With additional >> refactoring of the code support of different order sizes should be >> possible. >> >> The future plan is to post enhancements to support mTHP with a rough >> design as follows: >> >> 1. Add the notion of allowable thp orders to the HMM based test driver >> 2. For non PMD based THP paths in migrate_device.c, check to see if >> a suitable order is found and supported by the driver >> 3. Iterate across orders to check the highest supported order for migration >> 4. Migrate and finalize >> >> The mTHP patches can be built on top of this series, the key design elements >> that need to be worked out are infrastructure and driver support for multiple >> ordered pages and their migration. > > To help me better review the patches, can you tell me if my mental model below > for device private folios is correct or not? > > 1. device private folios represent device memory, but the associated PFNs > do not exist in the system. folio->pgmap contains the meta info about > device memory. Yes that is right > > 2. when data is migrated from system memory to device private memory, a device > private page table entry is established in place of the original entry. > A device private page table entry is a swap entry with a device private type. > And the swap entry points to a device private folio in which the data resides > in the device private memory. > Yes > 3. 
when CPU tries to access an address with device private page table entry,
> a fault happens and data is migrated from device private memory to system
> memory. The device private folio pointed by the device private page table
> entry tells driver where to look for the data on the device.
>
> 4. one of the reasons causing a large device private folio split is that
> when a large device private folio is migrated back to system memory and
> there is no free large folio in system memory. So that driver splits
> the large device private folio and only migrate a subpage instead.
>

Both the points are correct, to add to point 4, it can so happen that during
migration of memory from system to device, we might need a split as well.
Effectively the destination is unable to allocate a large page for migration

Balbir
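[A short sketch of the fallback Balbir describes: the source entry was
collected as compound, but the destination (device or system memory) cannot
provide a large folio, so the dst entry gets an order-0 page and
MIGRATE_PFN_COMPOUND stays clear; that disagreement between src and dst is
what drives the split during migration. demo_fill_dst() and
alloc_dst_folio() are placeholders, and MIGRATE_PFN_COMPOUND is the flag
added by this series.]

static void demo_fill_dst(struct migrate_vma *args, unsigned long i)
{
	struct folio *folio = NULL;

	if (args->src[i] & MIGRATE_PFN_COMPOUND)
		folio = alloc_dst_folio(HPAGE_PMD_ORDER);	/* placeholder */
	if (folio) {
		args->dst[i] = migrate_pfn(folio_pfn(folio)) |
			       MIGRATE_PFN_COMPOUND;
		return;
	}

	/* Order-0 fallback: MIGRATE_PFN_COMPOUND is deliberately not set. */
	folio = alloc_dst_folio(0);				/* placeholder */
	if (folio)
		args->dst[i] = migrate_pfn(folio_pfn(folio));
}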
On 04.07.25 01:34, Balbir Singh wrote: So, I shared some feedback as reply to "[RFC 00/11] THP support for zone device pages". It popped up in my mail box again after there apparently was a discussion a couple of days ago. ... and now I realize that this is apparently the same series with renamed subject. Gah. So please, find my feedback there -- most of that should still apply, scanning your changelog ... > Changelog v1: > - Changes from RFC [2], include support for handling fault_folio and using > trylock in the fault path > - A new test case has been added to measure the throughput improvement > - General refactoring of code to keep up with the changes in mm > - New split folio callback when the entire split is complete/done. The > callback is used to know when the head order needs to be reset. > > Testing: > - Testing was done with ZONE_DEVICE private pages on an x86 VM > - Throughput showed upto 5x improvement with THP migration, system to device > migration is slower due to the mirroring of data (see buffer->mirror) > > > Balbir Singh (12): > mm/zone_device: support large zone device private folios > mm/migrate_device: flags for selecting device private THP pages > mm/thp: zone_device awareness in THP handling code > mm/migrate_device: THP migration of zone device pages > mm/memory/fault: add support for zone device THP fault handling > lib/test_hmm: test cases and support for zone device private THP > mm/memremap: add folio_split support > mm/thp: add split during migration support > lib/test_hmm: add test case for split pages > selftests/mm/hmm-tests: new tests for zone device THP migration > gpu/drm/nouveau: add THP migration support > selftests/mm/hmm-tests: new throughput tests including THP > > drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++--- > drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +- > drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +- > include/linux/huge_mm.h | 18 +- > include/linux/memremap.h | 29 +- > include/linux/migrate.h | 2 + > include/linux/mm.h | 1 + > lib/test_hmm.c | 428 +++++++++++++---- > lib/test_hmm_uapi.h | 3 + > mm/huge_memory.c | 261 ++++++++--- > mm/memory.c | 6 +- > mm/memremap.c | 50 +- > mm/migrate.c | 2 + > mm/migrate_device.c | 488 +++++++++++++++++--- > mm/page_alloc.c | 1 + > mm/page_vma_mapped.c | 10 + > mm/pgtable-generic.c | 6 + > mm/rmap.c | 19 +- > tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++- > 19 files changed, 1874 insertions(+), 312 deletions(-) > -- Cheers, David / dhildenb
On 7/9/25 00:53, David Hildenbrand wrote:
> On 04.07.25 01:34, Balbir Singh wrote:
>
> So, I shared some feedback as reply to "[RFC 00/11] THP support for zone
> device pages".
>
> It popped up in my mail box again after there apparently was a discussion
> a couple of days ago.
>
> ... and now I realize that this is apparently the same series with renamed
> subject. Gah.
>
> So please, find my feedback there -- most of that should still apply,
> scanning your changelog ...
>

Will do
Thanks for letting me know

Balbir Singh
On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote: > This patch series adds support for THP migration of zone device pages. > To do so, the patches implement support for folio zone device pages > by adding support for setting up larger order pages. > > These patches build on the earlier posts by Ralph Campbell [1] > > Two new flags are added in vma_migration to select and mark compound pages. > migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize() > support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND > is passed in as arguments. > > The series also adds zone device awareness to (m)THP pages along > with fault handling of large zone device private pages. page vma walk > and the rmap code is also zone device aware. Support has also been > added for folios that might need to be split in the middle > of migration (when the src and dst do not agree on > MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can > migrate large pages, but the destination has not been able to allocate > large pages. The code supported and used folio_split() when migrating > THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed > as an argument to migrate_vma_setup(). > > The test infrastructure lib/test_hmm.c has been enhanced to support THP > migration. A new ioctl to emulate failure of large page allocations has > been added to test the folio split code path. hmm-tests.c has new test > cases for huge page migration and to test the folio split path. A new > throughput test has been added as well. > > The nouveau dmem code has been enhanced to use the new THP migration > capability. > > Feedback from the RFC [2]: > Thanks for the patches, results look very promising. I wanted to give some quick feedback: - You appear to have missed updating hmm_range_fault, specifically hmm_vma_handle_pmd, to check for device-private entries and populate the HMM PFNs accordingly. My colleague François has a fix for this if you're interested. - I believe copy_huge_pmd also needs to be updated to avoid installing a migration entry if the swap entry is device-private. I don't have an exact fix yet due to my limited experience with core MM. The test case that triggers this is fairly simple: fault in a 2MB device page on the GPU, then fork a process that reads the page — the kernel crashes in this scenario. Matt > It was advised that prep_compound_page() not be exposed just for the purposes > of testing (test driver lib/test_hmm.c). Work arounds of copy and split the > folios did not work due to lock order dependency in the callback for > split folio. > > mTHP support: > > The patches hard code, HPAGE_PMD_NR in a few places, but the code has > been kept generic to support various order sizes. With additional > refactoring of the code support of different order sizes should be > possible. > > The future plan is to post enhancements to support mTHP with a rough > design as follows: > > 1. Add the notion of allowable thp orders to the HMM based test driver > 2. For non PMD based THP paths in migrate_device.c, check to see if > a suitable order is found and supported by the driver > 3. Iterate across orders to check the highest supported order for migration > 4. Migrate and finalize > > The mTHP patches can be built on top of this series, the key design elements > that need to be worked out are infrastructure and driver support for multiple > ordered pages and their migration. 
> > References: > [1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/ > [2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/ > > These patches are built on top of mm-unstable > > Cc: Karol Herbst <kherbst@redhat.com> > Cc: Lyude Paul <lyude@redhat.com> > Cc: Danilo Krummrich <dakr@kernel.org> > Cc: David Airlie <airlied@gmail.com> > Cc: Simona Vetter <simona@ffwll.ch> > Cc: "Jérôme Glisse" <jglisse@redhat.com> > Cc: Shuah Khan <shuah@kernel.org> > Cc: David Hildenbrand <david@redhat.com> > Cc: Barry Song <baohua@kernel.org> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Ryan Roberts <ryan.roberts@arm.com> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Peter Xu <peterx@redhat.com> > Cc: Zi Yan <ziy@nvidia.com> > Cc: Kefeng Wang <wangkefeng.wang@huawei.com> > Cc: Jane Chu <jane.chu@oracle.com> > Cc: Alistair Popple <apopple@nvidia.com> > Cc: Donet Tom <donettom@linux.ibm.com> > > Changelog v1: > - Changes from RFC [2], include support for handling fault_folio and using > trylock in the fault path > - A new test case has been added to measure the throughput improvement > - General refactoring of code to keep up with the changes in mm > - New split folio callback when the entire split is complete/done. The > callback is used to know when the head order needs to be reset. > > Testing: > - Testing was done with ZONE_DEVICE private pages on an x86 VM > - Throughput showed upto 5x improvement with THP migration, system to device > migration is slower due to the mirroring of data (see buffer->mirror) > > > Balbir Singh (12): > mm/zone_device: support large zone device private folios > mm/migrate_device: flags for selecting device private THP pages > mm/thp: zone_device awareness in THP handling code > mm/migrate_device: THP migration of zone device pages > mm/memory/fault: add support for zone device THP fault handling > lib/test_hmm: test cases and support for zone device private THP > mm/memremap: add folio_split support > mm/thp: add split during migration support > lib/test_hmm: add test case for split pages > selftests/mm/hmm-tests: new tests for zone device THP migration > gpu/drm/nouveau: add THP migration support > selftests/mm/hmm-tests: new throughput tests including THP > > drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++--- > drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +- > drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +- > include/linux/huge_mm.h | 18 +- > include/linux/memremap.h | 29 +- > include/linux/migrate.h | 2 + > include/linux/mm.h | 1 + > lib/test_hmm.c | 428 +++++++++++++---- > lib/test_hmm_uapi.h | 3 + > mm/huge_memory.c | 261 ++++++++--- > mm/memory.c | 6 +- > mm/memremap.c | 50 +- > mm/migrate.c | 2 + > mm/migrate_device.c | 488 +++++++++++++++++--- > mm/page_alloc.c | 1 + > mm/page_vma_mapped.c | 10 + > mm/pgtable-generic.c | 6 + > mm/rmap.c | 19 +- > tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++- > 19 files changed, 1874 insertions(+), 312 deletions(-) > > -- > 2.49.0 >
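[The fork scenario Matthew describes can be reproduced from userspace with
something as small as the sketch below; migrate_range_to_device() is a
hypothetical stand-in for whatever driver-specific call (a GPU kernel
launch, or the lib/test_hmm migrate ioctl) moves the 2MB range to
device-private memory.]

/*
 * Userspace shape of the reproducer: populate a 2MB THP, move it to
 * device-private memory, then fork and read from the child so that
 * copy_huge_pmd() has to copy the device-private PMD entry.
 */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define SZ_2M	(2UL << 20)

extern void migrate_range_to_device(void *addr, size_t len); /* hypothetical */

int main(void)
{
	char *buf;
	pid_t pid;

	if (posix_memalign((void **)&buf, SZ_2M, SZ_2M))
		return 1;
	madvise(buf, SZ_2M, MADV_HUGEPAGE);
	memset(buf, 1, SZ_2M);			/* fault in as a THP */

	migrate_range_to_device(buf, SZ_2M);	/* 2MB device-private PMD */

	pid = fork();				/* copy_huge_pmd() runs here */
	if (pid == 0) {
		volatile char c = buf[0];	/* child read faults it back */
		(void)c;
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	return 0;
}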
On 7/18/25 09:40, Matthew Brost wrote:
> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
...
>>
>> The nouveau dmem code has been enhanced to use the new THP migration
>> capability.
>>
>> Feedback from the RFC [2]:
>>
>
> Thanks for the patches, results look very promising. I wanted to give
> some quick feedback:
>

Are you seeing improvements with the patchset?

> - You appear to have missed updating hmm_range_fault, specifically
> hmm_vma_handle_pmd, to check for device-private entries and populate the
> HMM PFNs accordingly. My colleague François has a fix for this if you're
> interested.
>

Sure, please feel free to post them.

> - I believe copy_huge_pmd also needs to be updated to avoid installing a
> migration entry if the swap entry is device-private. I don't have an
> exact fix yet due to my limited experience with core MM. The test case
> that triggers this is fairly simple: fault in a 2MB device page on the
> GPU, then fork a process that reads the page — the kernel crashes in
> this scenario.
>

I'd be happy to look at any traces you have or post any fixes you have

Thanks for the feedback
Balbir Singh
On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> On 7/18/25 09:40, Matthew Brost wrote:
> > On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> ...
> >>
> >> The nouveau dmem code has been enhanced to use the new THP migration
> >> capability.
> >>
> >> Feedback from the RFC [2]:
> >>
> >
> > Thanks for the patches, results look very promising. I wanted to give
> > some quick feedback:
> >
>
> Are you seeing improvements with the patchset?
>
> > - You appear to have missed updating hmm_range_fault, specifically
> > hmm_vma_handle_pmd, to check for device-private entries and populate the
> > HMM PFNs accordingly. My colleague François has a fix for this if you're
> > interested.
> >
>
> Sure, please feel free to post them.

Hi Balbir,

It seems we are missing this special handling in hmm_vma_walk_pmd():

diff --git a/mm/hmm.c b/mm/hmm.c
index f2415b4b2cdd..449025f72b2f 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -355,6 +355,27 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	}
 
 	if (!pmd_present(pmd)) {
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		/*
+		 * Don't fault in device private pages owned by the caller,
+		 * just report the PFNs.
+		 */
+		if (is_device_private_entry(entry) &&
+		    pfn_swap_entry_folio(entry)->pgmap->owner ==
+		    range->dev_private_owner) {
+			unsigned long cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
+			unsigned long pfn = swp_offset_pfn(entry);
+			unsigned long i;
+
+			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+				hmm_pfns[i] |= pfn | cpu_flags;
+			}
+
+			return 0;
+		}
+
 		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
 			return -EFAULT;
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);

Francois

>
> > - I believe copy_huge_pmd also needs to be updated to avoid installing a
> > migration entry if the swap entry is device-private. I don't have an
> > exact fix yet due to my limited experience with core MM. The test case
> > that triggers this is fairly simple: fault in a 2MB device page on the
> > GPU, then fork a process that reads the page — the kernel crashes in
> > this scenario.
> >
>
> I'd be happy to look at any traces you have or post any fixes you have
>
> Thanks for the feedback
> Balbir Singh
>
On 7/21/25 21:42, Francois Dugast wrote: > On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote: >> On 7/18/25 09:40, Matthew Brost wrote: >>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote: >> ... >>>> >>>> The nouveau dmem code has been enhanced to use the new THP migration >>>> capability. >>>> >>>> Feedback from the RFC [2]: >>>> >>> >>> Thanks for the patches, results look very promising. I wanted to give >>> some quick feedback: >>> >> >> Are you seeing improvements with the patchset? >> >>> - You appear to have missed updating hmm_range_fault, specifically >>> hmm_vma_handle_pmd, to check for device-private entries and populate the >>> HMM PFNs accordingly. My colleague François has a fix for this if you're >>> interested. >>> >> >> Sure, please feel free to post them. > > Hi Balbir, > > It seems we are missing this special handling in in hmm_vma_walk_pmd(): > > diff --git a/mm/hmm.c b/mm/hmm.c > index f2415b4b2cdd..449025f72b2f 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -355,6 +355,27 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, > } > > if (!pmd_present(pmd)) { > + swp_entry_t entry = pmd_to_swp_entry(pmd); > + > + /* > + * Don't fault in device private pages owned by the caller, > + * just report the PFNs. > + */ > + if (is_device_private_entry(entry) && > + pfn_swap_entry_folio(entry)->pgmap->owner == > + range->dev_private_owner) { > + unsigned long cpu_flags = pmd_to_hmm_pfn_flags(range, pmd); > + unsigned long pfn = swp_offset_pfn(entry); > + unsigned long i; > + > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; > + hmm_pfns[i] |= pfn | cpu_flags; Won't we use hmm_pfn_to_map_order(), do we still need to populate each entry in the hmm_pfns[i]? > + } > + > + return 0; > + } > + Thanks for the patch! If you could send it with a full sign-off, I can add it to my series while posting Balbir Singh
On Tue, Jul 22, 2025 at 09:34:13AM +1000, Balbir Singh wrote: > On 7/21/25 21:42, Francois Dugast wrote: > > On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote: > >> On 7/18/25 09:40, Matthew Brost wrote: > >>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote: > >> ... > >>>> > >>>> The nouveau dmem code has been enhanced to use the new THP migration > >>>> capability. > >>>> > >>>> Feedback from the RFC [2]: > >>>> > >>> > >>> Thanks for the patches, results look very promising. I wanted to give > >>> some quick feedback: > >>> > >> > >> Are you seeing improvements with the patchset? > >> > >>> - You appear to have missed updating hmm_range_fault, specifically > >>> hmm_vma_handle_pmd, to check for device-private entries and populate the > >>> HMM PFNs accordingly. My colleague François has a fix for this if you're > >>> interested. > >>> > >> > >> Sure, please feel free to post them. > > > > Hi Balbir, > > > > It seems we are missing this special handling in in hmm_vma_walk_pmd(): > > > > diff --git a/mm/hmm.c b/mm/hmm.c > > index f2415b4b2cdd..449025f72b2f 100644 > > --- a/mm/hmm.c > > +++ b/mm/hmm.c > > @@ -355,6 +355,27 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, > > } > > > > if (!pmd_present(pmd)) { > > + swp_entry_t entry = pmd_to_swp_entry(pmd); > > + > > + /* > > + * Don't fault in device private pages owned by the caller, > > + * just report the PFNs. > > + */ > > + if (is_device_private_entry(entry) && > > + pfn_swap_entry_folio(entry)->pgmap->owner == > > + range->dev_private_owner) { > > + unsigned long cpu_flags = pmd_to_hmm_pfn_flags(range, pmd); > > + unsigned long pfn = swp_offset_pfn(entry); > > + unsigned long i; > > + > > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { > > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; > > + hmm_pfns[i] |= pfn | cpu_flags; > > Won't we use hmm_pfn_to_map_order(), do we still need to populate each entry in the hmm_pfns[i]? > I had the same question. hmm_vma_handle_pmd populates subsequent PFNs as well, but this seems unnecessary. I removed both of these cases and my code still worked. However, I haven't audited all kernel callers to ensure this doesn't cause issues elsewhere. Matt > > + } > > + > > + return 0; > > + } > > + > > Thanks for the patch! If you could send it with a full sign-off, I can add it to my series while > posting > > Balbir Singh > >
When the PMD swap entry is device private and owned by the caller,
skip the range faulting and instead just set the correct HMM PFNs.
This is similar to the logic for PTEs in hmm_vma_handle_pte().
For now, each hmm_pfns[i] entry is populated as it is currently done
in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
optimization could be to make use of the order and skip populating
subsequent PFNs.
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
mm/hmm.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/mm/hmm.c b/mm/hmm.c
index f2415b4b2cdd..63ec1b18a656 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	}
 
 	if (!pmd_present(pmd)) {
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		/*
+		 * Don't fault in device private pages owned by the caller,
+		 * just report the PFNs.
+		 */
+		if (is_device_private_entry(entry) &&
+		    pfn_swap_entry_folio(entry)->pgmap->owner ==
+		    range->dev_private_owner) {
+			unsigned long cpu_flags = HMM_PFN_VALID |
+				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
+			unsigned long pfn = swp_offset_pfn(entry);
+			unsigned long i;
+
+			if (is_writable_device_private_entry(entry))
+				cpu_flags |= HMM_PFN_WRITE;
+
+			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+				hmm_pfns[i] |= pfn | cpu_flags;
+			}
+
+			return 0;
+		}
+
 		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
 			return -EFAULT;
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
--
2.43.0
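[For context on the owner check in the patch above: a caller opts in to
having its own device-private PFNs reported, rather than faulted back to
system memory, by setting dev_private_owner on the hmm_range. A minimal
sketch of such a caller follows; mmu-notifier registration and the
read_retry loop are elided, and demo_walk_range() is an illustrative name.]

static int demo_walk_range(struct mm_struct *mm,
			   struct mmu_interval_notifier *notifier,
			   unsigned long start, unsigned long end,
			   unsigned long *pfns, void *owner)
{
	struct hmm_range range = {
		.notifier		= notifier,
		.start			= start,
		.end			= end,
		.hmm_pfns		= pfns,
		.default_flags		= HMM_PFN_REQ_FAULT,
		.dev_private_owner	= owner,   /* must match pgmap->owner */
	};
	int ret;

	range.notifier_seq = mmu_interval_read_begin(notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);

	/* A real caller retries when mmu_interval_read_retry() says so. */
	return ret;
}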
On Tue, Jul 22, 2025 at 09:34:45PM +0200, Francois Dugast wrote: > When the PMD swap entry is device private and owned by the caller, > skip the range faulting and instead just set the correct HMM PFNs. > This is similar to the logic for PTEs in hmm_vma_handle_pte(). > > For now, each hmm_pfns[i] entry is populated as it is currently done > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up > optimization could be to make use of the order and skip populating > subsequent PFNs. > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > --- > mm/hmm.c | 25 +++++++++++++++++++++++++ > 1 file changed, 25 insertions(+) > > diff --git a/mm/hmm.c b/mm/hmm.c > index f2415b4b2cdd..63ec1b18a656 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, > } > > if (!pmd_present(pmd)) { > + swp_entry_t entry = pmd_to_swp_entry(pmd); > + > + /* > + * Don't fault in device private pages owned by the caller, > + * just report the PFNs. > + */ > + if (is_device_private_entry(entry) && > + pfn_swap_entry_folio(entry)->pgmap->owner == > + range->dev_private_owner) { > + unsigned long cpu_flags = HMM_PFN_VALID | > + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT); > + unsigned long pfn = swp_offset_pfn(entry); > + unsigned long i; > + > + if (is_writable_device_private_entry(entry)) > + cpu_flags |= HMM_PFN_WRITE; > + > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; > + hmm_pfns[i] |= pfn | cpu_flags; > + } > + > + return 0; > + } > + > if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) > return -EFAULT; I think here if this is a is_device_private_entry(entry), we need to call hmm_vma_fault. Matt > return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); > -- > 2.43.0 >
On Thu, Aug 07, 2025 at 05:21:45PM -0700, Matthew Brost wrote: > On Tue, Jul 22, 2025 at 09:34:45PM +0200, Francois Dugast wrote: > > When the PMD swap entry is device private and owned by the caller, > > skip the range faulting and instead just set the correct HMM PFNs. > > This is similar to the logic for PTEs in hmm_vma_handle_pte(). > > > > For now, each hmm_pfns[i] entry is populated as it is currently done > > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up > > optimization could be to make use of the order and skip populating > > subsequent PFNs. > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > --- > > mm/hmm.c | 25 +++++++++++++++++++++++++ > > 1 file changed, 25 insertions(+) > > > > diff --git a/mm/hmm.c b/mm/hmm.c > > index f2415b4b2cdd..63ec1b18a656 100644 > > --- a/mm/hmm.c > > +++ b/mm/hmm.c > > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, > > } > > > > if (!pmd_present(pmd)) { > > + swp_entry_t entry = pmd_to_swp_entry(pmd); > > + > > + /* > > + * Don't fault in device private pages owned by the caller, > > + * just report the PFNs. > > + */ > > + if (is_device_private_entry(entry) && > > + pfn_swap_entry_folio(entry)->pgmap->owner == > > + range->dev_private_owner) { > > + unsigned long cpu_flags = HMM_PFN_VALID | > > + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT); > > + unsigned long pfn = swp_offset_pfn(entry); > > + unsigned long i; > > + > > + if (is_writable_device_private_entry(entry)) > > + cpu_flags |= HMM_PFN_WRITE; > > + > > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { > > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; > > + hmm_pfns[i] |= pfn | cpu_flags; > > + } > > + > > + return 0; > > + } > > + > > if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) > > return -EFAULT; > > I think here if this is a is_device_private_entry(entry), we need to > call hmm_vma_fault. Yes that seems needed, thanks for the catch. Francois > > Matt > > > return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); > > -- > > 2.43.0 > >
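[Concretely, the fallback agreed on here might take roughly the following
shape inside the !pmd_present() branch. This is an untested sketch that
reuses the locals already in scope in hmm_vma_walk_pmd(), not the final
patch.]

		/*
		 * Untested sketch: fault non-owned device-private PMD entries
		 * back to system memory instead of returning -EFAULT.
		 */
		unsigned int required_fault;

		required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns,
						      npages, 0);
		if (required_fault) {
			if (is_device_private_entry(entry))
				return hmm_vma_fault(addr, end, required_fault,
						     walk);
			return -EFAULT;
		}
		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);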
On 7/23/25 05:34, Francois Dugast wrote: > When the PMD swap entry is device private and owned by the caller, > skip the range faulting and instead just set the correct HMM PFNs. > This is similar to the logic for PTEs in hmm_vma_handle_pte(). > > For now, each hmm_pfns[i] entry is populated as it is currently done > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up > optimization could be to make use of the order and skip populating > subsequent PFNs. I think we should test and remove these now > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > --- > mm/hmm.c | 25 +++++++++++++++++++++++++ > 1 file changed, 25 insertions(+) > > diff --git a/mm/hmm.c b/mm/hmm.c > index f2415b4b2cdd..63ec1b18a656 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, > } > > if (!pmd_present(pmd)) { > + swp_entry_t entry = pmd_to_swp_entry(pmd); > + > + /* > + * Don't fault in device private pages owned by the caller, > + * just report the PFNs. > + */ > + if (is_device_private_entry(entry) && > + pfn_swap_entry_folio(entry)->pgmap->owner == > + range->dev_private_owner) { > + unsigned long cpu_flags = HMM_PFN_VALID | > + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT); > + unsigned long pfn = swp_offset_pfn(entry); > + unsigned long i; > + > + if (is_writable_device_private_entry(entry)) > + cpu_flags |= HMM_PFN_WRITE; > + > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; > + hmm_pfns[i] |= pfn | cpu_flags; > + } > + As discussed, can we remove these. > + return 0; > + } All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION > + > if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) > return -EFAULT; > return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); Balbir Singh
On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote: > On 7/23/25 05:34, Francois Dugast wrote: > > When the PMD swap entry is device private and owned by the caller, > > skip the range faulting and instead just set the correct HMM PFNs. > > This is similar to the logic for PTEs in hmm_vma_handle_pte(). > > > > For now, each hmm_pfns[i] entry is populated as it is currently done > > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up > > optimization could be to make use of the order and skip populating > > subsequent PFNs. > > I think we should test and remove these now > +Jason, Leon – perhaps either of you can provide insight into why hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page is found. If we can be assured that changing this won’t break other parts of the kernel, I agree it should be removed. A snippet of documentation should also be added indicating that when higher-order PFNs are found, subsequent PFNs within the range will remain unpopulated. I can verify that GPU SVM works just fine without these PFNs being populated. Matt > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > --- > > mm/hmm.c | 25 +++++++++++++++++++++++++ > > 1 file changed, 25 insertions(+) > > > > diff --git a/mm/hmm.c b/mm/hmm.c > > index f2415b4b2cdd..63ec1b18a656 100644 > > --- a/mm/hmm.c > > +++ b/mm/hmm.c > > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, > > } > > > > if (!pmd_present(pmd)) { > > + swp_entry_t entry = pmd_to_swp_entry(pmd); > > + > > + /* > > + * Don't fault in device private pages owned by the caller, > > + * just report the PFNs. > > + */ > > + if (is_device_private_entry(entry) && > > + pfn_swap_entry_folio(entry)->pgmap->owner == > > + range->dev_private_owner) { > > + unsigned long cpu_flags = HMM_PFN_VALID | > > + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT); > > + unsigned long pfn = swp_offset_pfn(entry); > > + unsigned long i; > > + > > + if (is_writable_device_private_entry(entry)) > > + cpu_flags |= HMM_PFN_WRITE; > > + > > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { > > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; > > + hmm_pfns[i] |= pfn | cpu_flags; > > + } > > + > > As discussed, can we remove these. > > > + return 0; > > + } > > All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION > > > + > > if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) > > return -EFAULT; > > return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); > > > > Balbir Singh
On Wed, Jul 23, 2025 at 10:02:58PM -0700, Matthew Brost wrote:
> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> > On 7/23/25 05:34, Francois Dugast wrote:
> > > When the PMD swap entry is device private and owned by the caller,
> > > skip the range faulting and instead just set the correct HMM PFNs.
> > > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> > >
> > > For now, each hmm_pfns[i] entry is populated as it is currently done
> > > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > > optimization could be to make use of the order and skip populating
> > > subsequent PFNs.
> >
> > I think we should test and remove these now
> >
>
> +Jason, Leon – perhaps either of you can provide insight into why
> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
> is found.

I'm not sure why this is buried in some weird unrelated thread, please
post patches normally and CC the right people.

At least the main patch looks reasonable to me, and probably should do
pgd as well while at it?

> If we can be assured that changing this won’t break other parts of the
> kernel, I agree it should be removed. A snippet of documentation should
> also be added indicating that when higher-order PFNs are found,
> subsequent PFNs within the range will remain unpopulated. I can verify
> that GPU SVM works just fine without these PFNs being populated.

We can only do this if someone audits all the current users to confirm
they are compatible. Do that and then it is OK to propose the change.

I think the current version evolved as an optimization so I would not be
surprised to see that some callers need fixing.

Jason
On 7/24/25 08:02, Matthew Brost wrote: > On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote: >> On 7/23/25 05:34, Francois Dugast wrote: >>> When the PMD swap entry is device private and owned by the caller, >>> skip the range faulting and instead just set the correct HMM PFNs. >>> This is similar to the logic for PTEs in hmm_vma_handle_pte(). >>> >>> For now, each hmm_pfns[i] entry is populated as it is currently done >>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up >>> optimization could be to make use of the order and skip populating >>> subsequent PFNs. >> I think we should test and remove these now >> > +Jason, Leon – perhaps either of you can provide insight into why > hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page > is found. > > If we can be assured that changing this won’t break other parts of the > kernel, I agree it should be removed. A snippet of documentation should > also be added indicating that when higher-order PFNs are found, > subsequent PFNs within the range will remain unpopulated. I can verify > that GPU SVM works just fine without these PFNs being populated. afaics the device can consume the range as smaller pages also, and some hmm users depend on that. > Matt --Mika > >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com> >>> --- >>> mm/hmm.c | 25 +++++++++++++++++++++++++ >>> 1 file changed, 25 insertions(+) >>> >>> diff --git a/mm/hmm.c b/mm/hmm.c >>> index f2415b4b2cdd..63ec1b18a656 100644 >>> --- a/mm/hmm.c >>> +++ b/mm/hmm.c >>> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, >>> } >>> >>> if (!pmd_present(pmd)) { >>> + swp_entry_t entry = pmd_to_swp_entry(pmd); >>> + >>> + /* >>> + * Don't fault in device private pages owned by the caller, >>> + * just report the PFNs. >>> + */ >>> + if (is_device_private_entry(entry) && >>> + pfn_swap_entry_folio(entry)->pgmap->owner == >>> + range->dev_private_owner) { >>> + unsigned long cpu_flags = HMM_PFN_VALID | >>> + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT); >>> + unsigned long pfn = swp_offset_pfn(entry); >>> + unsigned long i; >>> + >>> + if (is_writable_device_private_entry(entry)) >>> + cpu_flags |= HMM_PFN_WRITE; >>> + >>> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { >>> + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; >>> + hmm_pfns[i] |= pfn | cpu_flags; >>> + } >>> + >> As discussed, can we remove these. >> >>> + return 0; >>> + } >> All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION >> >>> + >>> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) >>> return -EFAULT; >>> return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); >> >> >> Balbir Singh
On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote: > > On 7/24/25 08:02, Matthew Brost wrote: > > On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote: > >> On 7/23/25 05:34, Francois Dugast wrote: > >>> When the PMD swap entry is device private and owned by the caller, > >>> skip the range faulting and instead just set the correct HMM PFNs. > >>> This is similar to the logic for PTEs in hmm_vma_handle_pte(). > >>> > >>> For now, each hmm_pfns[i] entry is populated as it is currently done > >>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up > >>> optimization could be to make use of the order and skip populating > >>> subsequent PFNs. > >> I think we should test and remove these now > >> > > +Jason, Leon – perhaps either of you can provide insight into why > > hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page > > is found. > > > > If we can be assured that changing this won’t break other parts of the > > kernel, I agree it should be removed. A snippet of documentation should > > also be added indicating that when higher-order PFNs are found, > > subsequent PFNs within the range will remain unpopulated. I can verify > > that GPU SVM works just fine without these PFNs being populated. > > afaics the device can consume the range as smaller pages also, and some > hmm users depend on that. > Sure, but I think that should be fixed in the device code. If a large-order PFN is found, the subsequent PFNs can clearly be inferred. It's a micro-optimization here, but devices or callers capable of handling this properly shouldn't force a hacky, less optimal behavior on core code. If anything relies on the current behavior, we should fix it and ensure correctness. Matt > > > Matt > > > --Mika > > > > > >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com> > >>> --- > >>> mm/hmm.c | 25 +++++++++++++++++++++++++ > >>> 1 file changed, 25 insertions(+) > >>> > >>> diff --git a/mm/hmm.c b/mm/hmm.c > >>> index f2415b4b2cdd..63ec1b18a656 100644 > >>> --- a/mm/hmm.c > >>> +++ b/mm/hmm.c > >>> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, > >>> } > >>> > >>> if (!pmd_present(pmd)) { > >>> + swp_entry_t entry = pmd_to_swp_entry(pmd); > >>> + > >>> + /* > >>> + * Don't fault in device private pages owned by the caller, > >>> + * just report the PFNs. > >>> + */ > >>> + if (is_device_private_entry(entry) && > >>> + pfn_swap_entry_folio(entry)->pgmap->owner == > >>> + range->dev_private_owner) { > >>> + unsigned long cpu_flags = HMM_PFN_VALID | > >>> + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT); > >>> + unsigned long pfn = swp_offset_pfn(entry); > >>> + unsigned long i; > >>> + > >>> + if (is_writable_device_private_entry(entry)) > >>> + cpu_flags |= HMM_PFN_WRITE; > >>> + > >>> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { > >>> + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; > >>> + hmm_pfns[i] |= pfn | cpu_flags; > >>> + } > >>> + > >> As discussed, can we remove these. > >> > >>> + return 0; > >>> + } > >> All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION > >> > >>> + > >>> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) > >>> return -EFAULT; > >>> return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); > >> > >> > >> Balbir Singh >
On 7/24/25 08:57, Matthew Brost wrote: > On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote: >> On 7/24/25 08:02, Matthew Brost wrote: >>> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote: >>>> On 7/23/25 05:34, Francois Dugast wrote: >>>>> When the PMD swap entry is device private and owned by the caller, >>>>> skip the range faulting and instead just set the correct HMM PFNs. >>>>> This is similar to the logic for PTEs in hmm_vma_handle_pte(). >>>>> >>>>> For now, each hmm_pfns[i] entry is populated as it is currently done >>>>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up >>>>> optimization could be to make use of the order and skip populating >>>>> subsequent PFNs. >>>> I think we should test and remove these now >>>> >>> +Jason, Leon – perhaps either of you can provide insight into why >>> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page >>> is found. >>> >>> If we can be assured that changing this won’t break other parts of the >>> kernel, I agree it should be removed. A snippet of documentation should >>> also be added indicating that when higher-order PFNs are found, >>> subsequent PFNs within the range will remain unpopulated. I can verify >>> that GPU SVM works just fine without these PFNs being populated. >> afaics the device can consume the range as smaller pages also, and some >> hmm users depend on that. >> > Sure, but I think that should be fixed in the device code. If a > large-order PFN is found, the subsequent PFNs can clearly be inferred. > It's a micro-optimization here, but devices or callers capable of > handling this properly shouldn't force a hacky, less optimal behavior on > core code. If anything relies on the current behavior, we should fix it > and ensure correctness. Yes sure device code can be changed but meant to say we can't just delete those lines without breaking existing users. > > Matt --Mika > >>> Matt >> >> --Mika >> >> >>>>> Signed-off-by: Francois Dugast <francois.dugast@intel.com> >>>>> --- >>>>> mm/hmm.c | 25 +++++++++++++++++++++++++ >>>>> 1 file changed, 25 insertions(+) >>>>> >>>>> diff --git a/mm/hmm.c b/mm/hmm.c >>>>> index f2415b4b2cdd..63ec1b18a656 100644 >>>>> --- a/mm/hmm.c >>>>> +++ b/mm/hmm.c >>>>> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, >>>>> } >>>>> >>>>> if (!pmd_present(pmd)) { >>>>> + swp_entry_t entry = pmd_to_swp_entry(pmd); >>>>> + >>>>> + /* >>>>> + * Don't fault in device private pages owned by the caller, >>>>> + * just report the PFNs. >>>>> + */ >>>>> + if (is_device_private_entry(entry) && >>>>> + pfn_swap_entry_folio(entry)->pgmap->owner == >>>>> + range->dev_private_owner) { >>>>> + unsigned long cpu_flags = HMM_PFN_VALID | >>>>> + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT); >>>>> + unsigned long pfn = swp_offset_pfn(entry); >>>>> + unsigned long i; >>>>> + >>>>> + if (is_writable_device_private_entry(entry)) >>>>> + cpu_flags |= HMM_PFN_WRITE; >>>>> + >>>>> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { >>>>> + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; >>>>> + hmm_pfns[i] |= pfn | cpu_flags; >>>>> + } >>>>> + >>>> As discussed, can we remove these. >>>> >>>>> + return 0; >>>>> + } >>>> All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION >>>> >>>>> + >>>>> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) >>>>> return -EFAULT; >>>>> return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); >>>> >>>> Balbir Singh
On Thu, Jul 24, 2025 at 09:04:36AM +0300, Mika Penttilä wrote: > > On 7/24/25 08:57, Matthew Brost wrote: > > On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote: > >> On 7/24/25 08:02, Matthew Brost wrote: > >>> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote: > >>>> On 7/23/25 05:34, Francois Dugast wrote: > >>>>> When the PMD swap entry is device private and owned by the caller, > >>>>> skip the range faulting and instead just set the correct HMM PFNs. > >>>>> This is similar to the logic for PTEs in hmm_vma_handle_pte(). > >>>>> > >>>>> For now, each hmm_pfns[i] entry is populated as it is currently done > >>>>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up > >>>>> optimization could be to make use of the order and skip populating > >>>>> subsequent PFNs. > >>>> I think we should test and remove these now > >>>> > >>> +Jason, Leon – perhaps either of you can provide insight into why > >>> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page > >>> is found. > >>> > >>> If we can be assured that changing this won’t break other parts of the > >>> kernel, I agree it should be removed. A snippet of documentation should > >>> also be added indicating that when higher-order PFNs are found, > >>> subsequent PFNs within the range will remain unpopulated. I can verify > >>> that GPU SVM works just fine without these PFNs being populated. > >> afaics the device can consume the range as smaller pages also, and some > >> hmm users depend on that. > >> > > Sure, but I think that should be fixed in the device code. If a > > large-order PFN is found, the subsequent PFNs can clearly be inferred. > > It's a micro-optimization here, but devices or callers capable of > > handling this properly shouldn't force a hacky, less optimal behavior on > > core code. If anything relies on the current behavior, we should fix it > > and ensure correctness. > > Yes sure device code can be changed but meant to say we can't just > delete those lines without breaking existing users. Mika is right. RDMA subsystem and HMM users there need to be updated. We have special flag (IB_ACCESS_HUGETLB) that prepare whole RDMA stack to handle large order PFNs. If this flag is not provided, we need to fallback to basic device page size (4k) and for that we expect fully populated PFN list. Thanks
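[For what the trade-off being debated looks like on the consumer side, here
is a rough sketch of a caller that can only handle base pages: it can
expand a single high-order entry itself using the existing
hmm_pfn_to_map_order() and hmm_pfn_to_page() helpers, assuming the range is
aligned to the reported order. demo_consume() and dma_map_one_page() are
placeholders for the caller's per-page work.]

static void demo_consume(unsigned long *hmm_pfns, unsigned long npages)
{
	unsigned long i = 0;

	while (i < npages) {
		unsigned int order;
		unsigned long j, n;
		struct page *page;

		if (!(hmm_pfns[i] & HMM_PFN_VALID)) {
			i++;
			continue;
		}

		order = hmm_pfn_to_map_order(hmm_pfns[i]);
		n = min(1UL << order, npages - i);
		page = hmm_pfn_to_page(hmm_pfns[i]);

		/* Pages within one high-order folio are contiguous. */
		for (j = 0; j < n; j++)
			dma_map_one_page(page + j);	/* placeholder */

		i += n;
	}
}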
On Tue, 22 Jul 2025 21:34:45 +0200 Francois Dugast <francois.dugast@intel.com> wrote: > When the PMD swap entry is device private and owned by the caller, > skip the range faulting and instead just set the correct HMM PFNs. > This is similar to the logic for PTEs in hmm_vma_handle_pte(). Please always tell us why a patch does something, not only what it does. > For now, each hmm_pfns[i] entry is populated as it is currently done > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up > optimization could be to make use of the order and skip populating > subsequent PFNs. I infer from this paragraph that this patch is a performance optimization? Have its effects been measured? > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, > } > > if (!pmd_present(pmd)) { > + swp_entry_t entry = pmd_to_swp_entry(pmd); > + > + /* > + * Don't fault in device private pages owned by the caller, > + * just report the PFNs. > + */ Similarly, this tells us "what" it does, which is fairly obvious from the code itself. What is not obvious from the code is the "why". > + if (is_device_private_entry(entry) && > + pfn_swap_entry_folio(entry)->pgmap->owner == > + range->dev_private_owner) { > + unsigned long cpu_flags = HMM_PFN_VALID | > + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT); > + unsigned long pfn = swp_offset_pfn(entry); > + unsigned long i; > + > + if (is_writable_device_private_entry(entry)) > + cpu_flags |= HMM_PFN_WRITE; > + > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; > + hmm_pfns[i] |= pfn | cpu_flags; > + } > + > + return 0; > + } > + > if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) > return -EFAULT; > return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
On Tue, Jul 22, 2025 at 01:07:21PM -0700, Andrew Morton wrote: > On Tue, 22 Jul 2025 21:34:45 +0200 Francois Dugast <francois.dugast@intel.com> wrote: > > > When the PMD swap entry is device private and owned by the caller, > > skip the range faulting and instead just set the correct HMM PFNs. > > This is similar to the logic for PTEs in hmm_vma_handle_pte(). > > Please always tell us why a patch does something, not only what it does. Sure, let me improve this in the next version. > > > For now, each hmm_pfns[i] entry is populated as it is currently done > > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up > > optimization could be to make use of the order and skip populating > > subsequent PFNs. > > I infer from this paragraph that this patch is a performance > optimization? Have its effects been measured? Yes, this performance optimization would come from avoiding the loop over the range but it has neither been properly tested nor measured yet. > > > --- a/mm/hmm.c > > +++ b/mm/hmm.c > > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, > > } > > > > if (!pmd_present(pmd)) { > > + swp_entry_t entry = pmd_to_swp_entry(pmd); > > + > > + /* > > + * Don't fault in device private pages owned by the caller, > > + * just report the PFNs. > > + */ > > Similarly, this tells us "what" it does, which is fairly obvious from > the code itself. What is not obvious from the code is the "why". Indeed, will fix. > > > + if (is_device_private_entry(entry) && > > + pfn_swap_entry_folio(entry)->pgmap->owner == > > + range->dev_private_owner) { > > + unsigned long cpu_flags = HMM_PFN_VALID | > > + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT); > > + unsigned long pfn = swp_offset_pfn(entry); > > + unsigned long i; > > + > > + if (is_writable_device_private_entry(entry)) > > + cpu_flags |= HMM_PFN_WRITE; > > + > > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { > > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; > > + hmm_pfns[i] |= pfn | cpu_flags; > > + } > > + > > + return 0; > > + } > > + > > if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) > > return -EFAULT; > > return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); >
On Wed, Jul 23, 2025 at 05:34:17PM +0200, Francois Dugast wrote:
> On Tue, Jul 22, 2025 at 01:07:21PM -0700, Andrew Morton wrote:
> > On Tue, 22 Jul 2025 21:34:45 +0200 Francois Dugast <francois.dugast@intel.com> wrote:
> > 
> > > When the PMD swap entry is device private and owned by the caller,
> > > skip the range faulting and instead just set the correct HMM PFNs.
> > > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> > 
> > Please always tell us why a patch does something, not only what it does.
> 
> Sure, let me improve this in the next version.
> 
> > > For now, each hmm_pfns[i] entry is populated as it is currently done
> > > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > > optimization could be to make use of the order and skip populating
> > > subsequent PFNs.
> > 
> > I infer from this paragraph that this patch is a performance
> > optimization?  Have its effects been measured?
> 
> Yes, this performance optimization would come from avoiding the loop
> over the range but it has neither been properly tested nor measured
> yet.
> 

This is also a functional change. Once THP device pages are enabled (for
performance), we will encounter device-private swap entries in PMDs. At
that point, the correct behavior is to populate HMM PFNs from the swap
entry when dev_private_owner matches; otherwise, trigger a fault if the
HMM range-walk input requests one, or skip it in the non-faulting case.

It's harmless to merge this patch before THP device pages are enabled,
since with the current code base we never find device-private swap
entries in PMDs.

I'd include something like the above explanation in the patch commit
message, or in code comments if needed.

Matt

> > > --- a/mm/hmm.c
> > > +++ b/mm/hmm.c
> > > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> > >  	}
> > > 
> > >  	if (!pmd_present(pmd)) {
> > > +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> > > +
> > > +		/*
> > > +		 * Don't fault in device private pages owned by the caller,
> > > +		 * just report the PFNs.
> > > +		 */
> > 
> > Similarly, this tells us "what" it does, which is fairly obvious from
> > the code itself.  What is not obvious from the code is the "why".
> 
> Indeed, will fix.
> 
> > > +		if (is_device_private_entry(entry) &&
> > > +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> > > +		    range->dev_private_owner) {
> > > +			unsigned long cpu_flags = HMM_PFN_VALID |
> > > +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > > +			unsigned long pfn = swp_offset_pfn(entry);
> > > +			unsigned long i;
> > > +
> > > +			if (is_writable_device_private_entry(entry))
> > > +				cpu_flags |= HMM_PFN_WRITE;
> > > +
> > > +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > > +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > > +				hmm_pfns[i] |= pfn | cpu_flags;
> > > +			}
> > > +
> > > +			return 0;
> > > +		}
> > > +
> > >  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> > >  			return -EFAULT;
> > >  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> > 
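For readers less familiar with the hmm_range_fault() calling convention
discussed here: the owner check in the hunk above only short-circuits
faulting when the caller declares itself the owner of the device-private
pages via range->dev_private_owner. A minimal caller-side sketch follows;
it is not code from this series, and struct drv_ctx, drv->mm, drv->notifier
and drv->pagemap_owner are assumptions, with the usual -EBUSY retry loop
omitted.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* The whole context structure is an assumption of this sketch. */
struct drv_ctx {
	struct mmu_interval_notifier notifier;
	struct mm_struct *mm;
	void *pagemap_owner;	/* same pointer stored in pgmap->owner */
};

static int drv_populate_range(struct drv_ctx *drv, unsigned long start,
			      unsigned long end, unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier		= &drv->notifier,
		.start			= start,
		.end			= end,
		.hmm_pfns		= pfns,
		.default_flags		= HMM_PFN_REQ_FAULT,
		.dev_private_owner	= drv->pagemap_owner,
	};
	int ret;

	/* Callers retry on -EBUSY under the notifier seqcount protocol. */
	range.notifier_seq = mmu_interval_read_begin(range.notifier);
	mmap_read_lock(drv->mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(drv->mm);

	return ret;
}

Because dev_private_owner matches the pgmap->owner of the driver's own
device-private pages, the PMD walk above reports their PFNs directly
instead of forcing a migration back to system memory.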
On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> On 7/18/25 09:40, Matthew Brost wrote:
> > On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> ...
> >>
> >> The nouveau dmem code has been enhanced to use the new THP migration
> >> capability.
> >>
> >> Feedback from the RFC [2]:
> >>
> >
> > Thanks for the patches, results look very promising. I wanted to give
> > some quick feedback:
> >
> 
> Are you seeing improvements with the patchset?
> 
> > - You appear to have missed updating hmm_range_fault, specifically
> > hmm_vma_handle_pmd, to check for device-private entries and populate the
> > HMM PFNs accordingly. My colleague François has a fix for this if you're
> > interested.
> >
> 
> Sure, please feel free to post them.
> 
> > - I believe copy_huge_pmd also needs to be updated to avoid installing a
> > migration entry if the swap entry is device-private. I don't have an
> > exact fix yet due to my limited experience with core MM. The test case
> > that triggers this is fairly simple: fault in a 2MB device page on the
> > GPU, then fork a process that reads the page — the kernel crashes in
> > this scenario.
> >
> 
> I'd be happy to look at any traces you have or post any fixes you have
> 

Ok, I think I have some code that works after slowly reverse-engineering
the core MM code - my test case passes without any warnings / kernel
crashes. I've included it below. Feel free to include it in your next
revision, modify it as you see fit, or do whatever you like with it.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b2563f35544..1cd6d9a10657 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1773,17 +1773,46 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
 
 		VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
-				!is_device_private_entry(entry));
-		if (!is_readable_migration_entry(entry)) {
-			entry = make_readable_migration_entry(
-					swp_offset(entry));
+			  !is_device_private_entry(entry));
+
+		if (!is_device_private_entry(entry) &&
+		    !is_readable_migration_entry(entry)) {
+			entry = make_readable_migration_entry(swp_offset(entry));
 			pmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*src_pmd))
 				pmd = pmd_swp_mksoft_dirty(pmd);
 			if (pmd_swp_uffd_wp(*src_pmd))
 				pmd = pmd_swp_mkuffd_wp(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
+		} else if (is_device_private_entry(entry)) {
+			if (is_writable_device_private_entry(entry)) {
+				entry = make_readable_device_private_entry(swp_offset(entry));
+
+				pmd = swp_entry_to_pmd(entry);
+				if (pmd_swp_soft_dirty(*src_pmd))
+					pmd = pmd_swp_mksoft_dirty(pmd);
+				if (pmd_swp_uffd_wp(*src_pmd))
+					pmd = pmd_swp_mkuffd_wp(pmd);
+				set_pmd_at(src_mm, addr, src_pmd, pmd);
+			}
+
+			src_page = pfn_swap_entry_to_page(entry);
+			VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+			src_folio = page_folio(src_page);
+
+			folio_get(src_folio);
+			if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page,
+								 dst_vma, src_vma))) {
+				/* Page maybe pinned: split and retry the fault on PTEs. */
+				folio_put(src_folio);
+				pte_free(dst_mm, pgtable);
+				spin_unlock(src_ptl);
+				spin_unlock(dst_ptl);
+				__split_huge_pmd(src_vma, src_pmd, addr, false);
+				return -EAGAIN;
+			}
 		}
+
 		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(dst_mm);
 		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);

Matt

> Thanks for the feedback
> Balbir Singh
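The scenario described earlier in the thread that exercises this path
(fault a 2MB device page, then fork and let the child read it) can be
sketched in userspace roughly as below. This is not taken from
hmm-tests.c; migrate_to_device() is a placeholder for whatever
driver-specific ioctl or GPU access moves the range to device-private
memory, and getting a single THP is best-effort.

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SZ_2M (2UL * 1024 * 1024)

/* Placeholder: migrate [buf, buf + len) to device-private memory. */
static void migrate_to_device(void *buf, size_t len)
{
	(void)buf;
	(void)len;
}

int main(void)
{
	void *buf;
	pid_t pid;

	/* 2MB-aligned anonymous memory so it can be backed by a THP. */
	if (posix_memalign(&buf, SZ_2M, SZ_2M))
		return 1;
	madvise(buf, SZ_2M, MADV_HUGEPAGE);
	memset(buf, 0xaa, SZ_2M);	/* fault in, ideally as one THP */

	migrate_to_device(buf, SZ_2M);	/* now a device-private PMD entry */

	pid = fork();			/* copy_huge_pmd() copies the entry */
	if (pid == 0) {
		/* The child's read faults on the copied device-private PMD. */
		unsigned char c = ((volatile unsigned char *)buf)[0];

		_exit(c == 0xaa ? 0 : 1);
	}
	waitpid(pid, NULL, 0);
	return 0;
}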
On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> On 7/18/25 09:40, Matthew Brost wrote:
> > On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> ...
> >>
> >> The nouveau dmem code has been enhanced to use the new THP migration
> >> capability.
> >>
> >> Feedback from the RFC [2]:
> >>
> >
> > Thanks for the patches, results look very promising. I wanted to give
> > some quick feedback:
> >
> 
> Are you seeing improvements with the patchset?
> 

We're nowhere near stable yet, but basic testing shows that CPU time
from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
a 2MB GPU fault. A lot of this improvement comes from dma-mapping at
2MB granularity for the CPU<->GPU copy rather than mapping 512 4K pages.

> > - You appear to have missed updating hmm_range_fault, specifically
> > hmm_vma_handle_pmd, to check for device-private entries and populate the
> > HMM PFNs accordingly. My colleague François has a fix for this if you're
> > interested.
> >
> 
> Sure, please feel free to post them.
> 
> > - I believe copy_huge_pmd also needs to be updated to avoid installing a
> > migration entry if the swap entry is device-private. I don't have an
> > exact fix yet due to my limited experience with core MM. The test case
> > that triggers this is fairly simple: fault in a 2MB device page on the
> > GPU, then fork a process that reads the page — the kernel crashes in
> > this scenario.
> >
> 
> I'd be happy to look at any traces you have or post any fixes you have
> 

I've got it so the kernel doesn't explode, but I still see warnings like:

[ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
[ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
[ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504

I basically just skip the is_swap_pmd clause if the entry is device
private and let the rest of the function execute. This avoids installing
a migration entry, which isn't required and was causing the crash, and
allows the rmap code to run, which flips the pages to not
anonymous-exclusive (effectively making them copy-on-write, though that
doesn't fully apply to device pages). It's not 100% correct yet, but
it's a step in the right direction.

Matt

> Thanks for the feedback
> Balbir Singh
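A rough sketch of where the dma-mapping part of that win comes from,
assuming a physically contiguous PMD-sized THP and a generic struct
device (this is not code from any driver in the series): one
dma_map_page() call covering HPAGE_PMD_SIZE replaces a loop of 512
PAGE_SIZE mappings.

#include <linux/dma-mapping.h>
#include <linux/huge_mm.h>

/* Map the whole PMD-sized folio for the CPU<->GPU copy in one call. */
static dma_addr_t map_copy_src_thp(struct device *dev, struct page *head)
{
	/* A THP is physically contiguous, so one mapping covers all 2MB. */
	return dma_map_page(dev, head, 0, HPAGE_PMD_SIZE, DMA_BIDIRECTIONAL);
}

/* The order-0 equivalent: HPAGE_PMD_NR (512) mappings for the same 2MB. */
static int map_copy_src_4k(struct device *dev, struct page *head,
			   dma_addr_t *addrs)
{
	unsigned long i;

	for (i = 0; i < HPAGE_PMD_NR; i++) {
		addrs[i] = dma_map_page(dev, head + i, 0, PAGE_SIZE,
					DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, addrs[i]))
			return -ENOMEM;	/* real code unmaps what was mapped */
	}
	return 0;
}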
On 7/18/25 14:57, Matthew Brost wrote:
> On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
>> On 7/18/25 09:40, Matthew Brost wrote:
>>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
>> ...
>>>>
>>>> The nouveau dmem code has been enhanced to use the new THP migration
>>>> capability.
>>>>
>>>> Feedback from the RFC [2]:
>>>>
>>>
>>> Thanks for the patches, results look very promising. I wanted to give
>>> some quick feedback:
>>>
>>
>> Are you seeing improvements with the patchset?
>>
> 
> We're nowhere near stable yet, but basic testing shows that CPU time
> from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
> a 2MB GPU fault. A lot of this improvement comes from dma-mapping at
> 2MB granularity for the CPU<->GPU copy rather than mapping 512 4K pages.
> 
>>> - You appear to have missed updating hmm_range_fault, specifically
>>> hmm_vma_handle_pmd, to check for device-private entries and populate the
>>> HMM PFNs accordingly. My colleague François has a fix for this if you're
>>> interested.
>>>
>>
>> Sure, please feel free to post them.
>>
>>> - I believe copy_huge_pmd also needs to be updated to avoid installing a
>>> migration entry if the swap entry is device-private. I don't have an
>>> exact fix yet due to my limited experience with core MM. The test case
>>> that triggers this is fairly simple: fault in a 2MB device page on the
>>> GPU, then fork a process that reads the page — the kernel crashes in
>>> this scenario.
>>>
>>
>> I'd be happy to look at any traces you have or post any fixes you have
>>
> 
> I've got it so the kernel doesn't explode, but I still see warnings like:
> 
> [ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
> [ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
> [ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504
> 
> I basically just skip the is_swap_pmd clause if the entry is device
> private and let the rest of the function execute. This avoids installing
> a migration entry, which isn't required and was causing the crash, and
> allows the rmap code to run, which flips the pages to not
> anonymous-exclusive (effectively making them copy-on-write, though that
> doesn't fully apply to device pages). It's not 100% correct yet, but
> it's a step in the right direction.
> 

Thanks, could you post the stack trace as well? This is usually a
symptom of not freeing up the page table cleanly.

Do you have my latest patches that have

	if (is_swap_pmd(pmdval)) {
		swp_entry_t entry = pmd_to_swp_entry(pmdval);

		if (is_device_private_entry(entry))
			goto nomap;
	}

in __pte_offset_map()?

Balbir Singh
On Tue, Jul 22, 2025 at 09:48:18AM +1000, Balbir Singh wrote:
> On 7/18/25 14:57, Matthew Brost wrote:
> > On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> >> On 7/18/25 09:40, Matthew Brost wrote:
> >>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> >> ...
> >>>>
> >>>> The nouveau dmem code has been enhanced to use the new THP migration
> >>>> capability.
> >>>>
> >>>> Feedback from the RFC [2]:
> >>>>
> >>>
> >>> Thanks for the patches, results look very promising. I wanted to give
> >>> some quick feedback:
> >>>
> >>
> >> Are you seeing improvements with the patchset?
> >>
> > 
> > We're nowhere near stable yet, but basic testing shows that CPU time
> > from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
> > a 2MB GPU fault. A lot of this improvement comes from dma-mapping at
> > 2MB granularity for the CPU<->GPU copy rather than mapping 512 4K pages.
> > 
> >>> - You appear to have missed updating hmm_range_fault, specifically
> >>> hmm_vma_handle_pmd, to check for device-private entries and populate the
> >>> HMM PFNs accordingly. My colleague François has a fix for this if you're
> >>> interested.
> >>>
> >>
> >> Sure, please feel free to post them.
> >>
> >>> - I believe copy_huge_pmd also needs to be updated to avoid installing a
> >>> migration entry if the swap entry is device-private. I don't have an
> >>> exact fix yet due to my limited experience with core MM. The test case
> >>> that triggers this is fairly simple: fault in a 2MB device page on the
> >>> GPU, then fork a process that reads the page — the kernel crashes in
> >>> this scenario.
> >>>
> >>
> >> I'd be happy to look at any traces you have or post any fixes you have
> >>
> > 
> > I've got it so the kernel doesn't explode, but I still see warnings like:
> > 
> > [ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
> > [ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
> > [ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504
> > 
> > I basically just skip the is_swap_pmd clause if the entry is device
> > private and let the rest of the function execute. This avoids installing
> > a migration entry, which isn't required and was causing the crash, and
> > allows the rmap code to run, which flips the pages to not
> > anonymous-exclusive (effectively making them copy-on-write, though that
> > doesn't fully apply to device pages). It's not 100% correct yet, but
> > it's a step in the right direction.
> > 
> 
> Thanks, could you post the stack trace as well? This is usually a
> symptom of not freeing up the page table cleanly.
> 

Did you see my reply here [1]? I've got this working cleanly.

I actually have all my tests passing with a few additional core MM
changes. I'll reply shortly to a few other patches with those details
and will also send over a complete set of the core MM changes we've made
to get things stable.

Matt

[1] https://lore.kernel.org/linux-mm/aHrsdvjjliBBdVQm@lstrano-desk.jf.intel.com/

> Do you have my latest patches that have
> 
> 	if (is_swap_pmd(pmdval)) {
> 		swp_entry_t entry = pmd_to_swp_entry(pmdval);
> 
> 		if (is_device_private_entry(entry))
> 			goto nomap;
> 	}
> 
> in __pte_offset_map()?
> 
> Balbir Singh
On 7/22/25 10:07, Matthew Brost wrote:
> On Tue, Jul 22, 2025 at 09:48:18AM +1000, Balbir Singh wrote:
>> On 7/18/25 14:57, Matthew Brost wrote:
>>> On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
>>>> On 7/18/25 09:40, Matthew Brost wrote:
>>>>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
>>>> ...
>>>>>>
>>>>>> The nouveau dmem code has been enhanced to use the new THP migration
>>>>>> capability.
>>>>>>
>>>>>> Feedback from the RFC [2]:
>>>>>>
>>>>>
>>>>> Thanks for the patches, results look very promising. I wanted to give
>>>>> some quick feedback:
>>>>>
>>>>
>>>> Are you seeing improvements with the patchset?
>>>>
>>>
>>> We're nowhere near stable yet, but basic testing shows that CPU time
>>> from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
>>> a 2MB GPU fault. A lot of this improvement comes from dma-mapping at
>>> 2MB granularity for the CPU<->GPU copy rather than mapping 512 4K pages.
>>>
>>>>> - You appear to have missed updating hmm_range_fault, specifically
>>>>> hmm_vma_handle_pmd, to check for device-private entries and populate the
>>>>> HMM PFNs accordingly. My colleague François has a fix for this if you're
>>>>> interested.
>>>>>
>>>>
>>>> Sure, please feel free to post them.
>>>>
>>>>> - I believe copy_huge_pmd also needs to be updated to avoid installing a
>>>>> migration entry if the swap entry is device-private. I don't have an
>>>>> exact fix yet due to my limited experience with core MM. The test case
>>>>> that triggers this is fairly simple: fault in a 2MB device page on the
>>>>> GPU, then fork a process that reads the page — the kernel crashes in
>>>>> this scenario.
>>>>>
>>>>
>>>> I'd be happy to look at any traces you have or post any fixes you have
>>>>
>>>
>>> I've got it so the kernel doesn't explode, but I still see warnings like:
>>>
>>> [ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
>>> [ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
>>> [ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504
>>>
>>> I basically just skip the is_swap_pmd clause if the entry is device
>>> private and let the rest of the function execute. This avoids installing
>>> a migration entry, which isn't required and was causing the crash, and
>>> allows the rmap code to run, which flips the pages to not
>>> anonymous-exclusive (effectively making them copy-on-write, though that
>>> doesn't fully apply to device pages). It's not 100% correct yet, but
>>> it's a step in the right direction.
>>>
>>
>> Thanks, could you post the stack trace as well? This is usually a
>> symptom of not freeing up the page table cleanly.
>>
> 
> Did you see my reply here [1]? I've got this working cleanly.
> 
> I actually have all my tests passing with a few additional core MM
> changes. I'll reply shortly to a few other patches with those details
> and will also send over a complete set of the core MM changes we've made
> to get things stable.
> 
> Matt
> 
> [1] https://lore.kernel.org/linux-mm/aHrsdvjjliBBdVQm@lstrano-desk.jf.intel.com/

Sorry I missed it. Checking now!

I'd be happy to hear any numbers you have. In my throughput tests that
rely on page copying (lib/test_hmm.c), I am seeing a 500% improvement in
performance and an 80% improvement in latency.

Thanks,
Balbir