[v7 00/16] mm: support device-private THP

Balbir Singh posted 16 patches 4 months, 1 week ago
[v7 00/16] mm: support device-private THP
Posted by Balbir Singh 4 months, 1 week ago
This patch series introduces support for Transparent Huge Page
(THP) migration in zone device-private memory. The implementation enables
efficient migration of large folios between system memory and
device-private memory.

Background

The current zone device-private memory implementation supports only
PAGE_SIZE granularity, leading to:
- Increased TLB pressure
- Inefficient migration between CPU and device memory

This series extends the existing zone device-private infrastructure to
support THP, leading to:
- Reduced page table overhead
- Improved memory bandwidth utilization
- Seamless fallback to base pages when needed

In my local testing (using lib/test_hmm) and a throughput test, the
series shows a 350% improvement in data transfer throughput and an
80% improvement in latency.

These patches build on the earlier posts by Ralph Campbell [1].

Two new flags are added to the migrate_vma interfaces to select and mark
compound pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed in as an argument.
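As a rough, non-compilable sketch of the driver-side flow (field names
follow struct migrate_vma in include/linux/migrate.h; the
MIGRATE_VMA_SELECT_COMPOUND flag is the one this series adds; device page
allocation, the copy itself and error handling are elided):

```c
/* Sketch only: a driver requesting compound (THP) migration. */
struct migrate_vma args = {
	.vma		= vma,
	.start		= start,
	.end		= end,
	.src		= src_pfns,
	.dst		= dst_pfns,
	.pgmap_owner	= dev_private_owner,
	.flags		= MIGRATE_VMA_SELECT_SYSTEM |
			  MIGRATE_VMA_SELECT_COMPOUND,
};

if (migrate_vma_setup(&args))
	return -EFAULT;
/* ... allocate device pages, fill dst_pfns[], copy data ... */
migrate_vma_pages(&args);
migrate_vma_finalize(&args);
```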

The series also adds zone device awareness to (m)THP pages, along
with fault handling of large zone device-private pages. The page vma walk
and rmap code are also made zone device aware. Support has been
added for folios that need to be split in the middle
of migration (when the src and dst sides do not agree on
MIGRATE_PFN_COMPOUND), which occurs when the src side of the migration can
migrate large pages, but the destination has not been able to allocate
large pages. The code also uses folio_split() when migrating
THP pages; this path is taken when MIGRATE_VMA_SELECT_COMPOUND is not
passed as an argument to migrate_vma_setup().

The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and for the folio split path. A new
throughput test has been added as well.

The nouveau dmem code has been enhanced to use the new THP migration
capability. 

mTHP support:

The patches hard-code HPAGE_PMD_NR in a few places, but the code has
been kept generic enough to support various order sizes. With additional
refactoring, support for different order sizes should be possible.

The future plan is to post enhancements to support mTHP with a rough
design as follows:

1. Add the notion of allowable thp orders to the HMM based test driver
2. For non-PMD-based THP paths in migrate_device.c, check whether
   a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for migration
4. Migrate and finalize

The mTHP patches can be built on top of this series; the key design
elements still to be worked out are infrastructure and driver support
for folios of multiple orders and their migration.

Patches adding HMM support for large folios have already been posted and
are in mm-unstable.

Cc: Andrew Morton <akpm@linux-foundation.org> 
Cc: David Hildenbrand <david@redhat.com> 
Cc: Zi Yan <ziy@nvidia.com>  
Cc: Joshua Hahn <joshua.hahnjy@gmail.com> 
Cc: Rakie Kim <rakie.kim@sk.com> 
Cc: Byungchul Park <byungchul@sk.com> 
Cc: Gregory Price <gourry@gourry.net> 
Cc: Ying Huang <ying.huang@linux.alibaba.com> 
Cc: Alistair Popple <apopple@nvidia.com> 
Cc: Oscar Salvador <osalvador@suse.de> 
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> 
Cc: Baolin Wang <baolin.wang@linux.alibaba.com> 
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> 
Cc: Nico Pache <npache@redhat.com> 
Cc: Ryan Roberts <ryan.roberts@arm.com> 
Cc: Dev Jain <dev.jain@arm.com> 
Cc: Barry Song <baohua@kernel.org> 
Cc: Lyude Paul <lyude@redhat.com> 
Cc: Danilo Krummrich <dakr@kernel.org> 
Cc: David Airlie <airlied@gmail.com> 
Cc: Simona Vetter <simona@ffwll.ch> 
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

References:
[1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
[3] https://lore.kernel.org/lkml/20250703233511.2028395-1-balbirs@nvidia.com/
[4] https://lkml.kernel.org/r/20250902130713.1644661-1-francois.dugast@intel.com
[5] https://lore.kernel.org/lkml/20250730092139.3890844-1-balbirs@nvidia.com/
[6] https://lore.kernel.org/lkml/20250812024036.690064-1-balbirs@nvidia.com/
[7] https://lore.kernel.org/lkml/20250903011900.3657435-1-balbirs@nvidia.com/
[8] https://lore.kernel.org/all/20250908000448.180088-1-balbirs@nvidia.com/
[9] https://lore.kernel.org/lkml/20250916122128.2098535-1-balbirs@nvidia.com/

These patches are built on top of mm/mm-new

Changelog v7 [9]:
- Rebased against mm/mm-new again
  - Addressed more review comments from Zi Yan and David Hildenbrand
  - Code flow reorganization of split_huge_pmd_locked
  - page_free callback is now changed to folio_free (posted as patch 2
    in the series)
  - zone_device_page_init() takes an order parameter
  - migrate_vma_split_pages() is now called
    migrate_vma_split_unmapped_folio()
  - More cleanups and fixes
  - Patch 6 partial unmapped folio case has been split into two parts
    some of the content has been moved to the actual device private
    split handling code
  - Fault handling for device-private pages now uses folio routines
    instead of page_get/trylock/put routines.
  
Changelog v6 [8]:
- Rebased against mm/mm-new after fixing the following
  - Two issues reported by kernel test robot
    - m68k requires an lvalue for pmd_present()
    - BUILD_BUG_ON() issues when THP is disabled
  - kernel doc warnings reported on linux-next
    - Thanks Stephen Rothwell!
  - smatch fixes and issues reported
    - Fix issue with potential NULL page
    - Report about young being uninitialized for device-private pages in
      __split_huge_pmd_locked()
- Several review comments from David
  - Indentation changes and style improvements
  - Removal of some unwanted extra lines
  - Introduction of new helper function is_pmd_non_present_folio_entry()
    to represent migration and device-private PMDs
  - Code flow refactoring into migration and device private paths
  - More consistent use of helper function is_pmd_device_private()
- Review comments from Mika
  - folio_get() is not required for huge_pmd prior to split

Changelog v5 [7] :
- Rebased against mm/mm-new (resolved conflict caused by
  MIGRATEPAGE_SUCCESS removal)
- Fixed a kernel-doc warning reported by kernel test robot

Changelog v4 [6] :
- Addressed review comments
  - Split patch 2 into a smaller set of patches
  - PVMW_THP_DEVICE_PRIVATE flag is no longer present
  - damon/page_idle and other page_vma_mapped_walk paths are aware of
    device-private folios
  - No more flush for non-present entries in set_pmd_migration_entry
  - Implemented a helper function for migrate_vma_split_folio() which
    splits large folios if seen during a pte walk
  - Removed the controversial change for folio_ref_freeze using
    folio_expected_ref_count()
  - Removed functions invoked from within VM_WARN_ON
  - New test cases and fixes from Matthew Brost
  - Fixed bugs reported by kernel test robot (Thanks!)
  - Several fixes for THP support in nouveau driver

Changelog v3 [5] :
- Addressed review comments
  - No more split_device_private_folio() helper
  - Device private large folios do not end up on deferred scan lists
  - Removed THP size order checks when initializing zone device folio
  - Fixed bugs reported by kernel test robot (Thanks!)

Changelog v2 [3] :
- Several review comments from David Hildenbrand were addressed; Mika,
  Zi and Matthew also provided helpful review comments
  - In paths where it makes sense a new helper
    is_pmd_device_private_entry() is used
  - anon_exclusive handling of zone device private pages in
    split_huge_pmd_locked() has been fixed
  - Patches that introduced helpers have been folded into where they
    are used
- Zone device handling in mm/huge_memory.c has benefited from the code
  and testing of Matthew Brost, he helped find bugs related to
  copy_huge_pmd() and partial unmapping of folios.
- Zone device THP PMD support via page_vma_mapped_walk() is restricted
  to try_to_migrate_one()
- There is a new dedicated helper to split large zone device folios

Changelog v1 [2]:
- Support for handling fault_folio and using trylock in the fault path
- A new test case has been added to measure the throughput improvement
- General refactoring of code to keep up with the changes in mm
- New split folio callback when the entire split is complete/done. The
  callback is used to know when the head order needs to be reset.

Testing:
- Testing was done with ZONE_DEVICE private pages on an x86 VM

Balbir Singh (15):
  mm/zone_device: support large zone device private folios
  mm/zone_device: Rename page_free callback to folio_free
  mm/huge_memory: add device-private THP support to PMD operations
  mm/rmap: extend rmap and migration support device-private entries
  mm/huge_memory: implement device-private THP splitting
  mm/migrate_device: handle partially mapped folios during collection
  mm/migrate_device: implement THP migration of zone device pages
  mm/memory/fault: add THP fault handling for zone device private pages
  lib/test_hmm: add zone device private THP test infrastructure
  mm/memremap: add driver callback support for folio splitting
  mm/migrate_device: add THP splitting during migration
  lib/test_hmm: add large page allocation failure testing
  selftests/mm/hmm-tests: new tests for zone device THP migration
  selftests/mm/hmm-tests: new throughput tests including THP
  gpu/drm/nouveau: enable THP support for GPU memory migration

Matthew Brost (1):
  selftests/mm/hmm-tests: partial unmap, mremap and anon_write tests

 Documentation/mm/memory-model.rst        |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c       |   7 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   7 +-
 drivers/gpu/drm/drm_pagemap.c            |  12 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 308 ++++++--
 drivers/gpu/drm/nouveau/nouveau_svm.c    |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h    |   3 +-
 drivers/pci/p2pdma.c                     |   5 +-
 include/linux/huge_mm.h                  |  18 +-
 include/linux/memremap.h                 |  57 +-
 include/linux/migrate.h                  |   2 +
 include/linux/swapops.h                  |  32 +
 lib/test_hmm.c                           | 448 +++++++++--
 lib/test_hmm_uapi.h                      |   3 +
 mm/damon/ops-common.c                    |  20 +-
 mm/huge_memory.c                         | 243 ++++--
 mm/memory.c                              |   5 +-
 mm/memremap.c                            |  40 +-
 mm/migrate.c                             |   1 +
 mm/migrate_device.c                      | 609 +++++++++++++--
 mm/page_idle.c                           |   7 +-
 mm/page_vma_mapped.c                     |   7 +
 mm/pgtable-generic.c                     |   2 +-
 mm/rmap.c                                |  30 +-
 tools/testing/selftests/mm/hmm-tests.c   | 919 +++++++++++++++++++++--
 25 files changed, 2399 insertions(+), 394 deletions(-)

-- 
2.51.0

Re: [v7 00/16] mm: support device-private THP
Posted by Andrew Morton 4 months ago
On Wed,  1 Oct 2025 16:56:51 +1000 Balbir Singh <balbirs@nvidia.com> wrote:

> This patch series introduces support for Transparent Huge Page
> (THP) migration in zone device-private memory. The implementation enables
> efficient migration of large folios between system memory and
> device-private memory

Lots of chatter for the v6 series, but none for v7.  I hope that's a
good sign.

> 
> HMM support for large folios, patches are already posted and in
> mm-unstable.

Not any more.  Which series was this?
Re: [v7 00/16] mm: support device-private THP
Posted by Balbir Singh 4 months ago
On 10/9/25 14:17, Andrew Morton wrote:
> On Wed,  1 Oct 2025 16:56:51 +1000 Balbir Singh <balbirs@nvidia.com> wrote:
> 
>> This patch series introduces support for Transparent Huge Page
>> (THP) migration in zone device-private memory. The implementation enables
>> efficient migration of large folios between system memory and
>> device-private memory
> 
> Lots of chatter for the v6 series, but none for v7.  I hope that's a
> good sign.
> 

I hope so too, I've tried to address the comments in v6.

>>
>> HMM support for large folios, patches are already posted and in
>> mm-unstable.
> 
> Not any more.  Which series was this?

Not a series, but a patch

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?id=10b9feee2d0dc81c44f7a9e69e7a894e33f8c4a1

Thanks,
Balbir
Re: [v7 00/16] mm: support device-private THP
Posted by Matthew Brost 4 months ago
On Thu, Oct 09, 2025 at 02:26:30PM +1100, Balbir Singh wrote:
> On 10/9/25 14:17, Andrew Morton wrote:
> > On Wed,  1 Oct 2025 16:56:51 +1000 Balbir Singh <balbirs@nvidia.com> wrote:
> > 
> >> This patch series introduces support for Transparent Huge Page
> >> (THP) migration in zone device-private memory. The implementation enables
> >> efficient migration of large folios between system memory and
> >> device-private memory
> > 
> > Lots of chatter for the v6 series, but none for v7.  I hope that's a
> > good sign.
> > 
> 
> I hope so too, I've tried to address the comments in v6.
> 

Circling back to this series, we will integrate and test this version.

> >>
> >> HMM support for large folios, patches are already posted and in
> >> mm-unstable.
> > 
> > Not any more.  Which series was this?
> 
> Not a series, but a patch
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?id=10b9feee2d0dc81c44f7a9e69e7a894e33f8c4a1

I think this [1] means this patch is in Linus's tree?

Matt

[1] https://github.com/torvalds/linux/commit/10b9feee2d0dc81c44f7a9e69e7a894e33f8c4a1 

> 
> Thanks,
> Balbir
Re: [v7 00/16] mm: support device-private THP
Posted by Andrew Morton 2 months, 3 weeks ago
On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:

> > >> This patch series introduces support for Transparent Huge Page
> > >> (THP) migration in zone device-private memory. The implementation enables
> > >> efficient migration of large folios between system memory and
> > >> device-private memory
> > > 
> > > Lots of chatter for the v6 series, but none for v7.  I hope that's a
> > > good sign.
> > > 
> > 
> > I hope so too, I've tried to address the comments in v6.
> > 
> 
> Circling back to this series, we will integrate and test this version.

How'd it go?

Balbir, what's the status here?  It's been a month and this series
still has a "needs a new version" feeling to it.  If so, very soon
please.

TODOs which I have noted are

https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@

plus a general re-read of the
mm-migrate_device-add-thp-splitting-during-migration.patch review
discussion.
Re: [v7 00/16] mm: support device-private THP
Posted by Balbir Singh 2 months, 3 weeks ago
On 11/12/25 10:43, Andrew Morton wrote:
> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> 
>>>>> This patch series introduces support for Transparent Huge Page
>>>>> (THP) migration in zone device-private memory. The implementation enables
>>>>> efficient migration of large folios between system memory and
>>>>> device-private memory
>>>>
>>>> Lots of chatter for the v6 series, but none for v7.  I hope that's a
>>>> good sign.
>>>>
>>>
>>> I hope so too, I've tried to address the comments in v6.
>>>
>>
>> Circling back to this series, we will integrate and test this version.
> 
> How'd it go?
> 
> Balbir, what's the status here?  It's been a month and this series
> still has a "needs a new version" feeling to it.  If so, very soon
> please.
> 

I don't think this needs a new revision; I've been testing frequently
at my end to see if I can catch any regressions. I have a patch update for
mm-migrate_device-add-thp-splitting-during-migration.patch; it can be applied
on top, or I can send a new version of the patch. I was waiting
for feedback before sending the patch out, but I'll do it now.

> TODOs which I have noted are
> 
> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com

This was a clarification on the HMM patch mentioned in the changelog

> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com

That's a minor comment about not using a temporary declaration; I don't think we need it, but let me know if you feel strongly

> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com

I have a patch for this, which I posted (the one mentioned above); I can update and resend it if required

> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
> 

I can't seem to open this

> plus a general re-read of the
> mm-migrate_device-add-thp-splitting-during-migration.patch review
> discussion.
> 
That's the patch I have

Thanks for following up
Balbir
Re: [v7 00/16] mm: support device-private THP
Posted by Matthew Brost 2 months, 2 weeks ago
On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
> On 11/12/25 10:43, Andrew Morton wrote:
> > On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> > 
> >>>>> This patch series introduces support for Transparent Huge Page
> >>>>> (THP) migration in zone device-private memory. The implementation enables
> >>>>> efficient migration of large folios between system memory and
> >>>>> device-private memory
> >>>>
> >>>> Lots of chatter for the v6 series, but none for v7.  I hope that's a
> >>>> good sign.
> >>>>
> >>>
> >>> I hope so too, I've tried to address the comments in v6.
> >>>
> >>
> >> Circling back to this series, we will itegrate and test this version.
> > 
> > How'd it go?
> > 

My apologies for the delay—I got distracted by other tasks in Xe (my
driver) and was out for a bit. Unfortunately, this series breaks
something in the existing core MM code for the Xe SVM implementation. I
have an extensive test case that hammers on SVM, which fully passes
prior to applying this series, but fails randomly with the series
applied (to drm-tip-rc6) due to the below kernel lockup.

I've tried to trace where the migration PTE gets installed but not
removed, and to isolate a test case that causes this failure, but no
luck so far. I'll keep digging as I have time.

Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
the same issue as above eventually occurs), but I do need two additional
core MM patches: one is new code required for Xe, and the other could be
considered a bug fix. Those patches can be included when Xe merges SVM THP
support, but at a minimum this series must not break Xe SVM before it merges.

Stack trace:

INFO: task kworker/u65:2:1642 blocked for more than 30
seconds.
[  212.624286]       Tainted: G S      W           6.18.0-rc6-xe+ #1719
[  212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  212.638285] task:kworker/u65:2   state:D stack:0     pid:1642
tgid:1642  ppid:2      task_flags:0x4208060 flags:0x00080000
[  212.638288] Workqueue: xe_page_fault_work_queue
xe_pagefault_queue_work [xe]
[  212.638323] Call Trace:
[  212.638324]  <TASK>
[  212.638325]  __schedule+0x4b0/0x990
[  212.638330]  schedule+0x22/0xd0
[  212.638331]  io_schedule+0x41/0x60
[  212.638333]  migration_entry_wait_on_locked+0x1d8/0x2d0
[  212.638336]  ? __pfx_wake_page_function+0x10/0x10
[  212.638339]  migration_entry_wait+0xd2/0xe0
[  212.638341]  hmm_vma_walk_pmd+0x7c9/0x8d0
[  212.638343]  walk_pgd_range+0x51d/0xa40
[  212.638345]  __walk_page_range+0x75/0x1e0
[  212.638347]  walk_page_range_mm+0x138/0x1f0
[  212.638349]  hmm_range_fault+0x59/0xa0
[  212.638351]  drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
[  212.638354]  drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
[  212.638355]  __xe_svm_handle_pagefault+0x259/0x900 [xe]
[  212.638375]  ? update_load_avg+0x7f/0x6c0
[  212.638377]  ? update_curr+0x13d/0x170
[  212.638379]  xe_svm_handle_pagefault+0x37/0x90 [xe]
[  212.638396]  xe_pagefault_queue_work+0x2da/0x3c0 [xe]
[  212.638420]  process_one_work+0x16e/0x2e0
[  212.638422]  worker_thread+0x284/0x410
[  212.638423]  ? __pfx_worker_thread+0x10/0x10
[  212.638425]  kthread+0xec/0x210
[  212.638427]  ? __pfx_kthread+0x10/0x10
[  212.638428]  ? __pfx_kthread+0x10/0x10
[  212.638430]  ret_from_fork+0xbd/0x100
[  212.638433]  ? __pfx_kthread+0x10/0x10
[  212.638434]  ret_from_fork_asm+0x1a/0x30
[  212.638436]  </TASK>

Matt 

> > Balbir, what's the status here?  It's been a month and this series
> > still has a "needs a new version" feeling to it.  If so, very soon
> > please.
> > 
> 
> I don't think this needs a new revision, I've been testing frequently
> at my end to see if I can catch any regressions. I have a patch update for
> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
> on top or I can send a new version of the patch. I was waiting
> on any feedback before I sent the patch out, but I'll do it now.
> 
> > TODOs which I have noted are
> > 
> > https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
> 
> This was a clarification on the HMM patch mentioned in the changelog
> 
> > https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
> 
> That's a minor comment on not using a temporary declaration, I don't think we need it, let me know if you feel strongly
> 
> > https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
> 
> I have a patch for this, which I posted, I can do an update and resend it if required (the one mentioned above)
> 
> > https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
> > 
> 
> I can't seem to open this
> 
> > plus a general re-read of the
> > mm-migrate_device-add-thp-splitting-during-migration.patch review
> > discussion.
> > 
> That's the patch I have
> 
> Thanks for following up
> Balbir
Re: [v7 00/16] mm: support device-private THP
Posted by Balbir Singh 2 months, 2 weeks ago
On 11/20/25 13:40, Matthew Brost wrote:
> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
>> On 11/12/25 10:43, Andrew Morton wrote:
>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>>>
>>>>>>> This patch series introduces support for Transparent Huge Page
>>>>>>> (THP) migration in zone device-private memory. The implementation enables
>>>>>>> efficient migration of large folios between system memory and
>>>>>>> device-private memory
>>>>>>
>>>>>> Lots of chatter for the v6 series, but none for v7.  I hope that's a
>>>>>> good sign.
>>>>>>
>>>>>
>>>>> I hope so too, I've tried to address the comments in v6.
>>>>>
>>>>
>>>> Circling back to this series, we will integrate and test this version.
>>>
>>> How'd it go?
>>>
> 
> My apologies for the delay—I got distracted by other tasks in Xe (my
> driver) and was out for a bit. Unfortunately, this series breaks
> something in the existing core MM code for the Xe SVM implementation. I
> have an extensive test case that hammers on SVM, which fully passes
> prior to applying this series, but fails randomly with the series
> applied (to drm-tip-rc6) due to the below kernel lockup.
> 
> I've tried to trace where the migration PTE gets installed but not
> removed or isolate a test case which causes this failure but no luck so
> far. I'll keep digging as I have time.
> 
> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
> the same issue as above eventually occurs), but I do need two additional
> core MM patches—one is new code required for Xe, and the other could be
> considered a bug fix. Those patches can be included when Xe merges SVM THP
> support but we need at least not break Xe SVM before this series merges.
> 
> Stack trace:
> 
> INFO: task kworker/u65:2:1642 blocked for more than 30
> seconds.
> [  212.624286]       Tainted: G S      W           6.18.0-rc6-xe+ #1719
> [  212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  212.638285] task:kworker/u65:2   state:D stack:0     pid:1642
> tgid:1642  ppid:2      task_flags:0x4208060 flags:0x00080000
> [  212.638288] Workqueue: xe_page_fault_work_queue
> xe_pagefault_queue_work [xe]
> [  212.638323] Call Trace:
> [  212.638324]  <TASK>
> [  212.638325]  __schedule+0x4b0/0x990
> [  212.638330]  schedule+0x22/0xd0
> [  212.638331]  io_schedule+0x41/0x60
> [  212.638333]  migration_entry_wait_on_locked+0x1d8/0x2d0
> [  212.638336]  ? __pfx_wake_page_function+0x10/0x10
> [  212.638339]  migration_entry_wait+0xd2/0xe0
> [  212.638341]  hmm_vma_walk_pmd+0x7c9/0x8d0
> [  212.638343]  walk_pgd_range+0x51d/0xa40
> [  212.638345]  __walk_page_range+0x75/0x1e0
> [  212.638347]  walk_page_range_mm+0x138/0x1f0
> [  212.638349]  hmm_range_fault+0x59/0xa0
> [  212.638351]  drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
> [  212.638354]  drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
> [  212.638355]  __xe_svm_handle_pagefault+0x259/0x900 [xe]
> [  212.638375]  ? update_load_avg+0x7f/0x6c0
> [  212.638377]  ? update_curr+0x13d/0x170
> [  212.638379]  xe_svm_handle_pagefault+0x37/0x90 [xe]
> [  212.638396]  xe_pagefault_queue_work+0x2da/0x3c0 [xe]
> [  212.638420]  process_one_work+0x16e/0x2e0
> [  212.638422]  worker_thread+0x284/0x410
> [  212.638423]  ? __pfx_worker_thread+0x10/0x10
> [  212.638425]  kthread+0xec/0x210
> [  212.638427]  ? __pfx_kthread+0x10/0x10
> [  212.638428]  ? __pfx_kthread+0x10/0x10
> [  212.638430]  ret_from_fork+0xbd/0x100
> [  212.638433]  ? __pfx_kthread+0x10/0x10
> [  212.638434]  ret_from_fork_asm+0x1a/0x30
> [  212.638436]  </TASK>
> 

Hi, Matt

Thanks for the report; two questions:

1. Are you using mm/mm-unstable? We've got some fixes in there (including
   fixes to remove_migration_pmd()).
   - Generally, a left-behind migration entry is a symptom of a failed
     migration that did not clean up after itself.
2. The stack trace is from hmm_range_fault(), not something that this code touches.

The stack trace shows your code is seeing a migration entry and waiting on it.
Can you please provide a reproducer for the issue, in the form of a test in
hmm-tests.c?

Have you been able to bisect the issue?

Balbir


> Matt 
> 
>>> Balbir, what's the status here?  It's been a month and this series
>>> still has a "needs a new version" feeling to it.  If so, very soon
>>> please.
>>>
>>
>> I don't think this needs a new revision, I've been testing frequently
>> at my end to see if I can catch any regressions. I have a patch update for
>> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
>> on top or I can send a new version of the patch. I was waiting
>> on any feedback before I sent the patch out, but I'll do it now.
>>
>>> TODOs which I have noted are
>>>
>>> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
>>
>> This was a clarification on the HMM patch mentioned in the changelog
>>
>>> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
>>
>> That's a minor comment on not using a temporary declaration, I don't think we need it, let me know if you feel strongly
>>
>>> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
>>
>> I have a patch for this, which I posted, I can do an update and resend it if required (the one mentioned above)
>>
>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
>>>
>>
>> I can't seem to open this
>>
>>> plus a general re-read of the
>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
>>> discussion.
>>>
>> That's the patch I have
>>
>> Thanks for following up
>> Balbir

Re: [v7 00/16] mm: support device-private THP
Posted by Balbir Singh 2 months, 2 weeks ago
On 11/20/25 13:50, Balbir Singh wrote:
> On 11/20/25 13:40, Matthew Brost wrote:
>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
>>> On 11/12/25 10:43, Andrew Morton wrote:
>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>>>>
>>>>>>>> This patch series introduces support for Transparent Huge Page
>>>>>>>> (THP) migration in zone device-private memory. The implementation enables
>>>>>>>> efficient migration of large folios between system memory and
>>>>>>>> device-private memory
>>>>>>>
>>>>>>> Lots of chatter for the v6 series, but none for v7.  I hope that's a
>>>>>>> good sign.
>>>>>>>
>>>>>>
>>>>>> I hope so too, I've tried to address the comments in v6.
>>>>>>
>>>>>
>>>>> Circling back to this series, we will integrate and test this version.
>>>>
>>>> How'd it go?
>>>>
>>
>> My apologies for the delay—I got distracted by other tasks in Xe (my
>> driver) and was out for a bit. Unfortunately, this series breaks
>> something in the existing core MM code for the Xe SVM implementation. I
>> have an extensive test case that hammers on SVM, which fully passes
>> prior to applying this series, but fails randomly with the series
>> applied (to drm-tip-rc6) due to the below kernel lockup.
>>
>> I've tried to trace where the migration PTE gets installed but not
>> removed or isolate a test case which causes this failure but no luck so
>> far. I'll keep digging as I have time.
>>
>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
>> the same issue as above eventually occurs), but I do need two additional
>> core MM patches—one is new code required for Xe, and the other could be
>> considered a bug fix. Those patches can be included when Xe merges SVM THP
>> support but we need at least not break Xe SVM before this series merges.
>>
>> Stack trace:
>>
>> INFO: task kworker/u65:2:1642 blocked for more than 30
>> seconds.
>> [  212.624286]       Tainted: G S      W           6.18.0-rc6-xe+ #1719
>> [  212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [  212.638285] task:kworker/u65:2   state:D stack:0     pid:1642
>> tgid:1642  ppid:2      task_flags:0x4208060 flags:0x00080000
>> [  212.638288] Workqueue: xe_page_fault_work_queue
>> xe_pagefault_queue_work [xe]
>> [  212.638323] Call Trace:
>> [  212.638324]  <TASK>
>> [  212.638325]  __schedule+0x4b0/0x990
>> [  212.638330]  schedule+0x22/0xd0
>> [  212.638331]  io_schedule+0x41/0x60
>> [  212.638333]  migration_entry_wait_on_locked+0x1d8/0x2d0
>> [  212.638336]  ? __pfx_wake_page_function+0x10/0x10
>> [  212.638339]  migration_entry_wait+0xd2/0xe0
>> [  212.638341]  hmm_vma_walk_pmd+0x7c9/0x8d0
>> [  212.638343]  walk_pgd_range+0x51d/0xa40
>> [  212.638345]  __walk_page_range+0x75/0x1e0
>> [  212.638347]  walk_page_range_mm+0x138/0x1f0
>> [  212.638349]  hmm_range_fault+0x59/0xa0
>> [  212.638351]  drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
>> [  212.638354]  drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
>> [  212.638355]  __xe_svm_handle_pagefault+0x259/0x900 [xe]
>> [  212.638375]  ? update_load_avg+0x7f/0x6c0
>> [  212.638377]  ? update_curr+0x13d/0x170
>> [  212.638379]  xe_svm_handle_pagefault+0x37/0x90 [xe]
>> [  212.638396]  xe_pagefault_queue_work+0x2da/0x3c0 [xe]
>> [  212.638420]  process_one_work+0x16e/0x2e0
>> [  212.638422]  worker_thread+0x284/0x410
>> [  212.638423]  ? __pfx_worker_thread+0x10/0x10
>> [  212.638425]  kthread+0xec/0x210
>> [  212.638427]  ? __pfx_kthread+0x10/0x10
>> [  212.638428]  ? __pfx_kthread+0x10/0x10
>> [  212.638430]  ret_from_fork+0xbd/0x100
>> [  212.638433]  ? __pfx_kthread+0x10/0x10
>> [  212.638434]  ret_from_fork_asm+0x1a/0x30
>> [  212.638436]  </TASK>
>>
> 
> Hi, Matt
> 
> Thanks for the report, two questions
> 
> 1. Are you using mm/mm-unstable? We've got some fixes in there (including fixes to remove_migration_pmd())
>    - Generally, a left-behind migration entry is a symptom of a failed migration that did not clean up
>      after itself.
> 2. The stack trace is from hmm_range_fault(), not something that this code touches.
> 
> The stack trace shows your code is seeing a migration entry and waiting on it.
> Can you please provide a reproducer for the issue, in the form of a test in hmm-tests.c?
> 
> Have you been able to bisect the issue?

Also could you please try with 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry")
reverted?

> 
> Balbir
> 
> 
>> Matt 
>>
>>>> Balbir, what's the status here?  It's been a month and this series
>>>> still has a "needs a new version" feeling to it.  If so, very soon
>>>> please.
>>>>
>>>
>>> I don't think this needs a new revision, I've been testing frequently
>>> at my end to see if I can catch any regressions. I have a patch update for
>>> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
>>> on top or I can send a new version of the patch. I was waiting
>>> on any feedback before I sent the patch out, but I'll do it now.
>>>
>>>> TODOs which I have noted are
>>>>
>>>> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
>>>
>>> This was a clarification on the HMM patch mentioned in the changelog
>>>
>>>> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
>>>
>>> That's a minor comment about not using a temporary declaration. I don't think we need it; let me know if you feel strongly
>>>
>>>> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
>>>
>>> I have a patch for this, which I posted. I can do an update and resend it if required (the one mentioned above)
>>>
>>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
>>>>
>>>
>>> I can't seem to open this
>>>
>>>> plus a general re-read of the
>>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
>>>> discussion.
>>>>
>>> That's the patch I have
>>>
>>> Thanks for following up
>>> Balbir
> 

Re: [v7 00/16] mm: support device-private THP
Posted by Matthew Brost 2 months, 2 weeks ago
On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
> On 11/20/25 13:50, Balbir Singh wrote:
> > On 11/20/25 13:40, Matthew Brost wrote:
> >> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
> >>> On 11/12/25 10:43, Andrew Morton wrote:
> >>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> >>>>
> >> [snip: quoted cover letter, report, and stack trace, unchanged from above]
> > 
> > Hi, Matt
> > 
> > Thanks for the report, two questions
> > 
> > 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())

remove_migration_pmd - This is a PTE migration entry.

> >    - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
> >      after itself.

I'm on drm-tip as I generally need the latest version of my driver
because of the speed we move at.

Yes, I agree it looks like somehow a migration PTE is not getting
properly removed.

I'm happy to cherry pick any patches that you think might be helpful
into my tree.

> > 2. The stack trace is from hmm_range_fault(), not something that this code touches.
> > 

Agree this is a symptom of the above issue.

> > The stack trace shows your code is seeing a migration entry and waiting on it.
> > Can you please provide a reproducer for the issue? In the form of a test in hmm-tests.c
> > 

That will be my plan. Right now I'm opening up my test suite, which runs
1000s of variations of SVM tests, and the test that hangs is not
consistent. Some of these are threaded or multi-process, so it might be a
timing issue that could be hard to reproduce in hmm-tests.c. I'll do my
best here.

> > Have you been able to bisect the issue?
> 

That is my next step along with isolating a test case.

> Also could you please try with 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry")
> reverted?
> 

I can try, but I highly doubt this is related. The hanging HMM code is in
the PTE walk step after this; also, I am not even enabling THP device pages
in my SVM code when reproducing this.

Matt

> 
Re: [v7 00/16] mm: support device-private THP
Posted by Balbir Singh 2 months, 2 weeks ago
On 11/20/25 14:15, Matthew Brost wrote:
> On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
>> On 11/20/25 13:50, Balbir Singh wrote:
>>> On 11/20/25 13:40, Matthew Brost wrote:
>>>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
>>>>> On 11/12/25 10:43, Andrew Morton wrote:
>>>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>>>>>>
>>>> [snip: quoted cover letter, report, and stack trace, unchanged from above]
>>>
>>> Hi, Matt
>>>
>>> Thanks for the report, two questions
>>>
>>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
> 
> remove_migration_pmd - This is a PTE migration entry.
> 

I don't have your symbols; I thought we were hitting the following condition in the walk:

	if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {

But it sounds like you are not; PMD/THP has not been enabled in this case.


>>>    - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
>>>      after itself.
> 
> I'm on drm-tip as I generally need the latest version of my driver
> because of the speed we move at.
> 
> Yes, I agree it looks like somehow a migration PTE is not getting
> properly removed.
> 
> I'm happy to cherry pick any patches that you think might be helpful
> into my tree.
> 

Could you try the mm/mm-new tree with the current xe driver?

In general, w.r.t failure, I would check for the following

> 1. Are the dst_pfns in migrate_vma_pages() set up correctly by the device driver?
> 2. Any failures in folio_migrate_mapping()?
> 3. In migrate_vma_finalize(), check to see if remove_migration_ptes() failed.
> 
> If (3) fails, that will explain the left-over migration entries.

> 
>> Also could you please try with 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry")
>> reverted?
>>
> 
> I can try but I highly doubt this is related. The hanging HMM code in is
> PTE walk step after this, also I am not even enabling THP device pages
> in my SVM code to reproduce this.
> 

Thanks, do regular hmm-tests pass for you in that setup/environment?

Balbir


Re: [v7 00/16] mm: support device-private THP
Posted by Matthew Brost 2 months, 2 weeks ago
On Thu, Nov 20, 2025 at 02:58:58PM +1100, Balbir Singh wrote:
> On 11/20/25 14:15, Matthew Brost wrote:
> > On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
> >> On 11/20/25 13:50, Balbir Singh wrote:
> >>> On 11/20/25 13:40, Matthew Brost wrote:
> >>>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
> >>>>> On 11/12/25 10:43, Andrew Morton wrote:
> >>>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> >>>>>>
> >>>> [snip: quoted cover letter, report, and stack trace, unchanged from above]
> >>>
> >>> Hi, Matt
> >>>
> >>> Thanks for the report, two questions
> >>>
> >>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
> > 
> > remove_migration_pmd - This is a PTE migration entry.
> > 
> 
> I don't have your symbols, I thought we were hitting, the following condition in the walk
> 
> 	if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {
> 
> But sounds like you are not, PMD/THP has not been enabled in this case
> 

No, migration_entry_wait rather than pmd_migration_entry_wait.

> 
> >>>    - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
> >>>      after itself.
> > 
> > I'm on drm-tip as I generally need the latest version of my driver
> > because of the speed we move at.
> > 
> > Yes, I agree it looks like somehow a migration PTE is not getting
> > properly removed.
> > 
> > I'm happy to cherry pick any patches that you think might be helpful
> > into my tree.
> > 
> 
> Could you try the mm/mm-new tree with the current xe driver?
>

Unfortunately, this is a tough one. We land a lot of patches in Xe/DRM,
so bringing the driver up to date with an MM branch is difficult, and
I’m not an expert at merging branches. It would be nice if, in the DRM
flow, we could merge patches from outside our subsystem into a
bleeding-edge kernel for the things we typically care about—but we’d
need a maintainer to sign up for that.

> In general, w.r.t failure, I would check for the following
> 
> 1. Are the dst_pfns in migrate_vma_pages() setup correctly by the device driver?
> 2. Any failures in folio_migrate_mapping()?
> 3. In migrate_vma_finalize() check to see if remove_migration_ptes() failed
> 
> If (3) fails that will explain the left over migration entries
> 

Good tips, but I think I got it via bisect.

Offending patch is:

'mm/migrate_device: handle partially mapped folios during collection'

The failing test case involves a remap-related issue. It’s a
parameterized test, so I honestly couldn’t tell you exactly what it’s
doing beyond the fact that it seems nonsensical but stresses remap. I
thought commit '66d81853fa3d selftests/mm/hmm-tests: partial unmap,
mremap and anon_write tests' would catch this, but it looks like I need
to make the remap HMM test cases a bit more robust—similar to my
driver-side tests. I can take an action item to follow up on this.

Good news, I can tell you how to fix this...

In 'mm/migrate_device: handle partially mapped folios during collection': 

109 +#if 0
110 +                       folio = page ? page_folio(page) : NULL;
111 +                       if (folio && folio_test_large(folio)) {
112 +                               int ret;
113 +
114 +                               pte_unmap_unlock(ptep, ptl);
115 +                               ret = migrate_vma_split_folio(folio,
116 +								  migrate->fault_page);
117 +
118 +                               if (ret) {
119 +                                       ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
120 +                                       goto next;
121 +                               }
122 +
123 +                               addr = start;
124 +                               goto again;
125 +                       }
126 +#endif

You can probably just delete this and use my patch below, but if you
want to try fixing it with a quick look: if migrate_vma_split_folio
fails, you probably need to collect a hole. On success, you likely want
to continue executing the remainder of the loop. I can try playing with
this tomorrow, but it’s late here.

I had privately sent you a version of this patch as a fix for Xe, and
this one seems to work:

[PATCH] mm/migrate: Split THP found in middle of PMD during page collection

The migrate layer is not coded to handle a THP found in the middle of a
PMD. This can occur if a user manipulates mappings with mremap(). If a
THP is found mid-PMD during page collection, split it.

Cc: Balbir Singh <balbirs@nvidia.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 mm/migrate_device.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index abd9f6850db6..9ffc025bad50 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -65,6 +65,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
        struct vm_area_struct *vma = walk->vma;
        struct mm_struct *mm = vma->vm_mm;
        unsigned long addr = start, unmapped = 0;
+       struct folio *split_folio = NULL;
        spinlock_t *ptl;
        pte_t *ptep;

@@ -107,10 +108,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
                }
        }

-       ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+       ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
        if (!ptep)
                goto again;
        arch_enter_lazy_mmu_mode();
+       ptep += (addr - start) / PAGE_SIZE;

        for (; addr < end; addr += PAGE_SIZE, ptep++) {
                struct dev_pagemap *pgmap;
@@ -209,6 +211,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
                        bool anon_exclusive;
                        pte_t swp_pte;

+                       if (folio_order(folio)) {
+                               split_folio = folio;
+                               goto split;
+                       }
+
                        flush_cache_page(vma, addr, pte_pfn(pte));
                        anon_exclusive = folio_test_anon(folio) &&
                                          PageAnonExclusive(page);
@@ -287,8 +294,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
        if (unmapped)
                flush_tlb_range(walk->vma, start, end);

+split:
        arch_leave_lazy_mmu_mode();
-       pte_unmap_unlock(ptep - 1, ptl);
+       pte_unmap_unlock(ptep - 1 + !!split_folio, ptl);
+
+       /*
+        * XXX: No clean way to support higher-order folios that don't match PMD
+        * boundaries for now — split them instead. Once mTHP support lands, add
+        * proper support for this case.
+        *
+        * The test, which exposed this as problematic, remapped (mremap()) a
+        * large folio to an unaligned address, resulting in the folio being
+        * found in the middle of the PTEs. The requested number of pages was
+        * less than the folio size. Likely to be handled gracefully by upper
+        * layers eventually, but not yet.
+        */
+       if (split_folio) {
+               int ret;
+
+               ret = split_folio(split_folio);
+               if (fault_folio != split_folio)
+                       folio_unlock(split_folio);
+               folio_put(split_folio);
+               if (ret)
+                       return migrate_vma_collect_skip(addr, end, walk);
+
+               split_folio = NULL;
+               goto again;
+       }

        return 0;
 }

If I apply the #if 0 change along with my patch above (plus one core
MM patch needed for Xe that adds a support function), Xe SVM fully
passes our test cases with both THP enabled and disabled.

> 
> Thanks, do regular hmm-tests pass for you in that setup/environment?
> 

Yes. As noted above, I need to make the remap HMM case a bit more
robust. I’ll try to get to this before the Thanksgiving break in the US
(next Thursday-Friday).

Matt

> Balbir
> 
> > Matt
> > 
> >>>
> >>> Balbir
> >>>
> >>>
> >>>> Matt 
> >>>>
> >>>>>> Balbir, what's the status here?  It's been a month and this series
> >>>>>> still has a "needs a new version" feeling to it.  If so, very soon
> >>>>>> please.
> >>>>>>
> >>>>>
> >>>>> I don't think this needs a new revision, I've been testing frequently
> >>>>> at my end to see if I can catch any regressions. I have a patch update for
> >>>>> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
> >>>>> on top or I can send a new version of the patch. I was waiting
> >>>>> on any feedback before I sent the patch out, but I'll do it now.
> >>>>>
> >>>>>> TODOs which I have noted are
> >>>>>>
> >>>>>> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
> >>>>>
> >>>>> This was a clarification on the HMM patch mentioned in the changelog
> >>>>>
> >>>>>> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
> >>>>>
> >>>>> That's a minor comment about not using a temporary declaration. I don't think we need it; let me know if you feel strongly
> >>>>>
> >>>>>> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
> >>>>>
> >>>>> I have a patch for this, which I posted; I can update and resend it if required (the one mentioned above)
> >>>>>
> >>>>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
> >>>>>>
> >>>>>
> >>>>> I can't seem to open this
> >>>>>
> >>>>>> plus a general re-read of the
> >>>>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
> >>>>>> discussion.
> >>>>>>
> >>>>> That's the patch I have
> >>>>>
> >>>>> Thanks for following up
> >>>>> Balbir
> >>>
> >>
> 
Re: [v7 00/16] mm: support device-private THP
Posted by Balbir Singh 2 months, 2 weeks ago
On 11/20/25 16:53, Matthew Brost wrote:
> On Thu, Nov 20, 2025 at 02:58:58PM +1100, Balbir Singh wrote:
>> On 11/20/25 14:15, Matthew Brost wrote:
>>> On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
>>>> On 11/20/25 13:50, Balbir Singh wrote:
>>>>> On 11/20/25 13:40, Matthew Brost wrote:
>>>>>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
>>>>>>> On 11/12/25 10:43, Andrew Morton wrote:
>>>>>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>>>>>>>>
>>>>>>>>>>>> This patch series introduces support for Transparent Huge Page
>>>>>>>>>>>> (THP) migration in zone device-private memory. The implementation enables
>>>>>>>>>>>> efficient migration of large folios between system memory and
>>>>>>>>>>>> device-private memory
>>>>>>>>>>>
>>>>>>>>>>> Lots of chatter for the v6 series, but none for v7.  I hope that's a
>>>>>>>>>>> good sign.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I hope so too, I've tried to address the comments in v6.
>>>>>>>>>>
>>>>>>>>>
> >>>>>>>>> Circling back to this series, we will integrate and test this version.
>>>>>>>>
>>>>>>>> How'd it go?
>>>>>>>>
>>>>>>
>>>>>> My apologies for the delay—I got distracted by other tasks in Xe (my
>>>>>> driver) and was out for a bit. Unfortunately, this series breaks
>>>>>> something in the existing core MM code for the Xe SVM implementation. I
>>>>>> have an extensive test case that hammers on SVM, which fully passes
>>>>>> prior to applying this series, but fails randomly with the series
>>>>>> applied (to drm-tip-rc6) due to the below kernel lockup.
>>>>>>
>>>>>> I've tried to trace where the migration PTE gets installed but not
>>>>>> removed or isolate a test case which causes this failure but no luck so
>>>>>> far. I'll keep digging as I have time.
>>>>>>
>>>>>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
>>>>>> the same issue as above eventually occurs), but I do need two additional
>>>>>> core MM patches—one is new code required for Xe, and the other could be
> >>>>>> considered a bug fix. Those patches can be included when Xe merges SVM THP
> >>>>>> support, but at a minimum we must not break Xe SVM before this series merges.
>>>>>>
>>>>>> Stack trace:
>>>>>>
>>>>>> INFO: task kworker/u65:2:1642 blocked for more than 30
>>>>>> seconds.
>>>>>> [  212.624286]       Tainted: G S      W           6.18.0-rc6-xe+ #1719
>>>>>> [  212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [  212.638285] task:kworker/u65:2   state:D stack:0     pid:1642
>>>>>> tgid:1642  ppid:2      task_flags:0x4208060 flags:0x00080000
>>>>>> [  212.638288] Workqueue: xe_page_fault_work_queue
>>>>>> xe_pagefault_queue_work [xe]
>>>>>> [  212.638323] Call Trace:
>>>>>> [  212.638324]  <TASK>
>>>>>> [  212.638325]  __schedule+0x4b0/0x990
>>>>>> [  212.638330]  schedule+0x22/0xd0
>>>>>> [  212.638331]  io_schedule+0x41/0x60
>>>>>> [  212.638333]  migration_entry_wait_on_locked+0x1d8/0x2d0
>>>>>> [  212.638336]  ? __pfx_wake_page_function+0x10/0x10
>>>>>> [  212.638339]  migration_entry_wait+0xd2/0xe0
>>>>>> [  212.638341]  hmm_vma_walk_pmd+0x7c9/0x8d0
>>>>>> [  212.638343]  walk_pgd_range+0x51d/0xa40
>>>>>> [  212.638345]  __walk_page_range+0x75/0x1e0
>>>>>> [  212.638347]  walk_page_range_mm+0x138/0x1f0
>>>>>> [  212.638349]  hmm_range_fault+0x59/0xa0
>>>>>> [  212.638351]  drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
>>>>>> [  212.638354]  drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
>>>>>> [  212.638355]  __xe_svm_handle_pagefault+0x259/0x900 [xe]
>>>>>> [  212.638375]  ? update_load_avg+0x7f/0x6c0
>>>>>> [  212.638377]  ? update_curr+0x13d/0x170
>>>>>> [  212.638379]  xe_svm_handle_pagefault+0x37/0x90 [xe]
>>>>>> [  212.638396]  xe_pagefault_queue_work+0x2da/0x3c0 [xe]
>>>>>> [  212.638420]  process_one_work+0x16e/0x2e0
>>>>>> [  212.638422]  worker_thread+0x284/0x410
>>>>>> [  212.638423]  ? __pfx_worker_thread+0x10/0x10
>>>>>> [  212.638425]  kthread+0xec/0x210
>>>>>> [  212.638427]  ? __pfx_kthread+0x10/0x10
>>>>>> [  212.638428]  ? __pfx_kthread+0x10/0x10
>>>>>> [  212.638430]  ret_from_fork+0xbd/0x100
>>>>>> [  212.638433]  ? __pfx_kthread+0x10/0x10
>>>>>> [  212.638434]  ret_from_fork_asm+0x1a/0x30
>>>>>> [  212.638436]  </TASK>
>>>>>>
>>>>>
>>>>> Hi, Matt
>>>>>
>>>>> Thanks for the report, two questions
>>>>>
>>>>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
>>>
>>> remove_migration_pmd - This is a PTE migration entry.
>>>
>>
>> I don't have your symbols; I thought we were hitting the following condition in the walk
>>
>> 	if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {
>>
>> But it sounds like you are not; PMD/THP has not been enabled in this case
>>
> 
> No, migration_entry_wait rather than pmd_migration_entry_wait.
> 
>>
>>>>>    - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
>>>>>      after itself.
>>>
>>> I'm on drm-tip as I generally need the latest version of my driver
>>> because of the speed we move at.
>>>
>>> Yes, I agree it looks like somehow a migration PTE is not getting
>>> properly removed.
>>>
>>> I'm happy to cherry pick any patches that you think might be helpful
>>> into my tree.
>>>
>>
>> Could you try the mm/mm-new tree with the current xe driver?
>>
> 
> Unfortunately, this is a tough one. We land a lot of patches in Xe/DRM,
> so bringing the driver up to date with an MM branch is difficult, and
> I’m not an expert at merging branches. It would be nice if, in the DRM
> flow, we could merge patches from outside our subsystem into a
> bleeding-edge kernel for the things we typically care about—but we’d
> need a maintainer to sign up for that.
> 
>> In general, w.r.t failure, I would check for the following
>>
>> 1. Are the dst_pfns in migrate_vma_pages() setup correctly by the device driver?
>> 2. Any failures in folio_migrate_mapping()?
>> 3. In migrate_vma_finalize() check to see if remove_migration_ptes() failed
>>
>> If (3) fails that will explain the left over migration entries
>>
> 
> Good tips, but I think I got it via bisect.
> 
> Offending patch is:
> 
> 'mm/migrate_device: handle partially mapped folios during collection'
> 
> The failing test case involves some remap-related issue. It’s a
> parameterized test, so I honestly couldn’t tell you exactly what it’s
> doing beyond the fact that it seems nonsensical but stresses remap. I
> thought commit '66d81853fa3d selftests/mm/hmm-tests: partial unmap,
> mremap and anon_write tests' would catch this, but it looks like I need
> to make the remap HMM test cases a bit more robust—similar to my
> driver-side tests. I can take an action item to follow up on this.
> 
> Good news, I can tell you how to fix this...
> 
> In 'mm/migrate_device: handle partially mapped folios during collection': 
> 
> 109 +#if 0
> 110 +                       folio = page ? page_folio(page) : NULL;
> 111 +                       if (folio && folio_test_large(folio)) {
> 112 +                               int ret;
> 113 +
> 114 +                               pte_unmap_unlock(ptep, ptl);
> 115 +                               ret = migrate_vma_split_folio(folio,
> 116 +								  migrate->fault_page);
> 117 +
> 118 +                               if (ret) {
> 119 +                                       ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> 120 +                                       goto next;
> 121 +                               }
> 122 +
> 123 +                               addr = start;
> 124 +                               goto again;
> 125 +                       }
> 126 +#endif
> 
> You can probably just delete this and use my patch below, but if you
> want to try fixing it with a quick look: if migrate_vma_split_folio
> fails, you probably need to collect a hole. On success, you likely want
> to continue executing the remainder of the loop. I can try playing with
> this tomorrow, but it’s late here.
> 
> I had privately sent you a version of this patch as a fix for Xe, and
> this one seems to work:
> 
> [PATCH] mm/migrate: Split THP found in middle of PMD during page collection
> 
> The migrate layer is not coded to handle a THP found in the middle of a
> PMD. This can occur if a user manipulates mappings with mremap(). If a
> THP is found mid-PMD during page collection, split it.
> 
> Cc: Balbir Singh <balbirs@nvidia.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  mm/migrate_device.c | 37 +++++++++++++++++++++++++++++++++++--
>  1 file changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index abd9f6850db6..9ffc025bad50 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -65,6 +65,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>         struct vm_area_struct *vma = walk->vma;
>         struct mm_struct *mm = vma->vm_mm;
>         unsigned long addr = start, unmapped = 0;
> +       struct folio *split_folio = NULL;
>         spinlock_t *ptl;
>         pte_t *ptep;
> 
> @@ -107,10 +108,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>                 }
>         }
> 
> -       ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +       ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
>         if (!ptep)
>                 goto again;
>         arch_enter_lazy_mmu_mode();
> +       ptep += (addr - start) / PAGE_SIZE;
> 
>         for (; addr < end; addr += PAGE_SIZE, ptep++) {
>                 struct dev_pagemap *pgmap;
> @@ -209,6 +211,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>                         bool anon_exclusive;
>                         pte_t swp_pte;
> 
> +                       if (folio_order(folio)) {
> +                               split_folio = folio;
> +                               goto split;
> +                       }
> +
>                         flush_cache_page(vma, addr, pte_pfn(pte));
>                         anon_exclusive = folio_test_anon(folio) &&
>                                           PageAnonExclusive(page);
> @@ -287,8 +294,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>         if (unmapped)
>                 flush_tlb_range(walk->vma, start, end);
> 
> +split:
>         arch_leave_lazy_mmu_mode();
> -       pte_unmap_unlock(ptep - 1, ptl);
> +       pte_unmap_unlock(ptep - 1 + !!split_folio, ptl);
> +
> +       /*
> +        * XXX: No clean way to support higher-order folios that don't match PMD
> +        * boundaries for now — split them instead. Once mTHP support lands, add
> +        * proper support for this case.
> +        *
> +        * The test, which exposed this as problematic, remapped (memremap) a
> +        * large folio to an unaligned address, resulting in the folio being
> +        * found in the middle of the PTEs. The requested number of pages was
> +        * less than the folio size. Likely to be handled gracefully by upper
> +        * layers eventually, but not yet.
> +        */
> +       if (split_folio) {
> +               int ret;
> +
> +               ret = split_folio(split_folio);
> +               if (fault_folio != split_folio)
> +                       folio_unlock(split_folio);
> +               folio_put(split_folio);
> +               if (ret)
> +                       return migrate_vma_collect_skip(addr, end, walk);
> +
> +               split_folio = NULL;
> +               goto again;
> +       }
> 
>         return 0;
>  }
> 
> If I apply the #if 0 change along with my patch above (plus one core
> MM patch needed for Xe that adds a support function), Xe SVM fully
> passes our test cases with both THP enabled and disabled.
> 
Excellent work! Since you found this, do you mind sending the fix to Andrew as a fixup
to the original patch? Since I don't have the test case, I have no way of validating that
the change, or any change on top of it, would continue to work.

FYI: the original code does something similar; I might be missing the
migrate_vma_collect_skip() bits.

Thanks!
Balbir


Re: [v7 00/16] mm: support device-private THP
Posted by Matthew Brost 2 months, 2 weeks ago
On Thu, Nov 20, 2025 at 05:03:36PM +1100, Balbir Singh wrote:
> On 11/20/25 16:53, Matthew Brost wrote:
> > On Thu, Nov 20, 2025 at 02:58:58PM +1100, Balbir Singh wrote:
> >> On 11/20/25 14:15, Matthew Brost wrote:
> >>> On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
> >>>> On 11/20/25 13:50, Balbir Singh wrote:
> >>>>> On 11/20/25 13:40, Matthew Brost wrote:
> >>>>>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
> >>>>>>> On 11/12/25 10:43, Andrew Morton wrote:
> >>>>>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> >>>>>>>>
> >>>>>>>>>>>> This patch series introduces support for Transparent Huge Page
> >>>>>>>>>>>> (THP) migration in zone device-private memory. The implementation enables
> >>>>>>>>>>>> efficient migration of large folios between system memory and
> >>>>>>>>>>>> device-private memory
> >>>>>>>>>>>
> >>>>>>>>>>> Lots of chatter for the v6 series, but none for v7.  I hope that's a
> >>>>>>>>>>> good sign.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I hope so too, I've tried to address the comments in v6.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Circling back to this series, we will integrate and test this version.
> >>>>>>>>
> >>>>>>>> How'd it go?
> >>>>>>>>
> >>>>>>
> >>>>>> My apologies for the delay—I got distracted by other tasks in Xe (my
> >>>>>> driver) and was out for a bit. Unfortunately, this series breaks
> >>>>>> something in the existing core MM code for the Xe SVM implementation. I
> >>>>>> have an extensive test case that hammers on SVM, which fully passes
> >>>>>> prior to applying this series, but fails randomly with the series
> >>>>>> applied (to drm-tip-rc6) due to the below kernel lockup.
> >>>>>>
> >>>>>> I've tried to trace where the migration PTE gets installed but not
> >>>>>> removed or isolate a test case which causes this failure but no luck so
> >>>>>> far. I'll keep digging as I have time.
> >>>>>>
> >>>>>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
> >>>>>> the same issue as above eventually occurs), but I do need two additional
> >>>>>> core MM patches—one is new code required for Xe, and the other could be
> >>>>>> considered a bug fix. Those patches can be included when Xe merges SVM THP
> >>>>>> support, but at a minimum we must not break Xe SVM before this series merges.
> >>>>>>
> >>>>>> Stack trace:
> >>>>>>
> >>>>>> INFO: task kworker/u65:2:1642 blocked for more than 30
> >>>>>> seconds.
> >>>>>> [  212.624286]       Tainted: G S      W           6.18.0-rc6-xe+ #1719
> >>>>>> [  212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >>>>>> disables this message.
> >>>>>> [  212.638285] task:kworker/u65:2   state:D stack:0     pid:1642
> >>>>>> tgid:1642  ppid:2      task_flags:0x4208060 flags:0x00080000
> >>>>>> [  212.638288] Workqueue: xe_page_fault_work_queue
> >>>>>> xe_pagefault_queue_work [xe]
> >>>>>> [  212.638323] Call Trace:
> >>>>>> [  212.638324]  <TASK>
> >>>>>> [  212.638325]  __schedule+0x4b0/0x990
> >>>>>> [  212.638330]  schedule+0x22/0xd0
> >>>>>> [  212.638331]  io_schedule+0x41/0x60
> >>>>>> [  212.638333]  migration_entry_wait_on_locked+0x1d8/0x2d0
> >>>>>> [  212.638336]  ? __pfx_wake_page_function+0x10/0x10
> >>>>>> [  212.638339]  migration_entry_wait+0xd2/0xe0
> >>>>>> [  212.638341]  hmm_vma_walk_pmd+0x7c9/0x8d0
> >>>>>> [  212.638343]  walk_pgd_range+0x51d/0xa40
> >>>>>> [  212.638345]  __walk_page_range+0x75/0x1e0
> >>>>>> [  212.638347]  walk_page_range_mm+0x138/0x1f0
> >>>>>> [  212.638349]  hmm_range_fault+0x59/0xa0
> >>>>>> [  212.638351]  drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
> >>>>>> [  212.638354]  drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
> >>>>>> [  212.638355]  __xe_svm_handle_pagefault+0x259/0x900 [xe]
> >>>>>> [  212.638375]  ? update_load_avg+0x7f/0x6c0
> >>>>>> [  212.638377]  ? update_curr+0x13d/0x170
> >>>>>> [  212.638379]  xe_svm_handle_pagefault+0x37/0x90 [xe]
> >>>>>> [  212.638396]  xe_pagefault_queue_work+0x2da/0x3c0 [xe]
> >>>>>> [  212.638420]  process_one_work+0x16e/0x2e0
> >>>>>> [  212.638422]  worker_thread+0x284/0x410
> >>>>>> [  212.638423]  ? __pfx_worker_thread+0x10/0x10
> >>>>>> [  212.638425]  kthread+0xec/0x210
> >>>>>> [  212.638427]  ? __pfx_kthread+0x10/0x10
> >>>>>> [  212.638428]  ? __pfx_kthread+0x10/0x10
> >>>>>> [  212.638430]  ret_from_fork+0xbd/0x100
> >>>>>> [  212.638433]  ? __pfx_kthread+0x10/0x10
> >>>>>> [  212.638434]  ret_from_fork_asm+0x1a/0x30
> >>>>>> [  212.638436]  </TASK>
> >>>>>>
> >>>>>
> >>>>> Hi, Matt
> >>>>>
> >>>>> Thanks for the report, two questions
> >>>>>
> >>>>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
> >>>
> >>> remove_migration_pmd - This is a PTE migration entry.
> >>>
> >>
> >> I don't have your symbols; I thought we were hitting the following condition in the walk
> >>
> >> 	if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {
> >>
> >> But it sounds like you are not; PMD/THP has not been enabled in this case
> >>
> > 
> > No, migration_entry_wait rather than pmd_migration_entry_wait.
> > 
> >>
> >>>>>    - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
> >>>>>      after itself.
> >>>
> >>> I'm on drm-tip as I generally need the latest version of my driver
> >>> because of the speed we move at.
> >>>
> >>> Yes, I agree it looks like somehow a migration PTE is not getting
> >>> properly removed.
> >>>
> >>> I'm happy to cherry pick any patches that you think might be helpful
> >>> into my tree.
> >>>
> >>
> >> Could you try the mm/mm-new tree with the current xe driver?
> >>
> > 
> > Unfortunately, this is a tough one. We land a lot of patches in Xe/DRM,
> > so bringing the driver up to date with an MM branch is difficult, and
> > I’m not an expert at merging branches. It would be nice if, in the DRM
> > flow, we could merge patches from outside our subsystem into a
> > bleeding-edge kernel for the things we typically care about—but we’d
> > need a maintainer to sign up for that.
> > 
> >> In general, w.r.t failure, I would check for the following
> >>
> >> 1. Are the dst_pfns in migrate_vma_pages() setup correctly by the device driver?
> >> 2. Any failures in folio_migrate_mapping()?
> >> 3. In migrate_vma_finalize() check to see if remove_migration_ptes() failed
> >>
> >> If (3) fails that will explain the left over migration entries
> >>
> > 
> > Good tips, but I think I got it via bisect.
> > 
> > Offending patch is:
> > 
> > 'mm/migrate_device: handle partially mapped folios during collection'
> > 
> > The failing test case involves some remap-related issue. It’s a
> > parameterized test, so I honestly couldn’t tell you exactly what it’s
> > doing beyond the fact that it seems nonsensical but stresses remap. I
> > thought commit '66d81853fa3d selftests/mm/hmm-tests: partial unmap,
> > mremap and anon_write tests' would catch this, but it looks like I need
> > to make the remap HMM test cases a bit more robust—similar to my
> > driver-side tests. I can take an action item to follow up on this.
> > 
> > Good news, I can tell you how to fix this...
> > 
> > In 'mm/migrate_device: handle partially mapped folios during collection': 
> > 
> > 109 +#if 0
> > 110 +                       folio = page ? page_folio(page) : NULL;
> > 111 +                       if (folio && folio_test_large(folio)) {
> > 112 +                               int ret;
> > 113 +
> > 114 +                               pte_unmap_unlock(ptep, ptl);
> > 115 +                               ret = migrate_vma_split_folio(folio,
> > 116 +								  migrate->fault_page);
> > 117 +
> > 118 +                               if (ret) {
> > 119 +                                       ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > 120 +                                       goto next;
> > 121 +                               }
> > 122 +
> > 123 +                               addr = start;
> > 124 +                               goto again;
> > 125 +                       }
> > 126 +#endif
> > 
> > You can probably just delete this and use my patch below, but if you
> > want to try fixing it with a quick look: if migrate_vma_split_folio
> > fails, you probably need to collect a hole. On success, you likely want
> > to continue executing the remainder of the loop. I can try playing with
> > this tomorrow, but it’s late here.
> > 
> > I had privately sent you a version of this patch as a fix for Xe, and
> > this one seems to work:
> > 
> > [PATCH] mm/migrate: Split THP found in middle of PMD during page collection
> > 
> > The migrate layer is not coded to handle a THP found in the middle of a
> > PMD. This can occur if a user manipulates mappings with mremap(). If a
> > THP is found mid-PMD during page collection, split it.
> > 
> > Cc: Balbir Singh <balbirs@nvidia.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  mm/migrate_device.c | 37 +++++++++++++++++++++++++++++++++++--
> >  1 file changed, 35 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index abd9f6850db6..9ffc025bad50 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -65,6 +65,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >         struct vm_area_struct *vma = walk->vma;
> >         struct mm_struct *mm = vma->vm_mm;
> >         unsigned long addr = start, unmapped = 0;
> > +       struct folio *split_folio = NULL;
> >         spinlock_t *ptl;
> >         pte_t *ptep;
> > 
> > @@ -107,10 +108,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >                 }
> >         }
> > 
> > -       ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > +       ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
> >         if (!ptep)
> >                 goto again;
> >         arch_enter_lazy_mmu_mode();
> > +       ptep += (addr - start) / PAGE_SIZE;
> > 
> >         for (; addr < end; addr += PAGE_SIZE, ptep++) {
> >                 struct dev_pagemap *pgmap;
> > @@ -209,6 +211,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >                         bool anon_exclusive;
> >                         pte_t swp_pte;
> > 
> > +                       if (folio_order(folio)) {
> > +                               split_folio = folio;
> > +                               goto split;
> > +                       }
> > +
> >                         flush_cache_page(vma, addr, pte_pfn(pte));
> >                         anon_exclusive = folio_test_anon(folio) &&
> >                                           PageAnonExclusive(page);
> > @@ -287,8 +294,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >         if (unmapped)
> >                 flush_tlb_range(walk->vma, start, end);
> > 
> > +split:
> >         arch_leave_lazy_mmu_mode();
> > -       pte_unmap_unlock(ptep - 1, ptl);
> > +       pte_unmap_unlock(ptep - 1 + !!split_folio, ptl);
> > +
> > +       /*
> > +        * XXX: No clean way to support higher-order folios that don't match PMD
> > +        * boundaries for now — split them instead. Once mTHP support lands, add
> > +        * proper support for this case.
> > +        *
> > +        * The test, which exposed this as problematic, remapped (memremap) a
> > +        * large folio to an unaligned address, resulting in the folio being
> > +        * found in the middle of the PTEs. The requested number of pages was
> > +        * less than the folio size. Likely to be handled gracefully by upper
> > +        * layers eventually, but not yet.
> > +        */
> > +       if (split_folio) {
> > +               int ret;
> > +
> > +               ret = split_folio(split_folio);
> > +               if (fault_folio != split_folio)
> > +                       folio_unlock(split_folio);
> > +               folio_put(split_folio);
> > +               if (ret)
> > +                       return migrate_vma_collect_skip(addr, end, walk);
> > +
> > +               split_folio = NULL;
> > +               goto again;
> > +       }
> > 
> >         return 0;
> >  }
> > 
> > If I apply the #if 0 change along with my patch above (plus one core
> > MM patch needed for Xe that adds a support function), Xe SVM fully
> > passes our test cases with both THP enabled and disabled.
> > 
> Excellent work! Since you found this, do you mind sending the fix to Andrew as a fixup

Done. Here is a dri-devel Patchwork link [1] to the patch.

Matt

[1] https://patchwork.freedesktop.org/series/157859/

> to the original patch? Since I don't have the test case, I have no way of validating that
> the change, or any change on top of it, would continue to work.
> 
> FYI: The original code does something similar, I might be missing the 
> migrate_vma_collect_skip() bits.
> 
> Thanks!
> Balbir
> 
> 
Re: [v7 00/16] mm: support device-private THP
Posted by Balbir Singh 2 months, 2 weeks ago
[...]
>>>>>>>
>>>>>>> How'd it go?
>>>>>>>
>>>>>
>>>>> My apologies for the delay—I got distracted by other tasks in Xe (my
>>>>> driver) and was out for a bit. Unfortunately, this series breaks
>>>>> something in the existing core MM code for the Xe SVM implementation. I
>>>>> have an extensive test case that hammers on SVM, which fully passes
>>>>> prior to applying this series, but fails randomly with the series
>>>>> applied (to drm-tip-rc6) due to the below kernel lockup.
>>>>>
>>>>> I've tried to trace where the migration PTE gets installed but not
>>>>> removed or isolate a test case which causes this failure but no luck so
>>>>> far. I'll keep digging as I have time.
>>>>>
>>>>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
>>>>> the same issue as above eventually occurs), but I do need two additional
>>>>> core MM patches—one is new code required for Xe, and the other could be
>>>>> considered a bug fix. Those patches can be included when Xe merges SVM THP
>>>>> support, but at a minimum we must not break Xe SVM before this series merges.
>>>>>
>>>>> Stack trace:
>>>>>
>>>>> INFO: task kworker/u65:2:1642 blocked for more than 30
>>>>> seconds.
>>>>> [  212.624286]       Tainted: G S      W           6.18.0-rc6-xe+ #1719
>>>>> [  212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [  212.638285] task:kworker/u65:2   state:D stack:0     pid:1642
>>>>> tgid:1642  ppid:2      task_flags:0x4208060 flags:0x00080000
>>>>> [  212.638288] Workqueue: xe_page_fault_work_queue
>>>>> xe_pagefault_queue_work [xe]
>>>>> [  212.638323] Call Trace:
>>>>> [  212.638324]  <TASK>
>>>>> [  212.638325]  __schedule+0x4b0/0x990
>>>>> [  212.638330]  schedule+0x22/0xd0
>>>>> [  212.638331]  io_schedule+0x41/0x60
>>>>> [  212.638333]  migration_entry_wait_on_locked+0x1d8/0x2d0
>>>>> [  212.638336]  ? __pfx_wake_page_function+0x10/0x10
>>>>> [  212.638339]  migration_entry_wait+0xd2/0xe0
>>>>> [  212.638341]  hmm_vma_walk_pmd+0x7c9/0x8d0
>>>>> [  212.638343]  walk_pgd_range+0x51d/0xa40
>>>>> [  212.638345]  __walk_page_range+0x75/0x1e0
>>>>> [  212.638347]  walk_page_range_mm+0x138/0x1f0
>>>>> [  212.638349]  hmm_range_fault+0x59/0xa0
>>>>> [  212.638351]  drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
>>>>> [  212.638354]  drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
>>>>> [  212.638355]  __xe_svm_handle_pagefault+0x259/0x900 [xe]
>>>>> [  212.638375]  ? update_load_avg+0x7f/0x6c0
>>>>> [  212.638377]  ? update_curr+0x13d/0x170
>>>>> [  212.638379]  xe_svm_handle_pagefault+0x37/0x90 [xe]
>>>>> [  212.638396]  xe_pagefault_queue_work+0x2da/0x3c0 [xe]
>>>>> [  212.638420]  process_one_work+0x16e/0x2e0
>>>>> [  212.638422]  worker_thread+0x284/0x410
>>>>> [  212.638423]  ? __pfx_worker_thread+0x10/0x10
>>>>> [  212.638425]  kthread+0xec/0x210
>>>>> [  212.638427]  ? __pfx_kthread+0x10/0x10
>>>>> [  212.638428]  ? __pfx_kthread+0x10/0x10
>>>>> [  212.638430]  ret_from_fork+0xbd/0x100
>>>>> [  212.638433]  ? __pfx_kthread+0x10/0x10
>>>>> [  212.638434]  ret_from_fork_asm+0x1a/0x30
>>>>> [  212.638436]  </TASK>
>>>>>
>>>>
>>>> Hi, Matt
>>>>
>>>> Thanks for the report, two questions
>>>>
>>>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
>>
>> remove_migration_pmd - This is a PTE migration entry.
>>
> 
> I don't have your symbols; I thought we were hitting the following condition in the walk
> 
> 	if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {
> 
> But it sounds like you are not; PMD/THP has not been enabled in this case
> 
> 
>>>>    - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
>>>>      after itself.
>>
>> I'm on drm-tip as I generally need the latest version of my driver
>> because of the speed we move at.
>>
>> Yes, I agree it looks like somehow a migration PTE is not getting
>> properly removed.
>>
>> I'm happy to cherry pick any patches that you think might be helpful
>> into my tree.
>>
> 
> Could you try the mm/mm-new tree with the current xe driver?
> 
> In general, w.r.t failure, I would check for the following
> 
> 1. Are the dst_pfns in migrate_vma_pages() setup correctly by the device driver?
> 2. Any failures in folio_migrate_mapping()?
> 3. In migrate_vma_finalize() check to see if remove_migration_ptes() failed
> 
> If (3) fails that will explain the left over migration entries
> 
Just thought of two other places to look at

1. split_folio(), do you have a large entry on the CPU side that needs to be split
   prior to migration?
2. Any partial munmap() code, because that can cause a pmd split, but the folio
   is not fully split yet

I also have a patch for debugging migrations via trace-points (to be updated)
https://patchew.org/linux/20251016054619.3174997-1-balbirs@nvidia.com/

May be it'll help you figure out if something failed to migrate.

>>>> 2. The stack trace is from hmm_range_fault(), not something that this code touches.
>>>>
>>
>> Agree this is a symptom of the above issue.
>>
>>>> The stack trace shows your code is seeing a migration entry and waiting on it.
>>>> Can you please provide a reproducer for the issue? In the form of a test in hmm-tests.c
>>>>
>>
>> That will be my plan. Right now I'm opening my test up which runs 1000s
>> of variations of SVM tests and the test that hangs is not consistent.
>> Some of these are threaded or multi-process so it might possibly be a
>> timing issue which could be hard to reproduce in hmm-tests.c. I'll do my
>> best here.
>>
>>>> Have you been able to bisect the issue?
>>>
>>
>> That is my next step along with isolating a test case.
>>
>>> Also could you please try with 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry")
>>> reverted?
>>>
>>
>> I can try, but I highly doubt this is related. The hanging HMM code is
>> in the PTE walk step after this; also, I am not even enabling THP device
>> pages in my SVM code when reproducing this.
>>
> 
> Thanks, do regular hmm-tests pass for you in that setup/environment?
> 
> Balbir
> 

[..]

Balbir
Re: [v7 00/16] mm: support device-private THP
Posted by Andrew Morton 2 months, 3 weeks ago
On Wed, 12 Nov 2025 10:52:43 +1100 Balbir Singh <balbirs@nvidia.com> wrote:

> > https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
> > 
> 
> I can't seem to open this

https://lore.kernel.org/all/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@nvidia.com/T/#u

> > plus a general re-read of the
> > mm-migrate_device-add-thp-splitting-during-migration.patch review
> > discussion.
> > 
> That's the patch I have
> 

What I meant was - please re-read the review against that patch (it was
fairly long) and double-check that everything was addressed.
Re: [v7 00/16] mm: support device-private THP
Posted by Balbir Singh 2 months, 3 weeks ago
On 11/12/25 11:24, Andrew Morton wrote:
> On Wed, 12 Nov 2025 10:52:43 +1100 Balbir Singh <balbirs@nvidia.com> wrote:
> 
>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
>>>
>>
>> I can't seem to open this
> 
> https://lore.kernel.org/all/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@nvidia.com/T/#u
> 
>>> plus a general re-read of the
>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
>>> discussion.
>>>
>> That's the patch I have
>>
> 
> What I meant was - please re-read the review against that patch (it was
> fairly long) and double-check that everything was addressed.
> 

The last discussion ended at https://lore.kernel.org/all/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com/T/#m2368cc4ed7d88b2ae99157a649a9a5877585f006

In summary, moving the cluster_info and deferred_list handling
into another function led to duplication, so I've avoided
going down that route. I'll send the new patch out shortly.

Balbir
Re: [v7 00/16] mm: support device-private THP
Posted by Balbir Singh 3 months, 3 weeks ago
On 10/9/25 21:33, Matthew Brost wrote:
> On Thu, Oct 09, 2025 at 02:26:30PM +1100, Balbir Singh wrote:
>> On 10/9/25 14:17, Andrew Morton wrote:
>>> On Wed,  1 Oct 2025 16:56:51 +1000 Balbir Singh <balbirs@nvidia.com> wrote:
>>>
>>>> This patch series introduces support for Transparent Huge Page
>>>> (THP) migration in zone device-private memory. The implementation enables
>>>> efficient migration of large folios between system memory and
>>>> device-private memory
>>>
>>> Lots of chatter for the v6 series, but none for v7.  I hope that's a
>>> good sign.
>>>
>>
>> I hope so too, I've tried to address the comments in v6.
>>
> 
> Circling back to this series, we will integrate and test this version.
> 

Look forward to your feedback

>>>>
>>>> HMM support for large folios; the patches are already posted and in
>>>> mm-unstable.
>>>
>>> Not any more.  Which series was this?
>>
>> Not a series, but a patch
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?id=10b9feee2d0dc81c44f7a9e69e7a894e33f8c4a1
> 
> I think this [1] means this patch is in Linus's tree?
> 
> Matt
> 
> [1] https://github.com/torvalds/linux/commit/10b9feee2d0dc81c44f7a9e69e7a894e33f8c4a1 
> 

Thanks!
Balbir