[v10] Migrate on fault for device pages

[PATCH v10 0/5] Migrate on fault for device pages

Posted by mpenttil@redhat.com 1 month, 1 week ago

From: Mika Penttilä <mpenttil@redhat.com>

Currently, device page faulting and migration work suboptimally
when you want to do both fault handling and migration at once.

Migrating non-present pages (or pages mapped with incorrect
permissions, e.g. COW) to the GPU requires doing either of the
following sequences:

1. hmm_range_fault() - fault in non-present pages with correct permissions, etc.
2. migrate_vma_*() - migrate the pages

Or:

1. migrate_vma_*() - migrate present pages
2. If non-present pages detected by migrate_vma_*():
   a) call hmm_range_fault() to fault pages in
   b) call migrate_vma_*() again to migrate now present pages

The problem with the first sequence is that you always have to do two
page walks, even though most of the time the pages are present or zero
page mappings, so the common case takes a performance hit.

The second sequence is better for the common case, but far worse if
pages aren't present, because now you have to walk the page tables three
times (once to find that the page is not present, once so that
hmm_range_fault() can find a non-present page to fault in, and once
again to set up the migration). It is also tricky to code correctly. One
page table walk can cost over 1000 CPU cycles on x86-64, which is a
significant hit.

We should be able to walk the page table once, faulting
pages in as required and replacing them with migration entries if
requested.
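
The walk counts above can be put side by side with a small userspace
toy model. This is only a sketch of the arithmetic in this cover letter,
not kernel code: the function names and the bool array standing in for
"page is present" are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: counts page table walks per strategy, no real MM code. */

/* Returns true if at least one page in the modelled range is non-present. */
static bool any_not_present(const bool *present, size_t npages)
{
	assert(present != NULL);
	for (size_t i = 0; i < npages; i++)
		if (!present[i])
			return true;
	return false;
}

/* Sequence 1: hmm_range_fault(), then migrate_vma_*(). Always two walks. */
static unsigned int walks_fault_then_migrate(const bool *present, size_t npages)
{
	(void)present;
	(void)npages;
	return 2;
}

/*
 * Sequence 2: migrate_vma_*() first. If it finds non-present pages, fault
 * them in with hmm_range_fault() and run migrate_vma_*() again.
 */
static unsigned int walks_migrate_then_fault(const bool *present, size_t npages)
{
	return any_not_present(present, npages) ? 3 : 1;
}

/* Proposed: one walk that faults pages in and prepares the migration. */
static unsigned int walks_migrate_on_fault(const bool *present, size_t npages)
{
	(void)present;
	(void)npages;
	return 1;
}
```

With all pages present the model gives 2, 1 and 1 walks; with a single
non-present page it gives 2, 3 and 1.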

Add a new flag to the HMM APIs, HMM_PFN_REQ_MIGRATE,
which tells the fault path to also prepare for migration.
Also, for the migrate_vma_setup() call paths, a flag, MIGRATE_VMA_FAULT,
is added to request fault handling as part of the migration setup.

One extra benefit of migrating via the hmm_range_fault() path
is that migrate_vma.vma gets populated, so there is no need to
retrieve it separately.

Tested in an x86-64 VM with the HMM test device; the selftests pass.
For performance, the migrate throughput tests from the selftests
show similar numbers (within error margin) to an unmodified kernel.
Also tested rebased on the
"Remove device private pages from physical address space" series:
https://lore.kernel.org/linux-mm/20260130111050.53670-1-jniethe@nvidia.com/
plus a small adjustment patch, with no problems.

Changes v9-v10
  - Fix for issue Intel CI found, forgotten pte_unmap() before
    migration_entry_wait()

Changes v8-v9
  - rebase on drm-tip
  - fixed UAF around migrate_vma_split_folio() usage
  - added missing pmd unlock

Changes v7-v8
  - rebase on 7.0
  - fixed subject in two patches
  - enhanced commit messages
  - squashed patch 6 into patch 4 to fix kernel test robot warning
  - re-added dropped Cc block to the cover letter
  - fixed white space

Changes v6-v7
  - rebase on 7.0.0-rc6
  - added documentation and comments
  - denote a to-be-migrated zero page with HMM_PFN_MIGRATE alone
  - got rid of HMM_PFN_INOUT_FLAGS movement in patch 2
  - picked up Acked-By from David for patch 1
  
Changes v5-v6
  - rebase on 7.0.0-rc4
  - use range based TLB flushing while unmapping ptes
  - gate migration behind HMM_PFN_REQ_MIGRATE for fault and
    migrate paths
  - always infer migration flags from migrate->flags only

Changes v4-v5
  - rebase on 6.19
  - fixed David's email address
  - fixed link issue without CONFIG_TRANSPARENT_HUGEPAGE
  - refactored into smaller commits
  - added more comments to code

Changes v3-v4:
  - rebase on 6.19-rc8
  - fixed issues found by kernel test robot with random configs
  - fixed typos

Changes v2-v3:
  - rebase on 6.19-rc7
  - fixed issues found by kernel test robot
  - fixed smatch issues reported by Dan Carpenter <dan.carpenter@linaro.org>
  - fixes to lock handling (pmd/pte) on errors
  - added assertions for pmd/pte lock states
  - other issues discovered by Matthew, thanks!

Changes v1-v2:
  - rebase on 6.19-rc6
  - fixed issues found by kernel test robot
  - fixed locking (pmd/ptl) to cover the handle_ and prepare_ parts
    when migrating
  - other issues discovered by Matthew, thanks!

Changes RFC-v1:
  - rebase on 6.19-rc5
  - adjust for the device THP
  - changes from feedback

Revisions:
  - RFC https://lore.kernel.org/linux-mm/20250814072045.3637192-1-mpenttil@redhat.com/
  - v1: https://lore.kernel.org/all/20260114091923.3950465-1-mpenttil@redhat.com/
  - v2: https://lore.kernel.org/all/20260119112502.645059-1-mpenttil@redhat.com/
  - v3: https://lore.kernel.org/all/20260126111939.1332983-2-mpenttil@redhat.com/
  - v4: https://lore.kernel.org/all/20260202112622.2104213-1-mpenttil@redhat.com/
  - v5: https://lore.kernel.org/linux-mm/20260211081301.2940672-1-mpenttil@redhat.com/
  - v6: https://lore.kernel.org/linux-mm/20260316062407.3354636-1-mpenttil@redhat.com/
  - v7: https://lore.kernel.org/linux-mm/20260330115611.347988-1-mpenttil@redhat.com/
  - v8: https://lore.kernel.org/linux-mm/20260414041226.1539439-1-mpenttil@redhat.com/
  - v9: https://lore.kernel.org/linux-mm/20260505051658.2219537-1-mpenttil@redhat.com/

Cc: David Hildenbrand <david@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>

Mika Penttilä (5):
  mm/Kconfig: changes for migrate on fault for device pages
  mm: Add helper to convert HMM pfn to migrate pfn
  mm/hmm: do the plumbing for HMM to participate in migration
  mm: setup device page migration in HMM pagewalk
  lib/test_hmm:: add a new testcase for the migrate on fault

 include/linux/hmm.h                    |  19 +-
 include/linux/migrate.h                |  26 +-
 lib/test_hmm.c                         | 101 ++-
 lib/test_hmm_uapi.h                    |  19 +-
 mm/Kconfig                             |   2 +
 mm/hmm.c                               | 836 +++++++++++++++++++++++--
 mm/migrate_device.c                    | 583 +++--------------
 tools/testing/selftests/mm/hmm-tests.c |  54 ++
 8 files changed, 1067 insertions(+), 573 deletions(-)

drm-tip
base-commit: 94d56a898a2db27f841b17f6966a81ba502fe63c
-- 
2.50.0

Re: [PATCH v10 0/5] Migrate on fault for device pages

Posted by Balbir Singh 4 weeks, 1 day ago

On Tue, May 05, 2026 at 09:44:16PM +0300, mpenttil@redhat.com wrote:
> [...]

FYI: While testing with hmm_tests I ran into

[  107.866004] ============================================
[  107.866284] WARNING: possible recursive locking detected
[  107.866577] 7.1.0-rc3-00311-g4277273ca0e1 #12 Not tainted
[  107.866877] --------------------------------------------
[  107.867217] hmm-tests/1098 is trying to acquire lock:
[  107.867491] ffff888113571b38 (&mm->mmap_lock){++++}-{4:4}, at: dmirror_range_fault+0x147/0x610 [test_hmm] <- line 368 of lib/test_hmm.c
[  107.868076] 
[  107.868076] but task is already holding lock:
[  107.868383] ffff888113571b38 (&mm->mmap_lock){++++}-{4:4}, at: dmirror_fault_and_migrate_to_device.constprop.0+0x3aa/0x6a0 [test_hmm] <- line 1267 of lib/test_hmm.c
[  107.869076] 
[  107.869076] other info that might help us debug this:
[  107.869415]  Possible unsafe locking scenario:
[  107.869415] 
[  107.869729]        CPU0
[  107.869866]        ----
[  107.870054]   lock(&mm->mmap_lock);
[  107.870247]   lock(&mm->mmap_lock);
[  107.870436] 
[  107.870436]  *** DEADLOCK ***
[  107.870436] 
[  107.870743]  May be due to missing lock nesting notation
[  107.870743] 
[  107.871158] 1 lock held by hmm-tests/1098:
[  107.871377]  #0: ffff888113571b38 (&mm->mmap_lock){++++}-{4:4}, at: dmirror_fault_and_migrate_to_device.constprop.0+0x3aa/0x6a0 [test_hmm]
[  107.872081] 
[  107.872081] stack backtrace:
[  107.872348] CPU: 1 UID: 0 PID: 1098 Comm: hmm-tests Not tainted 7.1.0-rc3-00311-g4277273ca0e1 #12 PREEMPT(full) 
[  107.872350] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20260213-6.fc44 02/13/2026
[  107.872354] Call Trace:
[  107.872357]  <TASK>
[  107.872358]  dump_stack_lvl+0x5d/0x80
[  107.872385]  print_deadlock_bug.cold+0xc0/0xe2
[  107.872393]  __lock_acquire+0x10cf/0x1b90
[  107.872400]  lock_acquire+0x189/0x2f0
[  107.872401]  ? dmirror_range_fault+0x147/0x610 [test_hmm]
[  107.872404]  down_read+0x9b/0x4b0
[  107.872420]  ? dmirror_range_fault+0x147/0x610 [test_hmm]
[  107.872421]  ? lock_acquire+0x189/0x2f0
[  107.872422]  ? __pfx_down_read+0x10/0x10
[  107.872424]  ? __lock_acquire+0x3c2/0x1b90
[  107.872425]  dmirror_range_fault+0x147/0x610 [test_hmm]
[  107.872427]  ? __pfx_down_read+0x10/0x10
[  107.872429]  ? __pfx_dmirror_range_fault+0x10/0x10 [test_hmm]
[  107.872430]  ? __lock_acquire+0x3c2/0x1b90
[  107.872434]  dmirror_fault_and_migrate_to_device.constprop.0+0x3bf/0x6a0 [test_hmm]
[  107.872436]  ? __pfx_dmirror_fault_and_migrate_to_device.constprop.0+0x10/0x10 [test_hmm]
[  107.872439]  ? find_held_lock+0x2b/0x80
[  107.872444]  ? dmirror_device_remove_chunks+0x5b8/0xa00 [test_hmm]
[  107.872445]  ? __is_insn_slot_addr+0xee/0x1f0
[  107.872458]  ? lock_acquire+0x189/0x2f0
[  107.872460]  ? avc_has_extended_perms+0x234/0x1350
[  107.872476]  ? __might_fault+0x89/0x150
[  107.872484]  ? lock_release+0xe1/0x320
[  107.872486]  dmirror_fops_unlocked_ioctl+0x9ba/0xdb0 [test_hmm]
[  107.872488]  ? ioctl_has_perm.constprop.0.isra.0+0x2fe/0x6c0
[  107.872494]  ? __pfx_dmirror_fops_unlocked_ioctl+0x10/0x10 [test_hmm]
[  107.872498]  ? count_memcg_events_mm.constprop.0+0x22/0x1a0
[  107.872499]  ? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10
[  107.872501]  ? count_memcg_events_mm.constprop.0+0xaa/0x1a0
[  107.872503]  ? lock_release+0xe1/0x320
[  107.872504]  ? find_held_lock+0x2b/0x80
[  107.872506]  ? exc_page_fault+0x7e/0xf0
[  107.872510]  __x64_sys_ioctl+0x13c/0x1d0
[  107.872521]  ? lockdep_hardirqs_on_prepare+0xd9/0x190
[  107.872523]  do_syscall_64+0xf3/0x6a0
[  107.872526]  ? exc_page_fault+0xde/0xf0
[  107.872528]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  107.872529] RIP: 0033:0x7f7381c543ad
[  107.872531] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[  107.872532] RSP: 002b:00007ffc3160a9b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  107.872539] RAX: ffffffffffffffda RBX: 00007f7381b44000 RCX: 00007f7381c543ad
[  107.872540] RDX: 00007ffc3160aa30 RSI: 00000000c0284803 RDI: 0000000000000022
[  107.872541] RBP: 00007ffc3160aa00 R08: 00000000ffffffff R09: 0000000000000000
[  107.872541] R10: 0000000000000022 R11: 0000000000000246 R12: 00007ffc3160aa24
[  107.872542] R13: 000000000041f380 R14: 0000000000000200 R15: 00007f7381200000
[  107.872544]  </TASK>


Thanks,
Balbir

Re: [PATCH v10 0/5] Migrate on fault for device pages

Posted by Mika Penttilä 4 weeks, 1 day ago

Hi,

> FYI: While testing with hmm_tests I ran into
> [...]
>
Thanks, I could reproduce it. I had lockdep turned off, so it went unnoticed. The test suite nests mmap_read_lock; I will change that in the next version.

--Mika
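
The nesting reported above can be illustrated with a userspace toy
model. It does not show the actual fix, which is left for the next
version. It only sketches one common way to avoid the pattern: the
caller that already holds the lock calls a variant that expects the
lock to be held instead of taking it again. All toy_* names are
hypothetical and not taken from lib/test_hmm.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for mm->mmap_lock: tracks read-side nesting by one task. */
struct toy_lock {
	int read_depth;   /* how many times this task holds the lock */
	bool nested_seen; /* set when the lock is taken while already held */
};

static void toy_read_lock(struct toy_lock *l)
{
	if (l->read_depth > 0)
		l->nested_seen = true; /* the pattern lockdep warned about */
	l->read_depth++;
}

static void toy_read_unlock(struct toy_lock *l)
{
	assert(l->read_depth > 0);
	l->read_depth--;
}

/* Variant for callers that already hold the lock. */
static void toy_range_fault_locked(struct toy_lock *l)
{
	assert(l->read_depth > 0);
	/* the range walk would happen here */
}

/* Variant that takes and drops the lock itself. */
static void toy_range_fault(struct toy_lock *l)
{
	toy_read_lock(l);
	toy_range_fault_locked(l);
	toy_read_unlock(l);
}

/* Reported shape: caller holds the lock and the callee takes it again. */
static bool toy_nested_caller(void)
{
	struct toy_lock l = { 0, false };

	toy_read_lock(&l);
	toy_range_fault(&l);
	toy_read_unlock(&l);
	return l.nested_seen;
}

/* Flat shape: caller holds the lock and uses the _locked variant. */
static bool toy_flat_caller(void)
{
	struct toy_lock l = { 0, false };

	toy_read_lock(&l);
	toy_range_fault_locked(&l);
	toy_read_unlock(&l);
	return l.nested_seen;
}
```

Recursive read locking matters for a real rwsem because a writer that
queues between the two read acquisitions can block the second one while
the first is still held.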

Re: [PATCH v10 0/5] Migrate on fault for device pages

Posted by Balbir Singh 4 weeks, 1 day ago

On 5/15/26 14:05, Mika Penttilä wrote:
> Hi,
> 
>> FYI: While testing with hmm_tests I ran into
>> [...]
> Thanks, I could reproduce. Had lockdep dropped off so went unnoticed. It is nesting mmap_read_lock in the test suite, I will change that in next version.
> 
> --Mika
> 
> 

I'll wait for the next version

Balbir