Fix an AB-BA deadlock between hugetlbfs_punch_hole() and page migration.
The deadlock occurs because migration violates the lock ordering defined
in mm/rmap.c for hugetlbfs:
 * hugetlbfs PageHuge() take locks in this order:
 *   hugetlb_fault_mutex
 *     vma_lock
 *       mapping->i_mmap_rwsem
 *         folio_lock
The following trace illustrates the inversion:
Task A (punch_hole):               Task B (migration):
--------------------               -------------------
1. i_mmap_lock_write(mapping)      1. folio_lock(folio)
2. folio_lock(folio)               2. i_mmap_lock_read(mapping)
   (blocks waiting for B)             (blocks waiting for A)
Task A is blocked in the punch-hole path:
hugetlbfs_fallocate
hugetlbfs_punch_hole
hugetlbfs_zero_partial_page
folio_lock
Task B is blocked in the migration path:
migrate_pages
unmap_and_move_huge_page
remove_migration_ptes
__rmap_walk_file
i_mmap_lock_read
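As an aside, this is the classic AB-BA pattern. A minimal userspace
analogue (pthread mutexes standing in for i_mmap_rwsem and the folio
lock; an illustration only, not kernel code) hangs the same way
whenever both threads win their first lock:

	/*
	 * lock_a stands in for mapping->i_mmap_rwsem, lock_b for the
	 * folio lock. Build with: gcc -pthread abba.c
	 */
	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

	static void *task_a(void *arg)		/* punch_hole: A then B */
	{
		pthread_mutex_lock(&lock_a);
		pthread_mutex_lock(&lock_b);	/* blocks while B holds lock_b */
		pthread_mutex_unlock(&lock_b);
		pthread_mutex_unlock(&lock_a);
		return NULL;
	}

	static void *task_b(void *arg)		/* migration: B then A */
	{
		pthread_mutex_lock(&lock_b);
		pthread_mutex_lock(&lock_a);	/* blocks while A holds lock_a */
		pthread_mutex_unlock(&lock_a);
		pthread_mutex_unlock(&lock_b);
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, task_a, NULL);
		pthread_create(&b, NULL, task_b, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		printf("no deadlock on this run\n");
		return 0;
	}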
To fix this, adjust unmap_and_move_huge_page() to respect the established
hierarchy. If i_mmap_rwsem is acquired during try_to_migrate(), hold it
until remove_migration_ptes() completes.
This utilizes the existing retry logic, which unlocks the folio and
returns -EAGAIN if hugetlb_folio_mapping_lock_write() fails.
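For context, the retry path being relied on looks roughly like this (a
simplified, non-verbatim sketch of the relevant mm/migrate.c lines;
error handling condensed):

	rc = -EAGAIN;
	...
	if (!folio_test_anon(src)) {
		/* write try-lock of mapping->i_mmap_rwsem */
		mapping = hugetlb_folio_mapping_lock_write(src);
		if (unlikely(!mapping))
			/*
			 * Contended: bail out with rc still -EAGAIN; the
			 * unwind path unlocks the folio and migrate_pages()
			 * retries the whole folio.
			 */
			goto unlock_put_anon;
		ttu = TTU_RMAP_LOCKED;
	}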
Link: https://lore.kernel.org/all/68e9715a.050a0220.1186a4.000d.GAE@google.com/
Link: https://lore.kernel.org/all/20260108123957.1123502-2-wangjinchao600@gmail.com
Reported-by: syzbot+2d9c96466c978346b55f@syzkaller.appspotmail.com
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
---
mm/migrate.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 5169f9717f60..bcaa13541acc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1458,6 +1458,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
 	int page_was_mapped = 0;
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
+	enum ttu_flags ttu = 0;
 
 	if (folio_ref_count(src) == 1) {
 		/* page was freed from under us. So we are done. */
@@ -1498,8 +1499,6 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
 			goto put_anon;
 
 	if (folio_mapped(src)) {
-		enum ttu_flags ttu = 0;
-
 		if (!folio_test_anon(src)) {
 			/*
 			 * In shared mappings, try_to_unmap could potentially
@@ -1516,16 +1515,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
 
 		try_to_migrate(src, ttu);
 		page_was_mapped = 1;
-
-		if (ttu & TTU_RMAP_LOCKED)
-			i_mmap_unlock_write(mapping);
 	}
 
 	if (!folio_mapped(src))
 		rc = move_to_new_folio(dst, src, mode);
 
 	if (page_was_mapped)
-		remove_migration_ptes(src, !rc ? dst : src, 0);
+		remove_migration_ptes(src, !rc ? dst : src,
+				      ttu ? RMP_LOCKED : 0);
+
+	if (ttu & TTU_RMAP_LOCKED)
+		i_mmap_unlock_write(mapping);
 
 unlock_put_anon:
 	folio_unlock(dst);
--
2.43.0
On 1/9/26 04:47, Jinchao Wang wrote:
> Fix an AB-BA deadlock between hugetlbfs_punch_hole() and page migration.
>
[...]
>
> To fix this, adjust unmap_and_move_huge_page() to respect the established
> hierarchy. If i_mmap_rwsem is acquired during try_to_migrate(), hold it

I'm confused. Isn't it unmap_and_move_huge_page() that grabs the
i_mmap_rwsem during hugetlb_page_mapping_lock_write() (where we do a
try-lock)?

We now handle file-backed folios correctly I think. Could we somehow
also be in trouble for anon folios? Because there, we'd still take the
rmap lock after grabbing the folio lock.

-- 
Cheers

David
On Fri, Jan 09, 2026 at 02:39:08PM +0100, David Hildenbrand (Red Hat) wrote:
> On 1/9/26 04:47, Jinchao Wang wrote:
> > Fix an AB-BA deadlock between hugetlbfs_punch_hole() and page migration.
[...]
>
> I'm confused. Isn't it unmap_and_move_huge_page() that grabs the
> i_mmap_rwsem during hugetlb_page_mapping_lock_write() (where we do a
> try-lock)?

Yes, but the lock is released before remove_migration_ptes().

Task A can enter the race window between
    i_mmap_unlock_write(mapping)
and
    remove_migration_ptes() -> i_mmap_lock_read(mapping).

This window was introduced by the change below:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/mm/migrate.c?id=336bf30eb765

> We now handle file-backed folios correctly I think. Could we somehow
> also be in trouble for anon folios? Because there, we'd still take the
> rmap lock after grabbing the folio lock.
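To make the window concrete, the pre-fix flow (condensed from the diff
hunks above; a sketch, not verbatim kernel source) was roughly:

	if (folio_mapped(src)) {
		...
		try_to_migrate(src, ttu);	/* i_mmap_rwsem held for write */
		page_was_mapped = 1;

		if (ttu & TTU_RMAP_LOCKED)
			i_mmap_unlock_write(mapping);
		/*
		 * Window: punch_hole can take i_mmap_rwsem for write here,
		 * while this task still holds the folio lock ...
		 */
	}
	...
	if (page_was_mapped)
		remove_migration_ptes(src, !rc ? dst : src, 0);
		/* ... so this read-lock of i_mmap_rwsem blocks: AB-BA */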
On 1/9/26 15:16, Jinchao Wang wrote:
> On Fri, Jan 09, 2026 at 02:39:08PM +0100, David Hildenbrand (Red Hat) wrote:
[...]
> Yes, but the lock is released before remove_migration_ptes().
>
> Task A can enter the race window between
>     i_mmap_unlock_write(mapping)
> and
>     remove_migration_ptes() -> i_mmap_lock_read(mapping).
>
> This window was introduced by the change below:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/mm/migrate.c?id=336bf30eb765

try_to_migrate() is not the problem, but remove_migration_ptes()?

Anyhow, I saw that Willy sent out a version.

-- 
Cheers

David
On Fri, Jan 09, 2026 at 03:18:37PM +0100, David Hildenbrand (Red Hat) wrote:
> On 1/9/26 15:16, Jinchao Wang wrote:
[...]
> try_to_migrate() is not the problem, but remove_migration_ptes()?
>
> Anyhow, I saw that Willy sent out a version.

Thank you for letting me know.
On 1/9/26 16:32, Jinchao Wang wrote:
> On Fri, Jan 09, 2026 at 03:18:37PM +0100, David Hildenbrand (Red Hat) wrote:
[...]
>> try_to_migrate() is not the problem, but remove_migration_ptes()?
>>
>> Anyhow, I saw that Willy sent out a version.
> Thank you for letting me know.

For reference:
https://lkml.kernel.org/r/20260109041345.3863089-1-willy@infradead.org

-- 
Cheers

David
Jinchao Wang <wangjinchao600@gmail.com> writes:

> Fix an AB-BA deadlock between hugetlbfs_punch_hole() and page migration.
[...]
> Link: https://lore.kernel.org/all/68e9715a.050a0220.1186a4.000d.GAE@google.com/
> Link: https://lore.kernel.org/all/20260108123957.1123502-2-wangjinchao600@gmail.com
> Reported-by: syzbot+2d9c96466c978346b55f@syzkaller.appspotmail.com
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>

Can you provide a "Fixes:" tag? That is helpful for backporting the bug
fix.

---
Best Regards,
Huang, Ying
On Fri, Jan 09, 2026 at 02:37:28PM +0800, Huang, Ying wrote:
> Jinchao Wang <wangjinchao600@gmail.com> writes:
>
> > Fix an AB-BA deadlock between hugetlbfs_punch_hole() and page migration.
> > [...]
> >
> > Link: https://lore.kernel.org/all/68e9715a.050a0220.1186a4.000d.GAE@google.com/
> > Link: https://lore.kernel.org/all/20260108123957.1123502-2-wangjinchao600@gmail.com
> > Reported-by: syzbot+2d9c96466c978346b55f@syzkaller.appspotmail.com
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
>
> Can you provide a "Fixes:" tag? That is helpful for backporting the bug
> fix.
Thanks for the suggestion.
The deadlock appears to stem from a lock-ordering violation introduced
in commit 336bf30eb765 ("hugetlbfs: fix anon huge page migration
race"). Commit 68d32527d340 ("hugetlbfs: zero partial pages during
fallocate hole punch") was merely the first change to trigger the
crash; I believe 336bf30eb765 is the root cause.
I will add the following tag to v2:
Fixes: 336bf30eb765 ("hugetlbfs: fix anon huge page migration race")
On Fri, Jan 09, 2026 at 11:47:16AM +0800, Jinchao Wang wrote:
> Link: https://lore.kernel.org/all/68e9715a.050a0220.1186a4.000d.GAE@google.com/
> Link: https://lore.kernel.org/all/20260108123957.1123502-2-wangjinchao600@gmail.com
> Reported-by: syzbot+2d9c96466c978346b55f@syzkaller.appspotmail.com
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>

... and by "Suggested-by", you mean "completely written by", right?

Or did you change it in some way?
On Fri, Jan 09, 2026 at 04:06:22AM +0000, Matthew Wilcox wrote:
> On Fri, Jan 09, 2026 at 11:47:16AM +0800, Jinchao Wang wrote:
[...]
> ... and by "Suggested-by", you mean "completely written by", right?
>
> Or did you change it in some way?

Yes, it is completely written by you. I verified it against the
syzkaller reproducer and reviewed the code logic.

If you prefer, I am happy to update the attribution, for example by
replacing Suggested-by with Co-developed-by, or by listing you as the
author instead. I can also drop my patch if that is more appropriate.

Please let me know what you prefer. Thanks.