[RFC v2 12/21] mm: thp: handle split failure in device migration

Usama Arif posted 21 patches 1 month, 1 week ago
Posted by Usama Arif 1 month, 1 week ago
Device memory migration has two call sites that split huge PMDs:

migrate_vma_split_unmapped_folio():
  Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
  destination that doesn't support compound pages.  It splits the PMD
  then splits the folio via folio_split_unmapped().

  If the PMD split fails, folio_split_unmapped() would operate on an
  unsplit folio with inconsistent page table state.  Propagate -ENOMEM
  to skip this page's migration; this matches how a folio_split_unmapped()
  failure is already propagated.

migrate_vma_insert_page():
  Called from migrate_vma_pages() when inserting a page into a VMA
  during migration back from device memory.  If a huge zero PMD exists
  at the target address, it must be split before PTE insertion.

  If the split fails, the subsequent pte_alloc() and set_pte_at() would
  operate on a PMD slot still occupied by the huge zero entry.  Use
  goto abort, consistent with other allocation failures in this function.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/migrate_device.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 78c7acf024615..bc53e06fd9735 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
 	int ret = 0;
 
 	folio_get(folio);
-	split_huge_pmd_address(migrate->vma, addr, true);
+	/*
+	 * If PMD split fails, folio_split_unmapped would operate on an
+	 * unsplit folio with inconsistent page table state.
+	 */
+	ret = split_huge_pmd_address(migrate->vma, addr, true);
+	if (ret)
+		return ret;
 	ret = folio_split_unmapped(folio, 0);
 	if (ret)
 		return ret;
@@ -1005,7 +1011,13 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		if (pmd_trans_huge(*pmdp)) {
 			if (!is_huge_zero_pmd(*pmdp))
 				goto abort;
-			split_huge_pmd(vma, pmdp, addr);
+			/*
+			 * If split fails, the huge zero PMD remains and
+			 * pte_alloc/PTE insertion that follows would be
+			 * incorrect.
+			 */
+			if (split_huge_pmd(vma, pmdp, addr))
+				goto abort;
 		} else if (pmd_leaf(*pmdp))
 			goto abort;
 	}
-- 
2.47.3
Re: [RFC v2 12/21] mm: thp: handle split failure in device migration
Posted by Nico Pache 1 month ago
On Thu, Feb 26, 2026 at 4:34 AM Usama Arif <usama.arif@linux.dev> wrote:
>
> Device memory migration has two call sites that split huge PMDs:
>
> migrate_vma_split_unmapped_folio():
>   Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
>   destination that doesn't support compound pages.  It splits the PMD
>   then splits the folio via folio_split_unmapped().
>
>   If the PMD split fails, folio_split_unmapped() would operate on an
>   unsplit folio with inconsistent page table state.  Propagate -ENOMEM
>   to skip this page's migration. This is safe as folio_split_unmapped
>   failure would be propagated in a similar way.
>
> migrate_vma_insert_page():
>   Called from migrate_vma_pages() when inserting a page into a VMA
>   during migration back from device memory.  If a huge zero PMD exists
>   at the target address, it must be split before PTE insertion.
>
>   If the split fails, the subsequent pte_alloc() and set_pte_at() would
>   operate on a PMD slot still occupied by the huge zero entry.  Use
>   goto abort, consistent with other allocation failures in this function.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>  mm/migrate_device.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 78c7acf024615..bc53e06fd9735 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>         int ret = 0;
>
>         folio_get(folio);

Should we be concerned about this folio_get? Are we incrementing a
reference that was already held if we back out of the split?

-- Nico

> -       split_huge_pmd_address(migrate->vma, addr, true);
> +       /*
> +        * If PMD split fails, folio_split_unmapped would operate on an
> +        * unsplit folio with inconsistent page table state.
> +        */
> +       ret = split_huge_pmd_address(migrate->vma, addr, true);
> +       if (ret)
> +               return ret;
>         ret = folio_split_unmapped(folio, 0);
>         if (ret)
>                 return ret;
> @@ -1005,7 +1011,13 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>                 if (pmd_trans_huge(*pmdp)) {
>                         if (!is_huge_zero_pmd(*pmdp))
>                                 goto abort;
> -                       split_huge_pmd(vma, pmdp, addr);
> +                       /*
> +                        * If split fails, the huge zero PMD remains and
> +                        * pte_alloc/PTE insertion that follows would be
> +                        * incorrect.
> +                        */
> +                       if (split_huge_pmd(vma, pmdp, addr))
> +                               goto abort;
>                 } else if (pmd_leaf(*pmdp))
>                         goto abort;
>         }
> --
> 2.47.3
>
Re: [RFC v2 12/21] mm: thp: handle split failure in device migration
Posted by Usama Arif 1 month ago

On 02/03/2026 21:20, Nico Pache wrote:
> On Thu, Feb 26, 2026 at 4:34 AM Usama Arif <usama.arif@linux.dev> wrote:
>>
>> Device memory migration has two call sites that split huge PMDs:
>>
>> migrate_vma_split_unmapped_folio():
>>   Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
>>   destination that doesn't support compound pages.  It splits the PMD
>>   then splits the folio via folio_split_unmapped().
>>
>>   If the PMD split fails, folio_split_unmapped() would operate on an
>>   unsplit folio with inconsistent page table state.  Propagate -ENOMEM
>>   to skip this page's migration. This is safe as folio_split_unmapped
>>   failure would be propagated in a similar way.
>>
>> migrate_vma_insert_page():
>>   Called from migrate_vma_pages() when inserting a page into a VMA
>>   during migration back from device memory.  If a huge zero PMD exists
>>   at the target address, it must be split before PTE insertion.
>>
>>   If the split fails, the subsequent pte_alloc() and set_pte_at() would
>>   operate on a PMD slot still occupied by the huge zero entry.  Use
>>   goto abort, consistent with other allocation failures in this function.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>  mm/migrate_device.c | 16 ++++++++++++++--
>>  1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 78c7acf024615..bc53e06fd9735 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>>         int ret = 0;
>>
>>         folio_get(folio);
> 
> Should we be concerned about this folio_get? Are we incrementing a
> reference that was already held if we back out of the split?
> 
> -- Nico



Hi Nico,

Thanks for pointing this out. It spun out into an entire investigation for me [1].

Similar to [1], I inserted trace prints [2] and created a new __split_huge_pmd2
that always returns -ENOMEM. Without folio_put on error [3], we get a refcount of 2.

       hmm-tests-129     [000] .l...     1.485514: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 dst=ffb48827440e8000 src==dst=1 refcount_src=2 mapcount_src=0 order_src=9 migrate=0 BEFORE remove_migration_ptes
       hmm-tests-129     [000] .l...     1.485517: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 refcount=3 mapcount=1 AFTER remove_migration_ptes
       hmm-tests-129     [000] .l...     1.485518: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 refcount=2 AFTER folio_put(src)


With folio_put on error [4], we get a refcount of 1.

       hmm-tests-129     [001] .....     1.492216: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 dst=fff7b8be840f0000 src==dst=1 refcount_src=1 mapcount_src=0 order_src=9 migrate=0 BEFORE remove_migration_ptes
       hmm-tests-129     [001] .....     1.492219: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 refcount=2 mapcount=1 AFTER remove_migration_ptes
       hmm-tests-129     [001] .....     1.492220: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 refcount=1 AFTER folio_put(src)


So we need folio_put for split_huge_pmd_address failure, but NOT for
folio_split_unmapped.


[1] https://lore.kernel.org/all/332c9e16-46c3-4e1c-898e-2cb0a87ba1fc@linux.dev/
[2] https://gist.github.com/uarif1/6abe4bedb85814e9be8d48a4fe742b41
[3] https://gist.github.com/uarif1/f718af2113bc1a33484674b61b9dafcc
[4] https://gist.github.com/uarif1/03c42f2549eaf2bc555e8b03e07a63c8
Re: [RFC v2 12/21] mm: thp: handle split failure in device migration
Posted by Nico Pache 4 weeks, 1 day ago
On Thu, Mar 5, 2026 at 9:55 AM Usama Arif <usama.arif@linux.dev> wrote:
>
>
>
> On 02/03/2026 21:20, Nico Pache wrote:
> > On Thu, Feb 26, 2026 at 4:34 AM Usama Arif <usama.arif@linux.dev> wrote:
> >>
> >> Device memory migration has two call sites that split huge PMDs:
> >>
> >> migrate_vma_split_unmapped_folio():
> >>   Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
> >>   destination that doesn't support compound pages.  It splits the PMD
> >>   then splits the folio via folio_split_unmapped().
> >>
> >>   If the PMD split fails, folio_split_unmapped() would operate on an
> >>   unsplit folio with inconsistent page table state.  Propagate -ENOMEM
> >>   to skip this page's migration. This is safe as folio_split_unmapped
> >>   failure would be propagated in a similar way.
> >>
> >> migrate_vma_insert_page():
> >>   Called from migrate_vma_pages() when inserting a page into a VMA
> >>   during migration back from device memory.  If a huge zero PMD exists
> >>   at the target address, it must be split before PTE insertion.
> >>
> >>   If the split fails, the subsequent pte_alloc() and set_pte_at() would
> >>   operate on a PMD slot still occupied by the huge zero entry.  Use
> >>   goto abort, consistent with other allocation failures in this function.
> >>
> >> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> >> ---
> >>  mm/migrate_device.c | 16 ++++++++++++++--
> >>  1 file changed, 14 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> index 78c7acf024615..bc53e06fd9735 100644
> >> --- a/mm/migrate_device.c
> >> +++ b/mm/migrate_device.c
> >> @@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
> >>         int ret = 0;
> >>
> >>         folio_get(folio);
> >
> > Should we be concerned about this folio_get? Are we incrementing a
> > reference that was already held if we back out of the split?
> >
> > -- Nico
>
>
>
> Hi Nico,
>
> Thanks for pointing this out. It spun out to an entire investigation for me [1].

Hey Usama,

I'm sorry my question led you down a rabbit hole, but I'm glad you did
the proper investigation and found the correct answer :) Thanks for
looking into it and for clearing that up via a comment!

Cheers,
-- Nico

>
> Similar to [1], I inserted trace prints [2] and created a new __split_huge_pmd2
> that always returns -ENOMEM. Without folio_put on error [3], we get a refcount of 2.
>
>        hmm-tests-129     [000] .l...     1.485514: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 dst=ffb48827440e8000 src==dst=1 refcount_src=2 mapcount_src=0 order_src=9 migrate=0 BEFORE remove_migration_ptes
>        hmm-tests-129     [000] .l...     1.485517: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 refcount=3 mapcount=1 AFTER remove_migration_ptes
>        hmm-tests-129     [000] .l...     1.485518: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 refcount=2 AFTER folio_put(src)
>
>
> With folio_put on error [4], we get a refcount of 1.
>
>        hmm-tests-129     [001] .....     1.492216: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 dst=fff7b8be840f0000 src==dst=1 refcount_src=1 mapcount_src=0 order_src=9 migrate=0 BEFORE remove_migration_ptes
>        hmm-tests-129     [001] .....     1.492219: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 refcount=2 mapcount=1 AFTER remove_migration_ptes
>        hmm-tests-129     [001] .....     1.492220: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 refcount=1 AFTER folio_put(src)
>
>
> So we need folio_put for split_huge_pmd_address failure, but NOT for
> folio_split_unmapped.
>
>
> [1] https://lore.kernel.org/all/332c9e16-46c3-4e1c-898e-2cb0a87ba1fc@linux.dev/
> [2] https://gist.github.com/uarif1/6abe4bedb85814e9be8d48a4fe742b41
> [3] https://gist.github.com/uarif1/f718af2113bc1a33484674b61b9dafcc
> [4] https://gist.github.com/uarif1/03c42f2549eaf2bc555e8b03e07a63c8
>
Re: [RFC v2 12/21] mm: thp: handle split failure in device migration
Posted by Usama Arif 4 weeks ago

On 09/03/2026 18:09, Nico Pache wrote:
> On Thu, Mar 5, 2026 at 9:55 AM Usama Arif <usama.arif@linux.dev> wrote:
>>
>>
>>
>> On 02/03/2026 21:20, Nico Pache wrote:
>>> On Thu, Feb 26, 2026 at 4:34 AM Usama Arif <usama.arif@linux.dev> wrote:
>>>>
>>>> Device memory migration has two call sites that split huge PMDs:
>>>>
>>>> migrate_vma_split_unmapped_folio():
>>>>   Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
>>>>   destination that doesn't support compound pages.  It splits the PMD
>>>>   then splits the folio via folio_split_unmapped().
>>>>
>>>>   If the PMD split fails, folio_split_unmapped() would operate on an
>>>>   unsplit folio with inconsistent page table state.  Propagate -ENOMEM
>>>>   to skip this page's migration. This is safe as folio_split_unmapped
>>>>   failure would be propagated in a similar way.
>>>>
>>>> migrate_vma_insert_page():
>>>>   Called from migrate_vma_pages() when inserting a page into a VMA
>>>>   during migration back from device memory.  If a huge zero PMD exists
>>>>   at the target address, it must be split before PTE insertion.
>>>>
>>>>   If the split fails, the subsequent pte_alloc() and set_pte_at() would
>>>>   operate on a PMD slot still occupied by the huge zero entry.  Use
>>>>   goto abort, consistent with other allocation failures in this function.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>>> ---
>>>>  mm/migrate_device.c | 16 ++++++++++++++--
>>>>  1 file changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 78c7acf024615..bc53e06fd9735 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>>>>         int ret = 0;
>>>>
>>>>         folio_get(folio);
>>>
>>> Should we be concerned about this folio_get? Are we incrementing a
>>> reference that was already held if we back out of the split?
>>>
>>> -- Nico
>>
>>
>>
>> Hi Nico,
>>
>> Thanks for pointing this out. It spun out to an entire investigation for me [1].
> 
> Hey Usama,
> 
> I'm sorry my question lead you down a rabbit hole but I'm glad you did
> the proper investigation and found the correct answer :) Thanks for
> looking into it and for clearing that up via a comment!
> 
> Cheers,
> -- Nico
>

Thanks for the review! This patch needs a folio_put specifically for
the split_huge_pmd_address failure path, which I wouldn't have figured
out without your comment, so I really appreciate it!

Usama
Re: [RFC v2 12/21] mm: thp: handle split failure in device migration
Posted by Usama Arif 1 month ago

On 02/03/2026 21:20, Nico Pache wrote:
> On Thu, Feb 26, 2026 at 4:34 AM Usama Arif <usama.arif@linux.dev> wrote:
>>
>> Device memory migration has two call sites that split huge PMDs:
>>
>> migrate_vma_split_unmapped_folio():
>>   Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
>>   destination that doesn't support compound pages.  It splits the PMD
>>   then splits the folio via folio_split_unmapped().
>>
>>   If the PMD split fails, folio_split_unmapped() would operate on an
>>   unsplit folio with inconsistent page table state.  Propagate -ENOMEM
>>   to skip this page's migration. This is safe as folio_split_unmapped
>>   failure would be propagated in a similar way.
>>
>> migrate_vma_insert_page():
>>   Called from migrate_vma_pages() when inserting a page into a VMA
>>   during migration back from device memory.  If a huge zero PMD exists
>>   at the target address, it must be split before PTE insertion.
>>
>>   If the split fails, the subsequent pte_alloc() and set_pte_at() would
>>   operate on a PMD slot still occupied by the huge zero entry.  Use
>>   goto abort, consistent with other allocation failures in this function.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>  mm/migrate_device.c | 16 ++++++++++++++--
>>  1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 78c7acf024615..bc53e06fd9735 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>>         int ret = 0;
>>
>>         folio_get(folio);
> 
> Should we be concerned about this folio_get? Are we incrementing a
> reference that was already held if we back out of the split?
> 

Good catch! I think this bug existed even before this patch: if
folio_split_unmapped() fails, the reference is still held. Let me
send an independent fix for this.

> -- Nico
> 
>> -       split_huge_pmd_address(migrate->vma, addr, true);
>> +       /*
>> +        * If PMD split fails, folio_split_unmapped would operate on an
>> +        * unsplit folio with inconsistent page table state.
>> +        */
>> +       ret = split_huge_pmd_address(migrate->vma, addr, true);
>> +       if (ret)
>> +               return ret;
>>         ret = folio_split_unmapped(folio, 0);
>>         if (ret)
>>                 return ret;