Will Deacon reports:-
When taking a longterm GUP pin via pin_user_pages(),
__gup_longterm_locked() tries to migrate target folios that should not
be longterm pinned, for example because they reside in a CMA region or
movable zone. This is done by first pinning all of the target folios
anyway, collecting all of the longterm-unpinnable target folios into a
list, dropping the pins that were just taken and finally handing the
list off to migrate_pages() for the actual migration.
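
In outline, and purely as an illustrative sketch (the real mm/gup.c
helpers work through a pages_or_folios abstraction and also handle
device-coherent folios and error unwinding; names other than the
migration and pinnability helpers are condensed here):

    static long longterm_pin_migrate_sketch(struct folio **folios, long nr)
    {
        LIST_HEAD(unpinnable);
        struct migration_target_control mtc = {
                .nid = NUMA_NO_NODE,
                .gfp_mask = GFP_USER | __GFP_NOWARN,
        };
        long i;

        /* every folio in @folios already carries a pin at this point */
        for (i = 0; i < nr; i++) {
                if (folio_is_longterm_pinnable(folios[i]))
                        continue;
                /* take it off the LRU and remember it for migration
                 * (isolation failure is itself an error in the real code) */
                if (folio_isolate_lru(folios[i]))
                        list_add_tail(&folios[i]->lru, &unpinnable);
        }
        if (list_empty(&unpinnable))
                return 0;

        unpin_folios(folios, nr);       /* drop the pins just taken */

        /* any failure here is reported back as -ENOMEM */
        if (migrate_pages(&unpinnable, alloc_migration_target, NULL,
                          (unsigned long)&mtc, MIGRATE_SYNC,
                          MR_LONGTERM_PIN, NULL))
                return -ENOMEM;
        return -EAGAIN;                 /* caller retries the pin */
    }
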
It is critically important that no unexpected references are held on the
folios being migrated, otherwise the migration will fail and
pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
relatively easy to observe migration failures when running pKVM (which
uses pin_user_pages() on crosvm's virtual address space to resolve
stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
this results in the VM terminating prematurely.
In the failure case, 'crosvm' has called mlock2(MLOCK_ONFAULT) on its
mapping of guest memory prior to the pinning. Subsequently, when
pin_user_pages() walks the page-table, the relevant 'pte' is not
present and so the faulting logic allocates a new folio, mlocks it
with mlock_folio() and maps it in the page-table.
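
The same two ingredients can be lined up from userspace. A minimal
sketch, assuming liburing is available, with io_uring buffer
registration standing in for pKVM's stage-2 fault handling (both end
up in pin_user_pages() with FOLL_LONGTERM); whether migration is
attempted at all still depends on the folios landing in a CMA region
or ZONE_MOVABLE:

    #define _GNU_SOURCE
    #include <liburing.h>
    #include <sys/mman.h>
    #include <sys/uio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t len = 2UL << 20;
        void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct iovec iov = { .iov_base = mem, .iov_len = len };
        struct io_uring ring;

        if (mem == MAP_FAILED)
                return EXIT_FAILURE;

        /* lock the range, but defer population to first fault */
        if (mlock2(mem, len, MLOCK_ONFAULT))
                return EXIT_FAILURE;

        io_uring_queue_init(8, &ring, 0);

        /*
         * Registration longterm-pins the range; faulting the pages in
         * here runs them through mlock_folio() and its per-CPU batch,
         * so a migration attempt can see the elevated refcount and
         * fail, surfacing as -ENOMEM from the registration.
         */
        return io_uring_register_buffers(&ring, &iov, 1) ? EXIT_FAILURE : 0;
    }
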
Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
batch by pagevec"), mlock/munlock operations on a folio (formerly page)
are deferred. For example, mlock_folio() takes an additional reference
on the target folio before placing it into a per-cpu 'folio_batch' for
later processing by mlock_folio_batch(), which drops the refcount once
the operation is complete. Processing of the batches is coupled with
the LRU batch logic and can be forcefully drained with
lru_add_drain_all() but as long as a folio remains unprocessed on the
batch, its refcount will be elevated.
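
In outline (a condensed sketch of the mm/mlock.c code described above;
the real implementation guards the per-CPU batch with a local_lock and
encodes mlock-vs-munlock state in low bits of each batch entry):

    void mlock_folio_sketch(struct folio *folio)
    {
        /* per-CPU batch; 'mlock_fbatch' follows the mm/mlock.c naming */
        struct folio_batch *fbatch = this_cpu_ptr(&mlock_fbatch);

        /* this reference keeps the refcount elevated until draining */
        folio_get(folio);

        /* folio_batch_add() returns the space left: 0 means full */
        if (!folio_batch_add(fbatch, folio))
                mlock_folio_batch(fbatch);      /* moves folios to the
                                                   unevictable LRU and
                                                   folio_put()s each one */
    }
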
This deferred batching therefore interacts poorly with the pKVM pinning
scenario as we can find ourselves in a situation where the migration
code fails to migrate a folio due to the elevated refcount from the
pending mlock operation.
Hugh Dickins adds:-
!folio_test_lru() has never been a very reliable way to tell if an
lru_add_drain_all() is worth calling, to remove LRU cache references
to make the folio migratable: the LRU flag may be set even while the
folio is held with an extra reference in a per-CPU LRU cache.
5.18 commit 2fbb0c10d1e8 may have made it more unreliable. Then 6.11
commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding
to LRU batch") tried to make it reliable, by moving LRU flag clearing;
but missed the mlock/munlock batches, so still unreliable as reported.
And it turns out to be difficult to extend 33dfe9204f29's LRU flag
clearing to the mlock/munlock batches: if they do benefit from batching,
mlock/munlock cannot be so effective if that batching is simply
suppressed whenever the folio is !LRU.
Instead, switch to an expected ref_count check, which was more reliable
all along: some more false positives (unhelpful drains) than before, and
never a guarantee that the folio will prove migratable, but better.
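
Concretely, taking the simple assumed case of an order-0 anonymous
folio mapped by a single PTE and not in the swap cache:

    /* folio_expected_ref_count() totals the references mm can account
     * for; here that is just the one page-table mapping, so it is 1. */
    const int expected = folio_expected_ref_count(folio);

    /*
     * The +1 covers the pin GUP itself has just taken.  Any surplus is
     * a hidden reference -- for instance the one mlock_folio() parked
     * in its per-CPU batch -- so a drain is worth attempting.
     */
    if (drain_allow && folio_ref_count(folio) != expected + 1) {
        lru_add_drain_all();
        drain_allow = false;
    }
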
Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm:
add folio_expected_ref_count() for reference count calculation") and
6.17 commit ("mm: fix folio_expected_ref_count() when PG_private_2").
Reported-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/
Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
mm/gup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..82aec6443c0a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
 			continue;
 		}
 
-		if (!folio_test_lru(folio) && drain_allow) {
+		if (drain_allow && folio_ref_count(folio) !=
+				   folio_expected_ref_count(folio) + 1) {
 			lru_add_drain_all();
 			drain_allow = false;
 		}
--
2.51.0
On 31.08.25 11:05, Hugh Dickins wrote:
> [...]
> @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
>  			continue;
>  		}
>  
> -		if (!folio_test_lru(folio) && drain_allow) {
> +		if (drain_allow && folio_ref_count(folio) !=
> +				   folio_expected_ref_count(folio) + 1) {
>  			lru_add_drain_all();
>  			drain_allow = false;
>  		}

In general, to the fix idea

Acked-by: David Hildenbrand <david@redhat.com>

But as raised in reply to patch #1, we have to be a bit careful about
including private_2 in folio_expected_ref_count() at this point.

If we cannot include it in folio_expected_ref_count(), it's all going
to be a mess until PG_private_2 is removed for good.

So that part still needs to be figured out.

-- 
Cheers

David / dhildenb
On Mon, 1 Sep 2025, David Hildenbrand wrote:
> On 31.08.25 11:05, Hugh Dickins wrote:
> > [...]
>
> In general, to the fix idea
>
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks, but I'd better not assume that in v2, even though code the same.
Will depend on how you feel about added paragraph in v2 commit message.

> But as raised in reply to patch #1, we have to be a bit careful about
> including private_2 in folio_expected_ref_count() at this point.
>
> If we cannot include it in folio_expected_ref_count(), it's all going
> to be a mess until PG_private_2 is removed for good.
>
> So that part still needs to be figured out.

Here's that added paragraph:

Note on PG_private_2: ceph and nfs are still using the deprecated
PG_private_2 flag, with the aid of netfs and filemap support functions.
Although it is consistently matched by an increment of folio ref_count,
folio_expected_ref_count() intentionally does not recognize it, and ceph
folio migration currently depends on that for PG_private_2 folios to be
rejected.  New references to the deprecated flag are discouraged, so do
not add it into the collect_longterm_unpinnable_folios() calculation:
but longterm pinning of transiently PG_private_2 ceph and nfs folios
(an uncommon case) may invoke a redundant lru_add_drain_all().

And this makes easy the backport to earlier releases: up to and
including 6.12, btrfs also used PG_private_2, but without a ref_count
increment.

Hugh
On 08.09.25 12:40, Hugh Dickins wrote:
> [...]
> Here's that added paragraph:
>
> Note on PG_private_2: ceph and nfs are still using the deprecated
> PG_private_2 flag, with the aid of netfs and filemap support functions.
> [...]
> but longterm pinning of transiently PG_private_2 ceph and nfs folios
> (an uncommon case) may invoke a redundant lru_add_drain_all().

Would we also loop forever trying to migrate these folios if they reside
on ZONE_MOVABLE? I would assume that is already the case, that migration
will always fail due to the raised reference.

-- 
Cheers

David / dhildenb
On Mon, 8 Sep 2025, David Hildenbrand wrote:
> [...]
> Would we also loop forever trying to migrate these folios if they reside
> on ZONE_MOVABLE? I would assume that is already the case, that migration
> will always fail due to the raised reference.

Loop around forever? That would be unfortunate (but I presume killable).
But when I looked, it appeared that any failure of migrate_pages() there
gets reported as -ENOMEM, which would end up as an OOM? But you know
mm/gup.c very much better than I do.

If it does loop around, it's not so bad in the PG_private_2 case, because
that's (nowadays always) a transient flag, much more like PG_writeback
than PG_private.

But whatever, yes, the move from testing lru to checking ref_count makes
no difference to that: the failure occurs in migration either way.

Hugh
On 08.09.25 21:57, Hugh Dickins wrote:
> [...]
> Loop around forever? That would be unfortunate (but I presume killable).
> But when I looked, it appeared that any failure of migrate_pages() there
> gets reported as -ENOMEM, which would end up as an OOM? But you know
> mm/gup.c very much better than I do.

Yes, like I expected, we just bail out. __gup_longterm_locked() will not
retry in that case.

It's interesting that any migration failure is treated as -ENOMEM, but
well, that's certainly material for a completely different discussion.

Thanks Hugh!

-- 
Cheers

David / dhildenb