Will Deacon reports:-
When taking a longterm GUP pin via pin_user_pages(),
__gup_longterm_locked() tries to migrate target folios that should not
be longterm pinned, for example because they reside in a CMA region or
movable zone. This is done by first pinning all of the target folios
anyway, collecting all of the longterm-unpinnable target folios into a
list, dropping the pins that were just taken and finally handing the
list off to migrate_pages() for the actual migration.
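
In outline, and purely as an illustrative sketch (the real mm/gup.c
helpers work through a pages_or_folios abstraction and also handle
device-coherent folios and error unwinding; names other than the
migration and pinnability helpers are condensed here):

    static long longterm_pin_migrate_sketch(struct folio **folios, long nr)
    {
        LIST_HEAD(unpinnable);
        struct migration_target_control mtc = {
                .nid = NUMA_NO_NODE,
                .gfp_mask = GFP_USER | __GFP_NOWARN,
        };
        long i;

        /* every folio in @folios already carries a pin at this point */
        for (i = 0; i < nr; i++) {
                if (folio_is_longterm_pinnable(folios[i]))
                        continue;
                /* take it off the LRU and remember it for migration
                 * (isolation failure is itself an error in the real code) */
                if (folio_isolate_lru(folios[i]))
                        list_add_tail(&folios[i]->lru, &unpinnable);
        }
        if (list_empty(&unpinnable))
                return 0;

        unpin_folios(folios, nr);       /* drop the pins just taken */

        /* any failure here is reported back as -ENOMEM */
        if (migrate_pages(&unpinnable, alloc_migration_target, NULL,
                          (unsigned long)&mtc, MIGRATE_SYNC,
                          MR_LONGTERM_PIN, NULL))
                return -ENOMEM;
        return -EAGAIN;                 /* caller retries the pin */
    }
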
It is critically important that no unexpected references are held on the
folios being migrated, otherwise the migration will fail and
pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
relatively easy to observe migration failures when running pKVM (which
uses pin_user_pages() on crosvm's virtual address space to resolve
stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
this results in the VM terminating prematurely.
In the failure case, 'crosvm' has called mlock2(MLOCK_ONFAULT) on its
mapping of guest memory prior to the pinning. Subsequently, when
pin_user_pages() walks the page-table, the relevant 'pte' is not
present and so the faulting logic allocates a new folio, mlocks it
with mlock_folio() and maps it in the page-table.
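
The same two ingredients can be lined up from userspace. A minimal
sketch, assuming liburing is available, with io_uring buffer
registration standing in for pKVM's stage-2 fault handling (both end
up in pin_user_pages() with FOLL_LONGTERM); whether migration is
attempted at all still depends on the folios landing in a CMA region
or ZONE_MOVABLE:

    #define _GNU_SOURCE
    #include <liburing.h>
    #include <sys/mman.h>
    #include <sys/uio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t len = 2UL << 20;
        void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct iovec iov = { .iov_base = mem, .iov_len = len };
        struct io_uring ring;

        if (mem == MAP_FAILED)
                return EXIT_FAILURE;

        /* lock the range, but defer population to first fault */
        if (mlock2(mem, len, MLOCK_ONFAULT))
                return EXIT_FAILURE;

        io_uring_queue_init(8, &ring, 0);

        /*
         * Registration longterm-pins the range; faulting the pages in
         * here runs them through mlock_folio() and its per-CPU batch,
         * so a migration attempt can see the elevated refcount and
         * fail, surfacing as -ENOMEM from the registration.
         */
        return io_uring_register_buffers(&ring, &iov, 1) ? EXIT_FAILURE : 0;
    }
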
Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
batch by pagevec"), mlock/munlock operations on a folio (formerly page)
are deferred. For example, mlock_folio() takes an additional reference
on the target folio before placing it into a per-cpu 'folio_batch' for
later processing by mlock_folio_batch(), which drops the refcount once
the operation is complete. Processing of the batches is coupled with
the LRU batch logic and can be forcefully drained with
lru_add_drain_all() but as long as a folio remains unprocessed on the
batch, its refcount will be elevated.
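
In outline (a condensed sketch of the mm/mlock.c code described above;
the real implementation guards the per-CPU batch with a local_lock and
encodes mlock-vs-munlock state in low bits of each batch entry):

    void mlock_folio_sketch(struct folio *folio)
    {
        /* per-CPU batch; 'mlock_fbatch' follows the mm/mlock.c naming */
        struct folio_batch *fbatch = this_cpu_ptr(&mlock_fbatch);

        /* this reference keeps the refcount elevated until draining */
        folio_get(folio);

        /* folio_batch_add() returns the space left: 0 means full */
        if (!folio_batch_add(fbatch, folio))
                mlock_folio_batch(fbatch);      /* moves folios to the
                                                   unevictable LRU and
                                                   folio_put()s each one */
    }
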
This deferred batching therefore interacts poorly with the pKVM pinning
scenario as we can find ourselves in a situation where the migration
code fails to migrate a folio due to the elevated refcount from the
pending mlock operation.
Hugh Dickins adds:-
!folio_test_lru() has never been a very reliable way to tell if an
lru_add_drain_all() is worth calling, to remove LRU cache references
to make the folio migratable: the LRU flag may be set even while the
folio is held with an extra reference in a per-CPU LRU cache.
5.18 commit 2fbb0c10d1e8 may have made it more unreliable. Then 6.11
commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding
to LRU batch") tried to make it reliable, by moving LRU flag clearing;
but missed the mlock/munlock batches, so still unreliable as reported.
And it turns out to be difficult to extend 33dfe9204f29's LRU flag
clearing to the mlock/munlock batches: if they do benefit from batching,
mlock/munlock cannot be so effective if that batching is simply
suppressed whenever the folio is !LRU.
Instead, switch to an expected ref_count check, which was more reliable
all along: some more false positives (unhelpful drains) than before, and
never a guarantee that the folio will prove migratable, but better.
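
Concretely, taking the simple assumed case of an order-0 anonymous
folio mapped by a single PTE and not in the swap cache:

    /* folio_expected_ref_count() totals the references mm can account
     * for; here that is just the one page-table mapping, so it is 1. */
    const int expected = folio_expected_ref_count(folio);

    /*
     * The +1 covers the pin GUP itself has just taken.  Any surplus is
     * a hidden reference -- for instance the one mlock_folio() parked
     * in its per-CPU batch -- so a drain is worth attempting.
     */
    if (drain_allow && folio_ref_count(folio) != expected + 1) {
        lru_add_drain_all();
        drain_allow = false;
    }
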
Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm:
add folio_expected_ref_count() for reference count calculation") and
6.17 commit ("mm: fix folio_expected_ref_count() when PG_private_2").
Reported-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/
Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
mm/gup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..82aec6443c0a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
 			continue;
 		}
 
-		if (!folio_test_lru(folio) && drain_allow) {
+		if (drain_allow && folio_ref_count(folio) !=
+				   folio_expected_ref_count(folio) + 1) {
 			lru_add_drain_all();
 			drain_allow = false;
 		}
--
2.51.0
On 31.08.25 11:05, Hugh Dickins wrote:
> [...]
> @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
>  			continue;
>  		}
>  
> -		if (!folio_test_lru(folio) && drain_allow) {
> +		if (drain_allow && folio_ref_count(folio) !=
> +				   folio_expected_ref_count(folio) + 1) {
>  			lru_add_drain_all();
>  			drain_allow = false;
>  		}

In general, to the fix idea

Acked-by: David Hildenbrand <david@redhat.com>

But as raised in reply to patch #1, we have to be a bit careful about
including private_2 in folio_expected_ref_count() at this point.

If we cannot include it in folio_expected_ref_count(), it's all going
to be a mess until PG_private_2 is removed for good.

So that part still needs to be figured out.

-- 
Cheers

David / dhildenb
On Mon, 1 Sep 2025, David Hildenbrand wrote:
> On 31.08.25 11:05, Hugh Dickins wrote:
> > [...]
>
> In general, to the fix idea
>
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks, but I'd better not assume that in v2, even though code the same.
Will depend on how you feel about added paragraph in v2 commit message.

> But as raised in reply to patch #1, we have to be a bit careful about
> including private_2 in folio_expected_ref_count() at this point.
>
> If we cannot include it in folio_expected_ref_count(), it's all going
> to be a mess until PG_private_2 is removed for good.
>
> So that part still needs to be figured out.

Here's that added paragraph:

Note on PG_private_2: ceph and nfs are still using the deprecated
PG_private_2 flag, with the aid of netfs and filemap support functions.
Although it is consistently matched by an increment of folio ref_count,
folio_expected_ref_count() intentionally does not recognize it, and ceph
folio migration currently depends on that for PG_private_2 folios to be
rejected.  New references to the deprecated flag are discouraged, so do
not add it into the collect_longterm_unpinnable_folios() calculation:
but longterm pinning of transiently PG_private_2 ceph and nfs folios
(an uncommon case) may invoke a redundant lru_add_drain_all().

And this makes easy the backport to earlier releases: up to and
including 6.12, btrfs also used PG_private_2, but without a ref_count
increment.

Hugh
On 08.09.25 12:40, Hugh Dickins wrote:
> [...]
> Here's that added paragraph:
>
> Note on PG_private_2: ceph and nfs are still using the deprecated
> PG_private_2 flag, with the aid of netfs and filemap support functions.
> [...]
> but longterm pinning of transiently PG_private_2 ceph and nfs folios
> (an uncommon case) may invoke a redundant lru_add_drain_all().

Would we also loop forever trying to migrate these folios if they reside
on ZONE_MOVABLE? I would assume that is already the case, that migration
will always fail due to the raised reference.

-- 
Cheers

David / dhildenb
On Mon, 8 Sep 2025, David Hildenbrand wrote:
> [...]
> Would we also loop forever trying to migrate these folios if they reside
> on ZONE_MOVABLE? I would assume that is already the case, that migration
> will always fail due to the raised reference.

Loop around forever? That would be unfortunate (but I presume killable).
But when I looked, it appeared that any failure of migrate_pages() there
gets reported as -ENOMEM, which would end up as an OOM? But you know
mm/gup.c very much better than I do.

If it does loop around, it's not so bad in the PG_private_2 case, because
that's (nowadays always) a transient flag, much more like PG_writeback
than PG_private.

But whatever, yes, the move from testing lru to checking ref_count makes
no difference to that: the failure occurs in migration either way.

Hugh
On 08.09.25 21:57, Hugh Dickins wrote:
> [...]
> Loop around forever? That would be unfortunate (but I presume killable).
> But when I looked, it appeared that any failure of migrate_pages() there
> gets reported as -ENOMEM, which would end up as an OOM? But you know
> mm/gup.c very much better than I do.

Yes, like I expected, we just bail out. __gup_longterm_locked() will not
retry in that case.

It's interesting that any migration failure is treated as -ENOMEM, but
well, that's certainly material for a completely different discussion.

Thanks Hugh!

-- 
Cheers

David / dhildenb