Will Deacon reports:-
When taking a longterm GUP pin via pin_user_pages(),
__gup_longterm_locked() tries to migrate target folios that should not
be longterm pinned, for example because they reside in a CMA region or
movable zone. This is done by first pinning all of the target folios
anyway, collecting all of the longterm-unpinnable target folios into a
list, dropping the pins that were just taken and finally handing the
list off to migrate_pages() for the actual migration.
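In sketch form, the flow looks like this (signatures and names are
approximations of the mm/gup.c code, simplified for illustration):

/*
 * Simplified sketch of the __gup_longterm_locked() migration flow
 * described above; details and signatures are approximate.
 */
static long check_and_migrate_movable_sketch(unsigned long nr_folios,
					     struct folio **folios)
{
	LIST_HEAD(movable_folio_list);
	unsigned long collected;

	/* Step 1: all target folios have already been pinned. */

	/* Step 2: gather the folios that must not stay longterm-pinned
	 * (CMA region, movable zone, ...) onto a private list. */
	collected = collect_longterm_unpinnable_folios(&movable_folio_list,
						       nr_folios, folios);
	if (!collected)
		return 0;	/* all folios are longterm-pinnable */

	/* Step 3: drop the pins just taken and hand the list to
	 * migrate_pages(); failure surfaces as -ENOMEM from
	 * pin_user_pages(). */
	return migrate_longterm_unpinnable_folios(&movable_folio_list,
						  nr_folios, folios);
}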
It is critically important that no unexpected references are held on the
folios being migrated, otherwise the migration will fail and
pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
relatively easy to observe migration failures when running pKVM (which
uses pin_user_pages() on crosvm's virtual address space to resolve
stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
this results in the VM terminating prematurely.
In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
mapping of guest memory prior to the pinning. Subsequently, when
pin_user_pages() walks the page-table, the relevant 'pte' is not
present and so the faulting logic allocates a new folio, mlocks it
with mlock_folio() and maps it in the page-table.
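The userspace side of that scenario looks roughly like this
(illustrative only, not crosvm's actual code):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map guest memory and mlock it with MLOCK_ONFAULT, so that each
 * page is mlocked (and batched, see below) at the time it first
 * faults in. */
static void *map_guest_memory(size_t size)
{
	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return NULL;
	if (mlock2(mem, size, MLOCK_ONFAULT) != 0) {
		munmap(mem, size);
		return NULL;
	}
	return mem;
}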
Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
batch by pagevec"), mlock/munlock operations on a folio (formerly page),
are deferred. For example, mlock_folio() takes an additional reference
on the target folio before placing it into a per-cpu 'folio_batch' for
later processing by mlock_folio_batch(), which drops the refcount once
the operation is complete. Processing of the batches is coupled with
the LRU batch logic and can be forcefully drained with
lru_add_drain_all() but as long as a folio remains unprocessed on the
batch, its refcount will be elevated.
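In outline (simplified from mm/mlock.c; the local locking and the
batch-entry flag encoding are omitted):

/*
 * The folio_get() here takes the "extra" reference that later
 * defeats migration until the per-CPU batch is drained.
 */
void mlock_folio(struct folio *folio)
{
	struct folio_batch *fbatch = this_cpu_ptr(&mlock_fbatch);

	folio_get(folio);	/* refcount stays elevated ... */
	if (!folio_batch_add(fbatch, folio))
		mlock_folio_batch(fbatch);	/* batch full: process now */
	/* ... otherwise the folio sits on the batch, unmigratable,
	 * until lru_add_drain_all() or the next flush processes it and
	 * mlock_folio_batch() drops the reference. */
}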
This deferred batching therefore interacts poorly with the pKVM pinning
scenario as we can find ourselves in a situation where the migration
code fails to migrate a folio due to the elevated refcount from the
pending mlock operation.
Hugh Dickins adds:-
!folio_test_lru() has never been a very reliable way to tell if an
lru_add_drain_all() is worth calling, to remove LRU cache references
to make the folio migratable: the LRU flag may be set even while the
folio is held with an extra reference in a per-CPU LRU cache.
5.18 commit 2fbb0c10d1e8 may have made it more unreliable. Then 6.11
commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding
to LRU batch") tried to make it reliable, by moving LRU flag clearing;
but missed the mlock/munlock batches, so still unreliable as reported.
And it turns out to be difficult to extend 33dfe9204f29's LRU flag
clearing to the mlock/munlock batches: if those operations are to
benefit from batching at all, the batching cannot be so easily
suppressed whenever the folio is !LRU.
Instead, switch to an expected ref_count check, which was more reliable
all along: some more false positives (unhelpful drains) than before, and
never a guarantee that the folio will prove migratable, but better.
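Concretely, the check (see the hunk below) amounts to the following,
wrapped here in a helper with a hypothetical name:

/*
 * folio_expected_ref_count() tallies the references that can be
 * accounted for (page table mappings, page/swap cache, PG_private),
 * and the GUP path holds one pin of its own, hence the "+ 1".
 * Anything beyond that is an unexpected reference -- e.g. a pending
 * per-CPU LRU or mlock batch -- so a drain is worth attempting.
 */
static bool gup_folio_ref_is_unexpected(struct folio *folio)
{
	return folio_ref_count(folio) !=
	       folio_expected_ref_count(folio) + 1;
}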
Note on PG_private_2: ceph and nfs are still using the deprecated
PG_private_2 flag, with the aid of netfs and filemap support functions.
Although it is consistently matched by an increment of folio ref_count,
folio_expected_ref_count() intentionally does not recognize it, and ceph
folio migration currently depends on that for PG_private_2 folios to be
rejected. New references to the deprecated flag are discouraged, so do
not add it into the collect_longterm_unpinnable_folios() calculation:
but longterm pinning of transiently PG_private_2 ceph and nfs folios
(an uncommon case) may invoke a redundant lru_add_drain_all(). And
this makes the backport to earlier releases easy: up to and including
6.12, btrfs also used PG_private_2, but without a ref_count increment.
Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm:
add folio_expected_ref_count() for reference count calculation").
Reported-by: Will Deacon <will@kernel.org>
Closes: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/
Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
mm/gup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..82aec6443c0a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
continue;
}
- if (!folio_test_lru(folio) && drain_allow) {
+ if (drain_allow && folio_ref_count(folio) !=
+ folio_expected_ref_count(folio) + 1) {
lru_add_drain_all();
drain_allow = false;
}
--
2.51.0
On Mon, Sep 08, 2025 at 03:15:03PM -0700, Hugh Dickins wrote:
> [...]

Acked-by: Kiryl Shutsemau <kas@kernel.org>

--
Kiryl Shutsemau / Kirill A. Shutemov
On 09.09.25 00:15, Hugh Dickins wrote:
> [...]

Acked-by: David Hildenbrand <david@redhat.com>

--
Cheers

David / dhildenb