mm/migrate: wait for folio refcount during longterm pin migration

[RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration

Posted by John Hubbard 2 months, 1 week ago

Hi,

This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
wait for transient folio references to drain, instead of failing after a
fixed number of retries. The wait uses a one-second timeout. An
alternative approach would be to call wait_var_event_killable() with no
timeout, but that doesn't match as well with migration's "this will
probably work" API. In other words, a short sleeping wait is more
appropriate here.

When migrating pages for FOLL_LONGTERM pinning, migration can fail with
-EAGAIN if a folio has unexpected references. These references are often
transient, but the current retry loop gives up too quickly. This series
adds wait_var_event_timeout() at the retry points, paired with
wake_up_var() in folio_put() to wake the sleeper as soon as the refcount
drops.
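
For readers who want to see the shape of that pairing, here is a minimal
userspace sketch. It is not the code from the patches: it swaps in a pthread
condition variable for wait_var_event_timeout()/wake_up_var(), and every
name in it (model_folio, model_folio_put(), model_wait_for_refs()) is
hypothetical. The "expected" count stands in for whatever reference count
migration considers acceptable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for a folio: a refcount behind a lock. */
struct model_folio {
	pthread_mutex_t lock;
	pthread_cond_t  drained;	/* signalled on every put */
	int             refcount;
};

#define MODEL_FOLIO_INIT(refs) {			\
	.lock     = PTHREAD_MUTEX_INITIALIZER,		\
	.drained  = PTHREAD_COND_INITIALIZER,		\
	.refcount = (refs),				\
}

/* Drop one reference and wake any sleeper (the wake_up_var() role). */
static void model_folio_put(struct model_folio *f)
{
	pthread_mutex_lock(&f->lock);
	assert(f->refcount > 0);
	f->refcount--;
	pthread_cond_broadcast(&f->drained);
	pthread_mutex_unlock(&f->lock);
}

/*
 * Sleep until refcount <= expected or timeout_ms has elapsed (the
 * wait_var_event_timeout() role). Returns true if the count drained
 * in time, false if the wait timed out.
 */
static bool model_wait_for_refs(struct model_folio *f, int expected,
				long timeout_ms)
{
	struct timespec deadline;
	bool ok;
	int rc = 0;

	/* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec  += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&f->lock);
	while (f->refcount > expected && rc == 0)
		rc = pthread_cond_timedwait(&f->drained, &f->lock, &deadline);
	ok = f->refcount <= expected;
	pthread_mutex_unlock(&f->lock);
	return ok;
}
```

The point of the sketch is that the waiter returns as soon as the last
unexpected reference is dropped, rather than after a fixed number of polls,
and gives up only when the timeout expires.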

The wake_up_var() calls in folio_put() are gated behind a static key,
disabled by default, so non-migration workloads pay zero cost.
migrate_pages() enables the key on entry when the reason is
MR_LONGTERM_PIN, and disables it on exit.
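
As a rough illustration of the gating, here is a single-threaded userspace
model. It is only a sketch: a plain counter stands in for the static key
(real static keys patch code rather than test a variable), and all of the
model_* names are hypothetical. Counting users of the key is an assumption
of the sketch, chosen so that nested enables are easy to see.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from the series. */
enum model_reason { MODEL_MR_OTHER, MODEL_MR_LONGTERM_PIN };

static atomic_int model_key_users;	/* 0 means the key is disabled */
static int model_patch_events;		/* times code patching would run */
static int model_wakeups;		/* times the put path issued a wake-up */

/* Put path: the wake-up is skipped entirely while the key is disabled. */
static void model_put(void)
{
	if (atomic_load(&model_key_users) > 0)
		model_wakeups++;
}

/* Migration entry: only longterm-pin callers take a reference on the key. */
static void model_migrate_enter(enum model_reason reason)
{
	if (reason != MODEL_MR_LONGTERM_PIN)
		return;
	if (atomic_fetch_add(&model_key_users, 1) == 0)
		model_patch_events++;	/* first user: 0 -> 1 enables the key */
}

/* Migration exit: the last longterm-pin caller disables the key again. */
static void model_migrate_exit(enum model_reason reason)
{
	if (reason != MODEL_MR_LONGTERM_PIN)
		return;
	int prev = atomic_fetch_sub(&model_key_users, 1);
	assert(prev > 0);
	if (prev == 1)
		model_patch_events++;	/* last user: 1 -> 0 disables the key */
}
```

In this model, model_put() does nothing extra unless a longterm-pin
migration is in flight, and a second overlapping migration bumps the counter
without triggering another patch event.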

Toggling the key is not free. folio_put() is static inline, so every
compilation unit that calls it gets its own patch site (roughly 500 in
vmlinux, plus modules). On x86, jump label patching is batched (256
sites per batch, 3 IPI rounds per batch), so enabling the key costs
6-9 IPI broadcasts, a few hundred microseconds on a large machine.
That cost is paid twice per migrate_pages() call: once to enable the key
and once to disable it. It is small next to migration itself, which
spends several milliseconds per batch on LRU isolation, TLB flushes, and
page copies. Concurrent longterm-pin migrations after the first just do
an atomic_inc (no patching).
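
The 6-9 figure follows from the batching numbers above. A small helper makes
the arithmetic explicit; model_ipi_rounds() is a hypothetical name, and the
site counts passed to it are the rough estimates from this letter, not
measurements.

```c
#include <assert.h>

/*
 * IPI broadcast rounds needed to patch `sites` jump-label sites when
 * patching is done in batches of `sites_per_batch`, with
 * `rounds_per_batch` IPI rounds for each batch.
 */
static unsigned int model_ipi_rounds(unsigned int sites,
				     unsigned int sites_per_batch,
				     unsigned int rounds_per_batch)
{
	assert(sites_per_batch > 0);

	/* Round up: a partially filled batch still costs a full set of rounds. */
	unsigned int batches = (sites + sites_per_batch - 1) / sites_per_batch;

	return batches * rounds_per_batch;
}
```

With about 500 sites, model_ipi_rounds(500, 256, 3) gives 2 batches and
6 rounds. Once modules push the total past 512 sites, a third batch is
needed and the count becomes 9, which stays the answer up to 768 sites.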

Matthew Brost offered to performance-test this series [1], as Intel has
tests that stress migration and good metrics to catch regressions.

[1] https://lore.kernel.org/all/aX+oUorOWPt1xbgw@lstrano-desk.jf.intel.com/

John Hubbard (2):
  mm: wake up folio refcount waiters on folio_put()
  mm/migrate: wait for folio refcount during longterm pin migration

 include/linux/mm.h |  8 ++++++++
 mm/migrate.c       | 30 ++++++++++++++++++++++++++++++
 mm/swap.c          | 10 +++++++++-
 3 files changed, 47 insertions(+), 1 deletion(-)


base-commit: 9a9c8ce300cd3859cc87b408ef552cd697cc2ab7
-- 
2.53.0

Re: [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration

Posted by Huang, Ying 2 months ago

Hi, John,

John Hubbard <jhubbard@nvidia.com> writes:

> Hi,
>
> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
> wait for transient folio references to drain, instead of failing after a
> fixed number of retries. The wait uses a one-second timeout. An

Is the one-second timeout appropriate for all users?  Do some users
prefer fail-fast behavior instead?  If so, should we add another FOLL
flag to support a timed wait?

---
Best Regards,
Huang, Ying

Re: [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration

Posted by David Hildenbrand (Arm) 2 months ago

On 4/21/26 11:19, Huang, Ying wrote:
> Hi, John,
> 
> John Hubbard <jhubbard@nvidia.com> writes:
> 
>> Hi,
>>
>> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
>> wait for transient folio references to drain, instead of failing after a
>> fixed number of retries. The wait uses a one-second timeout. An
> 
> Is the one-second timeout appropriate for all users?  Do some users
> prefer fail-fast behavior instead?  If so, should we add another FOLL
> flag to support a timed wait?

We should avoid a FOLL flag to affect that behavior. FOLL_LONGTERM
already implies that things could take a while.

Do we have real examples where failing is even desirable? :)

-- 
Cheers,

David

Re: [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration

Posted by Huang, Ying 1 month, 4 weeks ago

"David Hildenbrand (Arm)" <david@kernel.org> writes:

> On 4/21/26 11:19, Huang, Ying wrote:
>> Hi, John,
>> 
>> John Hubbard <jhubbard@nvidia.com> writes:
>> 
>>> Hi,
>>>
>>> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
>>> wait for transient folio references to drain, instead of failing after a
>>> fixed number of retries. The wait uses a one-second timeout. An
>> 
>> Is the one-second timeout appropriate for all users?  Do some users
>> prefer fail-fast behavior instead?  If so, should we add another FOLL
>> flag to support a timed wait?
>
> We should avoid a FOLL flag to affect that behavior. FOLL_LONGTERM
> already implies that things could take a while.

Yes.  I am just not sure whether one second is OK for all users.

> Do we have real examples where failing is even desirable? :)

Hi, John,

Could you do some research on this?

---
Best Regards,
Huang, Ying

Re: [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration

Posted by John Hubbard 1 month, 4 weeks ago

On 4/21/26 6:46 PM, Huang, Ying wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
>> On 4/21/26 11:19, Huang, Ying wrote:
>>> Hi, John,
>>>
>>> John Hubbard <jhubbard@nvidia.com> writes:
>>>
>>>> Hi,
>>>>
>>>> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
>>>> wait for transient folio references to drain, instead of failing after a
>>>> fixed number of retries. The wait uses a one-second timeout. An
>>>
>>> Is the one-second timeout appropriate for all users?  Do some users
>>> prefer fail-fast behavior instead?  If so, should we add another FOLL
>>> flag to support a timed wait?
>>
>> We should avoid a FOLL flag to affect that behavior. FOLL_LONGTERM
>> already implies that things could take a while.
> 
> Yes.  I am just not sure whether one second is OK for all users.
> 
>> Do we have real examples where failing is even desirable? :)
> 
> Hi, John,
> 
> Could you do some research on this?
> 

Yes, absolutely! Great set of questions from everyone, much appreciated.

I'm hoping to have some answers to post before LSF/MM; these are not
too hard to answer.

thanks,
-- 
John Hubbard

Re: [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration

Posted by Alistair Popple 2 months ago

On 2026-04-10 at 13:23 +1000, John Hubbard <jhubbard@nvidia.com> wrote...
> Hi,
> 
> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
> wait for transient folio references to drain, instead of failing after a
> fixed number of retries. The wait uses a one-second timeout. An
> alternative approach would be to call wait_var_event_killable() with no
> timeout, but that doesn't match as well with migration's "this will
> probably work" API. In other words, a short sleeping wait is more
> appropriate here.

This is much better than retrying $RANDOM times. It also seems to provide a
nice definition of transient vs. longterm pins: any pin held longer than the
migration timeout would be longterm.

> When migrating pages for FOLL_LONGTERM pinning, migration can fail with
> -EAGAIN if a folio has unexpected references. These references are often
> transient, but the current retry loop gives up too quickly. This series
> adds wait_var_event_timeout() at the retry points, paired with
> wake_up_var() in folio_put() to wake the sleeper as soon as the refcount
> drops.

Nothing wrong with the above, just a minor nit to check my understanding.
FOLL_LONGTERM causing migration implies the folio is in ZONE_MOVABLE, and the
aim of ZONE_MOVABLE is that memory is always movable. That implies any
unexpected page references should *always* be transient, not just often
transient. At least that's my understanding, assuming drivers are behaving.
