Hi,

This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
wait for transient folio references to drain, instead of failing after a
fixed number of retries. The wait uses a one-second timeout. An
alternative approach would be to call wait_var_event_killable() with no
timeout, but that doesn't match as well with migration's "this will
probably work" API. In other words, a short sleeping wait is more
appropriate here.

When migrating pages for FOLL_LONGTERM pinning, migration can fail with
-EAGAIN if a folio has unexpected references. These references are often
transient, but the current retry loop gives up too quickly. This series
adds wait_var_event_timeout() at the retry points, paired with
wake_up_var() in folio_put() to wake the sleeper as soon as the refcount
drops.

The wake_up_var() calls in folio_put() are gated behind a static key,
disabled by default, so non-migration workloads pay zero cost.
migrate_pages() enables the key on entry when the reason is
MR_LONGTERM_PIN, and disables it on exit.

Toggling the key is not free. folio_put() is static inline, so every
compilation unit that calls it gets its own patch site (roughly 500 in
vmlinux, plus modules). On x86, jump label patching is batched (256
sites per batch, 3 IPI rounds per batch), so enabling the key costs
6-9 IPI broadcasts, a few hundred microseconds on a large machine.
That cost is paid twice per migrate_pages() call. Migration itself
spends several milliseconds per batch on LRU isolation, TLB flushes,
and page copies. Concurrent longterm-pin migrations after the first
just do an atomic_inc (no patching).

Matthew Brost offered to performance-test this series [1], as Intel has
tests that stress migration and good metrics to catch regressions.
[1] https://lore.kernel.org/all/aX+oUorOWPt1xbgw@lstrano-desk.jf.intel.com/

John Hubbard (2):
  mm: wake up folio refcount waiters on folio_put()
  mm/migrate: wait for folio refcount during longterm pin migration

 include/linux/mm.h |  8 ++++++++
 mm/migrate.c       | 30 ++++++++++++++++++++++++++++++
 mm/swap.c          | 10 +++++++++-
 3 files changed, 47 insertions(+), 1 deletion(-)

base-commit: 9a9c8ce300cd3859cc87b408ef552cd697cc2ab7
-- 
2.53.0
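For readers who want to see the shape of the mechanism outside the kernel, here is a minimal userspace sketch of the same idea: a waiter that sleeps with a timeout until a reference count drops, and a put path that only issues a wakeup when a waiter has switched it on. This is an analogy under stated assumptions, not the code from the series. The names struct obj, obj_put(), wait_for_refs() and wake_enabled are hypothetical; a pthread condition variable stands in for wait_var_event_timeout()/wake_up_var(), and a plain atomic counter stands in for the static key (so, unlike a real static key, the disabled path still costs one atomic load).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a folio: only the refcount matters here. */
struct obj {
	atomic_int refcount;
};

/* Stand-in for the static key: number of waiters that want wakeups. */
static atomic_int wake_enabled;

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER;

/* Drop one reference; wake sleepers only if some waiter asked for it. */
static void obj_put(struct obj *o)
{
	atomic_fetch_sub(&o->refcount, 1);
	if (atomic_load(&wake_enabled) > 0) {
		pthread_mutex_lock(&wait_lock);
		pthread_cond_broadcast(&wait_cond);
		pthread_mutex_unlock(&wait_lock);
	}
}

/*
 * Sleep until the refcount is at most 'expected' or until timeout_ms
 * has elapsed. Returns true if the refcount reached the target.
 */
static bool wait_for_refs(struct obj *o, int expected, long timeout_ms)
{
	struct timespec deadline;
	bool ok;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	/* Enable wakeups before checking, so a concurrent put is not missed. */
	atomic_fetch_add(&wake_enabled, 1);
	pthread_mutex_lock(&wait_lock);
	while (atomic_load(&o->refcount) > expected) {
		if (pthread_cond_timedwait(&wait_cond, &wait_lock,
					   &deadline) == ETIMEDOUT)
			break;
	}
	ok = atomic_load(&o->refcount) <= expected;
	pthread_mutex_unlock(&wait_lock);
	atomic_fetch_sub(&wake_enabled, 1);

	return ok;
}
```

A caller that holds one expected reference would call wait_for_refs(&o, 1, 1000) to mirror the one-second bound. If nobody drops the extra reference in time the call returns false, which corresponds to migration giving up as it does today.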
Hi, John,

John Hubbard <jhubbard@nvidia.com> writes:

> Hi,
>
> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
> wait for transient folio references to drain, instead of failing after a
> fixed number of retries. The wait uses a one-second timeout. An

Is the one-second timeout appropriate for all users? Do some users
prefer fail-fast behavior instead? If so, should we add another FOLL
flag to support a timed wait?

> alternative approach would be to call wait_var_event_killable() with no
> timeout, but that doesn't match as well with migration's "this will
> probably work" API. In other words, a short sleeping wait is more
> appropriate here.
>
> When migrating pages for FOLL_LONGTERM pinning, migration can fail with
> -EAGAIN if a folio has unexpected references. These references are often
> transient, but the current retry loop gives up too quickly. This series
> adds wait_var_event_timeout() at the retry points, paired with
> wake_up_var() in folio_put() to wake the sleeper as soon as the refcount
> drops.
>
> The wake_up_var() calls in folio_put() are gated behind a static key,
> disabled by default, so non-migration workloads pay zero cost.
> migrate_pages() enables the key on entry when the reason is
> MR_LONGTERM_PIN, and disables it on exit.
>
> Toggling the key is not free. folio_put() is static inline, so every
> compilation unit that calls it gets its own patch site (roughly 500 in
> vmlinux, plus modules). On x86, jump label patching is batched (256
> sites per batch, 3 IPI rounds per batch), so enabling the key costs
> 6-9 IPI broadcasts, a few hundred microseconds on a large machine.
> That cost is paid twice per migrate_pages() call. Migration itself
> spends several milliseconds per batch on LRU isolation, TLB flushes,
> and page copies. Concurrent longterm-pin migrations after the first
> just do an atomic_inc (no patching).
>
> Matthew Brost offered to performance-test this series [1], as Intel has
> tests that stress migration and good metrics to catch regressions.
>
> [1] https://lore.kernel.org/all/aX+oUorOWPt1xbgw@lstrano-desk.jf.intel.com/
>
> John Hubbard (2):
>   mm: wake up folio refcount waiters on folio_put()
>   mm/migrate: wait for folio refcount during longterm pin migration
>
>  include/linux/mm.h |  8 ++++++++
>  mm/migrate.c       | 30 ++++++++++++++++++++++++++++++
>  mm/swap.c          | 10 +++++++++-
>  3 files changed, 47 insertions(+), 1 deletion(-)
>
>
> base-commit: 9a9c8ce300cd3859cc87b408ef552cd697cc2ab7

---
Best Regards,
Huang, Ying
On 4/21/26 11:19, Huang, Ying wrote:
> Hi, John,
>
> John Hubbard <jhubbard@nvidia.com> writes:
>
>> Hi,
>>
>> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
>> wait for transient folio references to drain, instead of failing after a
>> fixed number of retries. The wait uses a one-second timeout. An
>
> Is the one-second timeout appropriate for all users? Do some users
> prefer fail-fast behavior instead? If so, should we add another FOLL
> flag to support a timed wait?

We should avoid a FOLL flag to affect that behavior. FOLL_LONGTERM
already implies that things could take a while.

So we have real examples where failing is even desirable? :)

-- 
Cheers,

David
"David Hildenbrand (Arm)" <david@kernel.org> writes:

> On 4/21/26 11:19, Huang, Ying wrote:
>> Hi, John,
>>
>> John Hubbard <jhubbard@nvidia.com> writes:
>>
>>> Hi,
>>>
>>> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
>>> wait for transient folio references to drain, instead of failing after a
>>> fixed number of retries. The wait uses a one-second timeout. An
>>
>> Is the one-second timeout appropriate for all users? Do some users
>> prefer fail-fast behavior instead? If so, should we add another FOLL
>> flag to support a timed wait?
>
> We should avoid a FOLL flag to affect that behavior. FOLL_LONGTERM
> already implies that things could take a while.

Yes. I am just not sure whether one second is OK for all users.

> So we have real examples where failing is even desirable? :)

Hi, John,

Could you do some research on this?

---
Best Regards,
Huang, Ying
On 4/21/26 6:46 PM, Huang, Ying wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
>> On 4/21/26 11:19, Huang, Ying wrote:
>>> Hi, John,
>>>
>>> John Hubbard <jhubbard@nvidia.com> writes:
>>>
>>>> Hi,
>>>>
>>>> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
>>>> wait for transient folio references to drain, instead of failing after a
>>>> fixed number of retries. The wait uses a one-second timeout. An
>>>
>>> Is the one-second timeout appropriate for all users? Do some users
>>> prefer fail-fast behavior instead? If so, should we add another FOLL
>>> flag to support a timed wait?
>>
>> We should avoid a FOLL flag to affect that behavior. FOLL_LONGTERM
>> already implies that things could take a while.
>
> Yes. I am just not sure whether one second is OK for all users.
>
>> So we have real examples where failing is even desirable? :)
>
> Hi, John,
>
> Could you do some research on this?
>

Yes, absolutely! Great set of questions from everyone, much appreciated.
I'm hoping to have some answers to post before LSF/MM; these are not too
hard to answer.

thanks,
-- 
John Hubbard
On 2026-04-10 at 13:23 +1000, John Hubbard <jhubbard@nvidia.com> wrote...
> Hi,
>
> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
> wait for transient folio references to drain, instead of failing after a
> fixed number of retries. The wait uses a one-second timeout. An
> alternative approach would be to call wait_var_event_killable() with no
> timeout, but that doesn't match as well with migration's "this will
> probably work" API. In other words, a short sleeping wait is more
> appropriate here.

This is much better than retrying $RANDOM times. It also seems it would
provide a nice definition of what a transient vs. longterm pin is. Any
pins longer than the migration timeout would be longterm.

> When migrating pages for FOLL_LONGTERM pinning, migration can fail with
> -EAGAIN if a folio has unexpected references. These references are often
> transient, but the current retry loop gives up too quickly. This series
> adds wait_var_event_timeout() at the retry points, paired with
> wake_up_var() in folio_put() to wake the sleeper as soon as the refcount
> drops.

Nothing wrong with the above, just a minor nit that I wanted to check my
understanding of. FOLL_LONGTERM causing migration implies this is in
ZONE_MOVABLE, and the aim of ZONE_MOVABLE is that memory is always
movable. That implies any unexpected page references should *always* be
transient, not often transient. At least that's my understanding
assuming drivers are behaving.

> The wake_up_var() calls in folio_put() are gated behind a static key,
> disabled by default, so non-migration workloads pay zero cost.
> migrate_pages() enables the key on entry when the reason is
> MR_LONGTERM_PIN, and disables it on exit.
>
> Toggling the key is not free. folio_put() is static inline, so every
> compilation unit that calls it gets its own patch site (roughly 500 in
> vmlinux, plus modules). On x86, jump label patching is batched (256
> sites per batch, 3 IPI rounds per batch), so enabling the key costs
> 6-9 IPI broadcasts, a few hundred microseconds on a large machine.
> That cost is paid twice per migrate_pages() call. Migration itself
> spends several milliseconds per batch on LRU isolation, TLB flushes,
> and page copies. Concurrent longterm-pin migrations after the first
> just do an atomic_inc (no patching).
>
> Matthew Brost offered to performance-test this series [1], as Intel has
> tests that stress migration and good metrics to catch regressions.
>
> [1] https://lore.kernel.org/all/aX+oUorOWPt1xbgw@lstrano-desk.jf.intel.com/
>
> John Hubbard (2):
>   mm: wake up folio refcount waiters on folio_put()
>   mm/migrate: wait for folio refcount during longterm pin migration
>
>  include/linux/mm.h |  8 ++++++++
>  mm/migrate.c       | 30 ++++++++++++++++++++++++++++++
>  mm/swap.c          | 10 +++++++++-
>  3 files changed, 47 insertions(+), 1 deletion(-)
>
>
> base-commit: 9a9c8ce300cd3859cc87b408ef552cd697cc2ab7
> --
> 2.53.0
>
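As a quick sanity check of the "6-9 IPI broadcasts" figure in the quoted cover letter, the arithmetic can be written out as below. The two constants are the ones stated there (256 sites per batch, 3 IPI rounds per batch). The helper name ipi_broadcasts() and the round-up model are assumptions made for illustration, not something taken from the kernel's jump label code.

```c
/* Batching figures as stated in the cover letter for x86. */
#define SITES_PER_BATCH      256u
#define IPI_ROUNDS_PER_BATCH 3u

/*
 * Hypothetical helper: estimate IPI broadcasts for one key toggle,
 * assuming patch sites are processed in full batches, rounded up.
 */
static unsigned int ipi_broadcasts(unsigned int patch_sites)
{
	unsigned int batches =
		(patch_sites + SITES_PER_BATCH - 1) / SITES_PER_BATCH;

	return batches * IPI_ROUNDS_PER_BATCH;
}
```

Under that model, roughly 500 sites in vmlinux need two batches, so 6 broadcasts. Once modules push the total past 512 sites a third batch is needed, giving 9. Since the key is toggled on entry and again on exit, a migrate_pages() call that is the only longterm-pin migration in flight would pay that twice.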