mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as
a large folio is added: so collect_longterm_unpinnable_folios() just
wastes effort when calling lru_add_drain_all() on a large folio.
But although there is good reason not to batch up PMD-sized folios,
we might well benefit from batching a small number of low-order mTHPs
(though unclear how that "small number" limitation will be implemented).
So ask if folio_may_be_cached() rather than !folio_test_large(), to
insulate those particular checks from future change. Name preferred
to "folio_is_batchable" because large folios can well be put on a batch:
it's just the per-CPU LRU caches, drained much later, which need care.
Marked for stable, to counter the increase in lru_add_drain_all()s
from "mm/gup: check ref_count instead of lru before migration".
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/swap.h | 10 ++++++++++
 mm/gup.c             |  5 +++--
 mm/mlock.c           |  6 +++---
 mm/swap.c            |  2 +-
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe6ed2cc3fd..b49a61c32238 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -385,6 +385,16 @@ void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
 void mark_page_accessed(struct page *);
 void folio_mark_accessed(struct folio *);
 
+static inline bool folio_may_be_cached(struct folio *folio)
+{
+	/*
+	 * Holding PMD-sized folios in per-CPU LRU cache unbalances accounting.
+	 * Holding small numbers of low-order mTHP folios in per-CPU LRU cache
+	 * will be sensible, but nobody has implemented and tested that yet.
+	 */
+	return !folio_test_large(folio);
+}
+
 extern atomic_t lru_disable_count;
 
 static inline bool lru_cache_disabled(void)
diff --git a/mm/gup.c b/mm/gup.c
index 9f7c87f504a9..e70544c0f958 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2309,8 +2309,9 @@ static unsigned long collect_longterm_unpinnable_folios(
 			continue;
 		}
 
-		if (drain_allow && folio_ref_count(folio) !=
-		    folio_expected_ref_count(folio) + 1) {
+		if (drain_allow && folio_may_be_cached(folio) &&
+		    folio_ref_count(folio) !=
+		    folio_expected_ref_count(folio) + 1) {
 			lru_add_drain_all();
 			drain_allow = false;
 		}
diff --git a/mm/mlock.c b/mm/mlock.c
index a1d93ad33c6d..427339dea380 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -255,7 +255,7 @@ void mlock_folio(struct folio *folio)
 
 	folio_get(folio);
 	if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
-	    folio_test_large(folio) || lru_cache_disabled())
+	    !folio_may_be_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
 }
@@ -278,7 +278,7 @@ void mlock_new_folio(struct folio *folio)
 
 	folio_get(folio);
 	if (!folio_batch_add(fbatch, mlock_new(folio)) ||
-	    folio_test_large(folio) || lru_cache_disabled())
+	    !folio_may_be_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
 }
@@ -299,7 +299,7 @@ void munlock_folio(struct folio *folio)
 	 */
 	folio_get(folio);
 	if (!folio_batch_add(fbatch, folio) ||
-	    folio_test_large(folio) || lru_cache_disabled())
+	    !folio_may_be_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
 }
diff --git a/mm/swap.c b/mm/swap.c
index 6ae2d5680574..17438fd1f51a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -192,7 +192,7 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
 	local_lock(&cpu_fbatches.lock);
 
 	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
-	    folio_test_large(folio) || lru_cache_disabled())
+	    !folio_may_be_cached(folio) || lru_cache_disabled())
 		folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
 
 	if (disable_irq)
--
2.51.0
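
To make the reasoning behind the mm/gup.c hunk explicit:
folio_expected_ref_count() returns the references that mappings, the
page cache, the swap cache and so on are expected to hold, and the
"+ 1" is the reference GUP itself has taken; any surplus suggests the
folio may still be sitting in a per-CPU LRU batch. Since large folios
are never left in those batches, draining cannot help them. A sketch
of that logic as a hypothetical helper (worth_draining() is
illustrative only, not part of the patch):

	/* Illustrative only: when is lru_add_drain_all() worth calling? */
	static bool worth_draining(struct folio *folio)
	{
		/* Large folios are never left in per-CPU LRU caches. */
		if (!folio_may_be_cached(folio))
			return false;
		/* A surplus reference may be held by a per-CPU LRU cache. */
		return folio_ref_count(folio) !=
		       folio_expected_ref_count(folio) + 1;
	}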
On 31.08.25 11:16, Hugh Dickins wrote:
> mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as
> a large folio is added: so collect_longterm_unpinnable_folios() just
> wastes effort when calling lru_add_drain_all() on a large folio.
[...]

Two smaller things:

(1) We have other "folio_maybe_*" functions, so this one should likely
    start with that as well.

(2) With things like fscache in mind, the function can be a bit
    misleading.

So I wonder if (a) we should just add kerneldoc to document it clearly
(lru cache, mlock cache?) and (b) maybe call it folio_may_be_lru_cached().
Not sure if we can find a better abstraction for these two caches.

Thinking again, "maybe_cached" might be a bit misleading because it
implements a very very very bad heuristic for small folios.

Maybe it's more like "supports being cached".

	folio_lru_caching_supported()

Something like that, maybe? (again, unclear about lru/mlock cache
abstraction)

-- 
Cheers

David / dhildenb
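
For concreteness, a kerneldoc comment along the lines suggested above
might read as follows; the wording is purely illustrative and was never
part of the posted patch:

	/**
	 * folio_may_be_lru_cached - may this folio be held in a per-CPU
	 * LRU or mlock folio_batch?
	 * @folio: The folio.
	 *
	 * Return: false if adding the folio must drain the batch at once,
	 * true if it may linger in a per-CPU cache until a later drain.
	 */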
On Mon, 1 Sep 2025, David Hildenbrand wrote:
> On 31.08.25 11:16, Hugh Dickins wrote:
> [...]
>
> Two smaller things:
>
> (1) We have other "folio_maybe_*" functions, so this one should likely
>     start with that as well.
>
> (2) With things like fscache in mind, the function can be a bit
>     misleading.
>
> So I wonder if (a) we should just add kerneldoc to document it clearly
> (lru cache, mlock cache?) and (b) maybe call it folio_may_be_lru_cached().
> Not sure if we can find a better abstraction for these two caches.
>
> Thinking again, "maybe_cached" might be a bit misleading because it
> implements a very very very bad heuristic for small folios.
>
> Maybe it's more like "supports being cached".
>
> 	folio_lru_caching_supported()

folio_may_be_cached() -> folio_may_be_lru_cached(), yes, that's
very much better, thanks.

(Setting aside that I've never perceived those pagevecs/batches as a
"cache"; but lru_cache_disable() gave us that terminology, and we've
gone with the flow ever since. lru_add_drain() would be better named
lru_cache_drain() now; I've always got hung up on "adding a drain".)

"may be" rather than "maybe" was intentional: perhaps too subtle,
but to a native speaker it neatly expresses both the "we can do this"
and "might this have been done" cases.

kernel-doc? I don't think so: this is very much an mm-internal
matter, and I don't care for the way kernel-doc forces us towards
boilerplate ("@folio: The folio.") rather than helpful comment.

Hugh
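
With the rename agreed above, the helper would then read roughly as
below: a sketch only, kept without kerneldoc as Hugh prefers, and the
final committed form may differ:

	static inline bool folio_may_be_lru_cached(struct folio *folio)
	{
		/*
		 * Holding PMD-sized folios in per-CPU LRU cache unbalances
		 * accounting; batching small numbers of low-order mTHPs
		 * may be sensible, but nobody has implemented that yet.
		 */
		return !folio_test_large(folio);
	}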
On 08.09.25 13:19, Hugh Dickins wrote:
> On Mon, 1 Sep 2025, David Hildenbrand wrote:
> [...]
>
> folio_may_be_cached() -> folio_may_be_lru_cached(), yes, that's
> very much better, thanks.
>
> (Setting aside that I've never perceived those pagevecs/batches as a
> "cache"; but lru_cache_disable() gave us that terminology, and we've
> gone with the flow ever since. lru_add_drain() would be better named
> lru_cache_drain() now; I've always got hung up on "adding a drain".)

Yeah, the terminology is not that intuitive :)

Not sure if using "batched" instead of "cached" might be clearer
long-term?

> "may be" rather than "maybe" was intentional: perhaps too subtle,
> but to a native speaker it neatly expresses both the "we can do this"
> and "might this have been done" cases.

I wish we could find something that also non-native speakers can
immediately understand ;)

"may_get_lru_cached" / "may_get_lru_batched"?

/me could not even phrase it in German properly

> kernel-doc? I don't think so: this is very much an mm-internal
> matter, and I don't care for the way kernel-doc forces us towards
> boilerplate ("@folio: The folio.") rather than helpful comment.

So a comment that this is an internal helper might be nice. Or we just
move it straight to mm/internal.h?

-- 
Cheers

David / dhildenb
On Mon, 8 Sep 2025, David Hildenbrand wrote:
> On 08.09.25 13:19, Hugh Dickins wrote:
> [...]
>
> Yeah, the terminology is not that intuitive :)
>
> Not sure if using "batched" instead of "cached" might be clearer
> long-term?
>
> I wish we could find something that also non-native speakers can
> immediately understand ;)
>
> "may_get_lru_cached" / "may_get_lru_batched"?
>
> [...]
>
> So a comment that this is an internal helper might be nice. Or we just
> move it straight to mm/internal.h?

mm/internal.h, where we hide things (GFP_RECLAIM_MASK etc!) that belong
elsewhere? No thanks.

David, I think you're over-thinking this: I'm coming to regret not just
going with your excellent folio_test_large() optimization, and letting
someone else mess around with the naming.

I'll post later with my current naming, but if it's the Suggested-by
that's making you uncomfortable, we can delete it.

Hugh
On 08.09.25 22:04, Hugh Dickins wrote:
> On Mon, 8 Sep 2025, David Hildenbrand wrote:
> [...]
>
> mm/internal.h, where we hide things (GFP_RECLAIM_MASK etc!) that belong
> elsewhere? No thanks.

Yes, or other mm-internal helpers that actually belong there, like
folio_raw_mapping() or folio_nr_pages_mapped().

> David, I think you're over-thinking this: I'm coming to regret not just
> going with your excellent folio_test_large() optimization, and letting
> someone else mess around with the naming.

Please don't feel like I'm pushing too hard here. If you feel the
current naming is fine, and that for some reason I don't completely
understand it should not be in mm/internal.h, all good.

-- 
Cheers

David / dhildenb