mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as
a large folio is added: so collect_longterm_unpinnable_folios() just
wastes effort when calling lru_add_drain_all() on a large folio.

But although there is good reason not to batch up PMD-sized folios,
we might well benefit from batching a small number of low-order mTHPs
(though it is unclear how that "small number" limitation will be
implemented).

So check folio_may_be_cached() rather than !folio_test_large(), to
insulate those particular checks from future change. The name is
preferred to "folio_is_batchable" because large folios can well be put
on a batch: it's just the per-CPU LRU caches, drained much later, which
need care.

Marked for stable, to counter the increase in lru_add_drain_all()s
from "mm/gup: check ref_count instead of lru before migration".

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/swap.h | 10 ++++++++++
 mm/gup.c             |  5 +++--
 mm/mlock.c           |  6 +++---
 mm/swap.c            |  2 +-
 4 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe6ed2cc3fd..b49a61c32238 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -385,6 +385,16 @@ void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
 void mark_page_accessed(struct page *);
 void folio_mark_accessed(struct folio *);
 
+static inline bool folio_may_be_cached(struct folio *folio)
+{
+	/*
+	 * Holding PMD-sized folios in per-CPU LRU cache unbalances accounting.
+	 * Holding small numbers of low-order mTHP folios in per-CPU LRU cache
+	 * will be sensible, but nobody has implemented and tested that yet.
+	 */
+	return !folio_test_large(folio);
+}
+
 extern atomic_t lru_disable_count;
 
 static inline bool lru_cache_disabled(void)
diff --git a/mm/gup.c b/mm/gup.c
index 9f7c87f504a9..e70544c0f958 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2309,8 +2309,9 @@ static unsigned long collect_longterm_unpinnable_folios(
 			continue;
 		}
 
-		if (drain_allow && folio_ref_count(folio) !=
-				folio_expected_ref_count(folio) + 1) {
+		if (drain_allow && folio_may_be_cached(folio) &&
+		    folio_ref_count(folio) !=
+		    folio_expected_ref_count(folio) + 1) {
 			lru_add_drain_all();
 			drain_allow = false;
 		}
diff --git a/mm/mlock.c b/mm/mlock.c
index a1d93ad33c6d..427339dea380 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -255,7 +255,7 @@ void mlock_folio(struct folio *folio)
 
 	folio_get(folio);
 	if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
-	    folio_test_large(folio) || lru_cache_disabled())
+	    !folio_may_be_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
 }
@@ -278,7 +278,7 @@ void mlock_new_folio(struct folio *folio)
 
 	folio_get(folio);
 	if (!folio_batch_add(fbatch, mlock_new(folio)) ||
-	    folio_test_large(folio) || lru_cache_disabled())
+	    !folio_may_be_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
 }
@@ -299,7 +299,7 @@ void munlock_folio(struct folio *folio)
 	 */
 	folio_get(folio);
 	if (!folio_batch_add(fbatch, folio) ||
-	    folio_test_large(folio) || lru_cache_disabled())
+	    !folio_may_be_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
 }
diff --git a/mm/swap.c b/mm/swap.c
index 6ae2d5680574..17438fd1f51a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -192,7 +192,7 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
 	local_lock(&cpu_fbatches.lock);
 
 	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
-	    folio_test_large(folio) || lru_cache_disabled())
+	    !folio_may_be_cached(folio) || lru_cache_disabled())
 		folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
 
 	if (disable_irq)
--
2.51.0
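[A simplified, annotated view of the patched check in
collect_longterm_unpinnable_folios(), for readers outside mm/gup.c;
the surrounding loop is elided and the comments are explanatory
additions, not part of the patch:]

	if (drain_allow && folio_may_be_cached(folio) &&
	    folio_ref_count(folio) != folio_expected_ref_count(folio) + 1) {
		/*
		 * An unexpected extra reference may mean this folio is
		 * sitting in some CPU's LRU batch; but large folios are
		 * never left in those batches, so for them the expensive
		 * all-CPU drain cannot help, and is now skipped.
		 */
		lru_add_drain_all();
		drain_allow = false;	/* at most one drain per pass */
	}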
On 31.08.25 11:16, Hugh Dickins wrote:
> So check folio_may_be_cached() rather than !folio_test_large(), to
> insulate those particular checks from future change. The name is
> preferred to "folio_is_batchable" because large folios can well be put
> on a batch: it's just the per-CPU LRU caches, drained much later, which
> need care.
...
Two smaller things:
(1) We have other "folio_maybe_*" functions, so this one should probably
start with that as well.
(2) With things like fscache in mind, the function can be a bit
misleading.
So I wonder if (a) we should just add kerneldoc to document it clearly
(lru cache, mlock cache?) and (b) maybe call it
folio_may_be_lru_cached(). Not sure if we can find a better abstraction
for these two caches.
Thinking again, "maybe_cached" might be a bit misleading because it
implements a very very very bad heuristic for small folios.
Maybe it's more like "supports being cached".
folio_lru_caching_supported()
Something like that, maybe? (again, unclear about lru/mlock cache
abstraction)
--
Cheers
David / dhildenb
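[For concreteness, a sketch of what suggestion (a) plus (b) might look
like; the kerneldoc wording here is illustrative, not proposed text:]

/**
 * folio_may_be_lru_cached - may this folio be held in a per-CPU LRU batch?
 * @folio: the folio about to be added to an LRU or mlock folio_batch.
 *
 * Large folios are moved to the LRU immediately rather than left in the
 * per-CPU caches, so only small folios can linger until a drain.
 */
static inline bool folio_may_be_lru_cached(struct folio *folio)
{
	return !folio_test_large(folio);
}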
On Mon, 1 Sep 2025, David Hildenbrand wrote:
> On 31.08.25 11:16, Hugh Dickins wrote:
...
>
> Two smaller things:
>
> (1) We have other "folio_maybe_*" functions, so this one should probably
> start with that as well.
>
> (2) With things like fscache in mind, the function can be a bit
> misleading.
>
> So I wonder if (a) we should just add kerneldoc to document it clearly (lru
> cache, mlock cache?) and (b) maybe call it folio_may_be_lru_cached(). Not sure
> if we can find a better abstraction for these two caches.
>
> Thinking again, "maybe_cached" might be a bit misleading because it implements
> a very very very bad heuristic for small folios.
>
> Maybe it's more like "supports being cached".
>
> folio_lru_caching_supported()
folio_may_be_cached() -> folio_may_be_lru_cached(), yes, that's
very much better, thanks.
(Setting aside that I've never perceived those pagevecs/batches as a
"cache"; but lru_cache_disable() gave us that terminology, and we've
gone with the flow ever since. lru_add_drain() would be better named
lru_cache_drain() now; I've always got hung up on "adding a drain".)
"may be" rather than "maybe" was intentional: perhaps too subtle,
but to a native speaker it neatly expresses both the "we can do this"
and "might this have been done" cases.
kernel-doc? I don't think so, this is very much an mm-internal
matter, and I don't care for the way kernel-doc forces us towards
boilerplate ("@folio: The folio.") rather than helpful comment.
Hugh
On 08.09.25 13:19, Hugh Dickins wrote:
> On Mon, 1 Sep 2025, David Hildenbrand wrote:
...
>>
>> Two smaller things:
>>
>> (1) We have other "folio_maybe_*" functions, so this one should probably
>> start with that as well.
>>
>> (2) With things like fscache in mind, the function can be a bit
>> misleading.
>>
>> So I wonder if (a) we should just add kerneldoc to document it clearly (lru
>> cache, mlock cache?) and (b) maybe call it folio_may_be_lru_cached(). Not sure
>> if we can find a better abstraction for these two caches.
>>
>> Thinking again, "maybe_cached" might be a bit misleading because it implements
>> a very very very bad heuristic for small folios.
>>
>> Maybe it's more like "supports being cached".
>>
>> folio_lru_caching_supported()
>
> folio_may_be_cached() -> folio_may_be_lru_cached(), yes, that's
> very much better, thanks.
>
> (Setting aside that I've never perceived those pagevecs/batches as a
> "cache"; but lru_cache_disable() gave us that terminology, and we've
> gone with the flow ever since. lru_add_drain() would be better named
> lru_cache_drain() now; I've always got hung up on "adding a drain".)
Yeah, the terminology is not that intuitive :)
Not sure if using "batched" instead of "cached" might be clearer long-term?
>
> "may be" rather than "maybe" was intentional: perhaps too subtle,
> but to a native speaker it neatly expresses both the "we can do this"
> and "might this have been done" cases.
I wish we could find something that non-native speakers, too, can
immediately understand ;)
"may_get_lru_cached" / "may_get_lru_batched"?
/me could not even phrase it in German properly
>
> kernel-doc? I don't think so, this is very much an mm-internal
> matter, and I don't care for the way kernel-doc forces us towards
> boilerplate ("@folio: The folio.") rather than helpful comment.
So a comment that this is an internal helper might be nice. Or we just
move it straight to mm/internal.h ?
--
Cheers
David / dhildenb
On Mon, 8 Sep 2025, David Hildenbrand wrote:
> On 08.09.25 13:19, Hugh Dickins wrote:
...
> So a comment that this is an internal helper might be nice. Or we just move it
> straight to mm/internal.h ?
mm/internal.h, where we hide things (GFP_RECLAIM_MASK etc!) that belong
elsewhere? No thanks.
David, I think you're over-thinking this: I'm coming to regret not just
going with your excellent folio_test_large() optimization and letting
someone else mess around with the naming.
I'll post later with my current naming, but if it's the
Suggested-by that's making you uncomfortable, we can delete it.
Hugh
On 08.09.25 22:04, Hugh Dickins wrote:
> On Mon, 8 Sep 2025, David Hildenbrand wrote:
...
>> So a comment that this is an internal helper might be nice. Or we just move it
>> straight to mm/internal.h ?
>
> mm/internal.h, where we hide things (GFP_RECLAIM_MASK etc!) that belong
> elsewhere? No thanks.
Yes, or other mm-internal helpers that actually belong there. Like
folio_raw_mapping() or folio_nr_pages_mapped().
>
> David, I think you're over-thinking this: I'm coming to regret not just
> going with your excellent folio_test_large() optimization and letting
> someone else mess around with the naming.
Please don't feel like I'm pushing too hard here. If you feel the current
naming is fine, and that for some reason I don't completely understand
it should not be in mm/internal.h, all good.
--
Cheers
David / dhildenb