[PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2

Posted by Hugh Dickins 1 month ago
6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
which (like PG_private, but not in addition to PG_private) counts for
1 more reference: it needs to be using folio_has_private() in place of
folio_test_private().

But this went wrong earlier: folio_expected_ref_count() was based on
(and replaced) mm/migrate.c's folio_expected_refs(), which has been
using folio_test_private() since the 6.0 conversion to folios: before
that, expected_page_refs() was correctly using page_has_private().

Just a few filesystems are still using PG_private_2 a.k.a. PG_fscache.
Potentially, this fix re-enables page migration on their folios; but
it would not be surprising to learn that in practice those folios are
not migratable for other reasons.

Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
Fixes: 108ca8358139 ("mm/migrate: Convert expected_page_refs() to folio_expected_refs()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/mm.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..ee8e535eadac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2180,8 +2180,8 @@ static inline int folio_expected_ref_count(const struct folio *folio)
 	} else {
 		/* One reference per page from the pagecache. */
 		ref_count += !!folio->mapping << order;
-		/* One reference from PG_private. */
-		ref_count += folio_test_private(folio);
+		/* One reference from PG_private or PG_private_2. */
+		ref_count += folio_has_private(folio);
 	}
 
 	/* One reference per page table mapping. */
-- 
2.51.0
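
For context, the distinction the patch turns on: folio_test_private()
tests only the PG_private bit, while folio_has_private() tests either
private bit. A minimal sketch, simplified from include/linux/page-flags.h
(the real folio helpers are macro-generated, so this is the shape, not
the literal source):

/* Simplified sketch after include/linux/page-flags.h. */
#define PAGE_FLAGS_PRIVATE				\
	(1UL << PG_private | 1UL << PG_private_2)

/* True only when PG_private itself is set. */
static inline bool folio_test_private(const struct folio *folio)
{
	return test_bit(PG_private, folio_flags(folio, 0));
}

/* True when PG_private or PG_private_2 (or both) is set.  As a single
 * boolean over an OR-mask, it can only ever add +1 to the expected
 * count, never +2, even when both bits are set. */
static inline bool folio_has_private(const struct folio *folio)
{
	return !!(folio->flags & PAGE_FLAGS_PRIVATE);
}
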
Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
Posted by Matthew Wilcox 1 month ago
On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
> which (like PG_private, but not in addition to PG_private) counts for
> 1 more reference: it needs to be using folio_has_private() in place of
> folio_test_private().

No, it doesn't.  I know it used to, but no filesystem was actually doing
that.  So I changed mm to match how filesystems actually worked.
I'm not sure if there's still documentation lying around that gets
this wrong or if you're remembering how things used to be documented,
but it's never how any filesystem has ever worked.

We're achingly close to getting rid of PG_private_2.  I think it's just
ceph and nfs that still use it.
Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
Posted by Hugh Dickins 1 month ago
On Mon, 1 Sep 2025, Matthew Wilcox wrote:
> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
> > 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
> > which (like PG_private, but not in addition to PG_private) counts for
> > 1 more reference: it needs to be using folio_has_private() in place of
> > folio_test_private().
> 
> No, it doesn't.  I know it used to, but no filesystem was actually doing
> that.  So I changed mm to match how filesystems actually worked.
> I'm not sure if there's still documentation lying around that gets
> this wrong or if you're remembering how things used to be documented,
> but it's never how any filesystem has ever worked.
> 
> We're achingly close to getting rid of PG_private_2.  I think it's just
> ceph and nfs that still use it.

I knew you were trying to get rid of it (hurrah! thank you), so when I
tried porting my lru_add_drainage to 6.12 I was careful to check whether
folio_expected_ref_count() would need to add it to the accounting there:
apparently yes; but then I was surprised to find that it's still present
in 6.17-rc: I'd assumed it gone long ago.

I didn't try to read the filesystems (which could easily have been
inconsistent about it) to understand: what convinced me amidst all
the confusion was this comment and code in mm/filemap.c:

/**
 * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
 * @folio: The folio.
 *
 * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
 * it.  The folio reference held for PG_private_2 being set is released.
 *
 * This is, for example, used when a netfs folio is being written to a local
 * disk cache, thereby allowing writes to the cache for the same folio to be
 * serialised.
 */
void folio_end_private_2(struct folio *folio)
{
	VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
	clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
	folio_wake_bit(folio, PG_private_2);
	folio_put(folio);
}
EXPORT_SYMBOL(folio_end_private_2);

That seems to be clear that PG_private_2 is matched by a folio reference,
but perhaps you can explain it away - worth changing the comment if so.

I was also anxious to work out whether PG_private with PG_private_2
would mean +1 or +2: I don't think I found any decisive statement,
but traditional use of page_has_private() implied +1; and I expect
there's no filesystem which actually could have both on the same folio.

Thanks,
Hugh
Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
Posted by David Hildenbrand 1 month ago
On 01.09.25 03:17, Hugh Dickins wrote:
> On Mon, 1 Sep 2025, Matthew Wilcox wrote:
>> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
>>> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
>>> which (like PG_private, but not in addition to PG_private) counts for
>>> 1 more reference: it needs to be using folio_has_private() in place of
>>> folio_test_private().
>>
>> No, it doesn't.  I know it used to, but no filesystem was actually doing
>> that.  So I changed mm to match how filesystems actually worked.
>> I'm not sure if there's still documentation lying around that gets
>> this wrong or if you're remembering how things used to be documented,
>> but it's never how any filesystem has ever worked.
>>
>> We're achingly close to getting rid of PG_private_2.  I think it's just
>> ceph and nfs that still use it.
> 
> I knew you were trying to get rid of it (hurrah! thank you), so when I
> tried porting my lru_add_drainage to 6.12 I was careful to check whether
> folio_expected_ref_count() would need to add it to the accounting there:
> apparently yes; but then I was surprised to find that it's still present
> in 6.17-rc: I'd assumed it gone long ago.
> 
> I didn't try to read the filesystems (which could easily have been
> inconsistent about it) to understand: what convinced me amidst all
> the confusion was this comment and code in mm/filemap.c:
> 
> /**
>   * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
>   * @folio: The folio.
>   *
>   * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
>   * it.  The folio reference held for PG_private_2 being set is released.
>   *
>   * This is, for example, used when a netfs folio is being written to a local
>   * disk cache, thereby allowing writes to the cache for the same folio to be
>   * serialised.
>   */
> void folio_end_private_2(struct folio *folio)
> {
> 	VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
> 	clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
> 	folio_wake_bit(folio, PG_private_2);
> 	folio_put(folio);
> }
> EXPORT_SYMBOL(folio_end_private_2);
> 
> That seems to be clear that PG_private_2 is matched by a folio reference,
> but perhaps you can explain it away - worth changing the comment if so.
> 
> I was also anxious to work out whether PG_private with PG_private_2
> would mean +1 or +2: I don't think I found any decisive statement,
> but traditional use of page_has_private() implied +1; and I expect
> there's no filesystem which actually could have both on the same folio.

I think it's "+1", like we used to have.

I was seriously confused when discovering (iow, concerned about false 
positives):

	PG_fscache = PG_private_2,

But in the end PG_fscache is only used in comments and e.g., 
__fscache_clear_page_bits() calls folio_end_private_2(). So both are 
really just aliases.

[Either PG_fscache should be dropped and referred to as PG_private_2, or 
PG_private_2 should be dropped and PG_fscache used instead. It's even 
inconsistently used in that fscache file.

Or both should be dropped, of course, once we can actually get rid of it 
...]

So PG_private_2 should not be used for any other purpose.

folio_start_private_2() / folio_end_private_2() indeed pair the flag 
with a reference. There are no other callers that would set/clear the 
flag without involving a reference.
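
The setter side shows the same pairing; roughly as in
include/linux/netfs.h (deprecated interface):

/* Roughly as in include/linux/netfs.h: setting the flag takes the
 * reference that folio_end_private_2() will later drop. */
static inline void folio_start_private_2(struct folio *folio)
{
	VM_BUG_ON_FOLIO(folio_test_private_2(folio), folio);
	folio_get(folio);
	folio_set_private_2(folio);
}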

The usage of private_2 is declared deprecated all over the place. So the 
question is if we really still care.

The ceph usage is guarded by CONFIG_CEPH_FSCACHE, the NFS one by 
NFS_FSCACHE; nothing really seems to prevent either from getting 
configured in easily.

Now, one problem would be if migration / splitting / ... code where we 
use folio_expected_ref_count() cannot deal with that additional 
reference properly, in which case this patch would indeed cause harm.

If all folio_expected_ref_count() callers can deal with updating that 
reference, all good.

nfs_migrate_folio(), for example, has folio_test_private_2() handling in 
there (it just waits until the flag is gone). ceph handles it during 
ceph_writepages_start(), but uses the ordinary filemap_migrate_folio().
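
For reference, that nfs handling is roughly the following, condensed
from fs/nfs/write.c:

int nfs_migrate_folio(struct address_space *mapping, struct folio *dst,
		struct folio *src, enum migrate_mode mode)
{
	/* An in-progress read or write request: don't try to migrate. */
	if (folio_test_private(src))
		return -EBUSY;

	/* PG_private_2/PG_fscache: wait for the cache write to finish,
	 * or bail out if we may not sleep. */
	if (folio_test_private_2(src)) { /* [DEPRECATED] */
		if (mode == MIGRATE_ASYNC)
			return -EBUSY;
		folio_wait_private_2(src);
	}

	return migrate_folio(mapping, dst, src, mode);
}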

Long story short: this patch is problematic if one 
folio_expected_ref_count() user is not aware of how to handle that 
additional reference.

-- 
Cheers

David / dhildenb
Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
Posted by David Hildenbrand 1 month ago
On 01.09.25 09:52, David Hildenbrand wrote:
> On 01.09.25 03:17, Hugh Dickins wrote:
>> On Mon, 1 Sep 2025, Matthew Wilcox wrote:
>>> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
>>>> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
>>>> which (like PG_private, but not in addition to PG_private) counts for
>>>> 1 more reference: it needs to be using folio_has_private() in place of
>>>> folio_test_private().
>>>
>>> No, it doesn't.  I know it used to, but no filesystem was actually doing
>>> that.  So I changed mm to match how filesystems actually worked.
>>> I'm not sure if there's still documentation lying around that gets
>>> this wrong or if you're remembering how things used to be documented,
>>> but it's never how any filesystem has ever worked.
>>>
>>> We're achingly close to getting rid of PG_private_2.  I think it's just
>>> ceph and nfs that still use it.
>>
>> I knew you were trying to get rid of it (hurrah! thank you), so when I
>> tried porting my lru_add_drainage to 6.12 I was careful to check whether
>> folio_expected_ref_count() would need to add it to the accounting there:
>> apparently yes; but then I was surprised to find that it's still present
>> in 6.17-rc: I'd assumed it gone long ago.
>>
>> I didn't try to read the filesystems (which could easily have been
>> inconsistent about it) to understand: what convinced me amidst all
>> the confusion was this comment and code in mm/filemap.c:
>>
>> /**
>>    * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
>>    * @folio: The folio.
>>    *
>>    * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
>>    * it.  The folio reference held for PG_private_2 being set is released.
>>    *
>>    * This is, for example, used when a netfs folio is being written to a local
>>    * disk cache, thereby allowing writes to the cache for the same folio to be
>>    * serialised.
>>    */
>> void folio_end_private_2(struct folio *folio)
>> {
>> 	VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
>> 	clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
>> 	folio_wake_bit(folio, PG_private_2);
>> 	folio_put(folio);
>> }
>> EXPORT_SYMBOL(folio_end_private_2);
>>
>> That seems to be clear that PG_private_2 is matched by a folio reference,
>> but perhaps you can explain it away - worth changing the comment if so.
>>
>> I was also anxious to work out whether PG_private with PG_private_2
>> would mean +1 or +2: I don't think I found any decisive statement,
>> but traditional use of page_has_private() implied +1; and I expect
>> there's no filesystem which actually could have both on the same folio.
> 
> I think it's "+1", like we used to have.
> 
> I was seriously confused when discovering (iow, concerned about false
> positives):
> 
> 	PG_fscache = PG_private_2,
> 
> But in the end PG_fscache is only used in comments and e.g.,
> __fscache_clear_page_bits() calls folio_end_private_2(). So both are
> really just aliases.
> 
> [Either PG_fscache should be dropped and referred to as PG_private_2, or
> PG_private_2 should be dropped and PG_fscache used instead. It's even
> inconsistently used in that fscache file.
> 
> Or both should be dropped, of course, once we can actually get rid of it
> ...]
> 
> So PG_private_2 should not be used for any other purpose.
> 
> folio_start_private_2() / folio_end_private_2() indeed pair the flag
> with a reference. There are no other callers that would set/clear the
> flag without involving a reference.
> 
> The usage of private_2 is declared deprecated all over the place. So the
> question is if we really still care.
> 
> The ceph usage is guarded by CONFIG_CEPH_FSCACHE, the NFS one by
> NFS_FSCACHE; nothing really seems to prevent either from getting
> configured in easily.
> 
> Now, one problem would be if migration / splitting / ... code where we
> use folio_expected_ref_count() cannot deal with that additional
> reference properly, in which case this patch would indeed cause harm.
> 
> If all folio_expected_ref_count() callers can deal with updating that
> reference, all good.
> 
> nfs_migrate_folio(), for example, has folio_test_private_2() handling in
> there (it just waits until the flag is gone). ceph handles it during
> ceph_writepages_start(), but uses the ordinary filemap_migrate_folio().
> 
> Long story short: this patch is problematic if one
> folio_expected_ref_count() user is not aware of how to handle that
> additional reference.
> 

Case in point, I just stumbled over

commit 682a71a1b6b363bff71440f4eca6498f827a839d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:46 2022 +0100

     migrate: convert __unmap_and_move() to use folios

and

commit 8faa8ef5dd11abe119ad0c8ccd39f2064ca7ed0e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 09:34:36 2022 -0400

     mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()
     
     Use a folio throughout.  migrate_page() will be converted to
     migrate_folio() later.


where we converted from page_has_private() to folio_test_private(). Maybe
that's all sane, but it raises the question of whether migration (and maybe
splitting) as a whole is now incompatible with PG_private_2.

-- 
Cheers

David / dhildenb
Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
Posted by Hugh Dickins 3 weeks, 4 days ago
On Mon, 1 Sep 2025, David Hildenbrand wrote:
> On 01.09.25 09:52, David Hildenbrand wrote:
> > On 01.09.25 03:17, Hugh Dickins wrote:
> >> On Mon, 1 Sep 2025, Matthew Wilcox wrote:
> >>> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
> >>>> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
> >>>> which (like PG_private, but not in addition to PG_private) counts for
> >>>> 1 more reference: it needs to be using folio_has_private() in place of
> >>>> folio_test_private().
> >>>
> >>> No, it doesn't.  I know it used to, but no filesystem was actually doing
> >>> that.  So I changed mm to match how filesystems actually worked.

I think Matthew may be remembering how he wanted it to behave (? but he
wanted it to go away completely) rather than how it ended up behaving:
we've both found that PG_private_2 always goes with refcount increment.

(Always? Well, until 6.13, btrfs used PG_private_2 without any such
increment: that's gone, so now it's consistently with refcount increment.)

Confusingly, David Howells removed the deprecated use of PG_private_2
and then later reverted the removal: I've not looked up which releases
those came and went in, but the revert reached stable trees too, so the
story is the same everywhere; and maybe some of Matthew's mods
interleaved between the removal and the revert.

> >>> I'm not sure if there's still documentation lying around that gets
> >>> this wrong or if you're remembering how things used to be documented,
> >>> but it's never how any filesystem has ever worked.

Not how btrfs used to work, but it is how ceph and nfs work.

> >>>
> >>> We're achingly close to getting rid of PG_private_2.  I think it's just
> >>> ceph and nfs that still use it.
> >>
> >> I knew you were trying to get rid of it (hurrah! thank you), so when I
> >> tried porting my lru_add_drainage to 6.12 I was careful to check whether
> >> folio_expected_ref_count() would need to add it to the accounting there:
> >> apparently yes; but then I was surprised to find that it's still present
> >> in 6.17-rc: I'd assumed it gone long ago.
> >>
> >> I didn't try to read the filesystems (which could easily have been
> >> inconsistent about it) to understand: what convinced me amidst all
> >> the confusion was this comment and code in mm/filemap.c:
> >>
> >> /**
> >>    * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
> >>    * @folio: The folio.
> >>    *
> >>    * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting
> >>    for
> >>    * it.  The folio reference held for PG_private_2 being set is released.
> >>    *
> >>    * This is, for example, used when a netfs folio is being written to a
> >>    local
> >>    * disk cache, thereby allowing writes to the cache for the same folio to
> >>    be
> >>    * serialised.
> >>    */
> >> void folio_end_private_2(struct folio *folio)
> >> {
> >>  VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
> >>  clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
> >>  folio_wake_bit(folio, PG_private_2);
> >>  folio_put(folio);
> >> }
> >> EXPORT_SYMBOL(folio_end_private_2);
> >>
> >> That seems to be clear that PG_private_2 is matched by a folio reference,
> >> but perhaps you can explain it away - worth changing the comment if so.
> >>
> >> I was also anxious to work out whether PG_private with PG_private_2
> >> would mean +1 or +2: I don't think I found any decisive statement,
> >> but traditional use of page_has_private() implied +1; and I expect
> >> there's no filesystem which actually could have both on the same folio.
> > 
> > I think it's "+1", like we used to have.

I've given up worrying about that.  I'm inclined to think it's +2,
since there's no test_private when incrementing and decrementing
for private_2; but I don't need to care any more.

> > 
> > I was seriously confused when discovering (iow, concerned about false
> > positives):
> > 
> >  PG_fscache = PG_private_2,
> > 
> > But in the end PG_fscache is only used in comments and e.g.,
> > __fscache_clear_page_bits() calls folio_end_private_2(). So both are
> > really just aliases.
> > 
> > [Either PG_fscache should be dropped and referred to as PG_private_2, or
> > PG_private_2 should be dropped and PG_fscache used instead. It's even
> > inconsistently used in that fscache file.
> > 
> > Or both should be dropped, of course, once we can actually get rid of it
> > ...]
> > 
> > So PG_private_2 should not be used for any other purpose.

Yes, ghastly the hiding of one behind the other; that, and the
PageFlags versus folio_flags, made it all tiresome to track down.

I have considered adding PG_Spanish_Inquisition = PG_private_2,
since folio_expected_ref_count() ignoring PG_private_2 implies that
no-one expects the PG_private_2.

> > 
> > folio_start_private_2() / folio_end_private_2() indeed pair the flag
> > with a reference. There are no other callers that would set/clear the
> > flag without involving a reference.
> > 
> > The usage of private_2 is declared deprecated all over the place. So the
> > question is if we really still care.
> > 
> > The ceph usage is guarded by CONFIG_CEPH_FSCACHE, the NFS one by
> > NFS_FSCACHE; nothing really seems to prevent either from getting
> > configured in easily.
> > 
> > Now, one problem would be if migration / splitting / ... code where we
> > use folio_expected_ref_count() cannot deal with that additional
> > reference properly, in which case this patch would indeed cause harm.

Yes, that appears to be why Matthew said NAK and "dangerously wrong".

So far as I could tell, there is no problem with nfs, it has, and has
all along had, the appropriate release_folio and migrate_folio methods.

ceph used to have what's needed, but 6.0's changes from page_has_private()
to folio_test_private() (the change from "has" either bit to "test" just
the one bit really should have been highlighted) broke the migration of
ceph's PG_private_2 folios.

(I think it may have got re-enabled in intervening releases: David
Howells reinstated folio_has_private() inside fallback_migrate_folio()'s
filemap_release_folio(), which may have been enough to get ceph's
PG_private_2s migratable again; but then 6.15's ceph .migrate_folio =
filemap_migrate_folio will have broken it again.)

Folio migration does not and never has copied over PG_private_2 from
src to dst; so my 1/7 patch would have permitted migration of a ceph
PG_private_2 src folio to a dst folio left with refcount 1 more than
it should be (plus whatever the consequences of migrating such a
folio which should have waited for the flag to be cleared first).

Earlier, I did intend to add protection against PG_private_2 into
folio_migrate_mapping() and/or whatever else needs it in mm/migrate.c,
as part of the 1/7 patch; and later submit a ceph patch to give it
back the release_folio wait on PG_private_2 it wants.

But (a) I ran out of steam, and (b) I couldn't test it or advise
ceph folks how to test it, and (c) guessed that Matthew would hate
me populating the codebase with further references to PG_private_2,
and (d) realized that this PG_private_2 thing is a transient
condition (more like writeback than private) which probably nobody
cares too much about (its lack of migration has gone unnoticed).

I'm just going to drop this 1/7, and add a (briefer than this!)
paragraph to 2/7 == 1/6's commit message in v2 later today.

> > 
> > If all folio_expected_ref_count() callers can deal with updating that
> > reference, all good.
> > 
> > nfs_migrate_folio(), for example, has folio_test_private_2() handling in
> > there (it just waits until the flag is gone). ceph handles it during
> > ceph_writepages_start(), but uses the ordinary filemap_migrate_folio().
> > 
> > Long story short: this patch is problematic if one
> > folio_expected_ref_count() user is not aware of how to handle that
> > additional reference.
> > 
> 
> Case in point, I just stumbled over
> 
> commit 682a71a1b6b363bff71440f4eca6498f827a839d
> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> Date:   Fri Sep 2 20:46:46 2022 +0100
> 
>     migrate: convert __unmap_and_move() to use folios
> 
> and
> 
> commit 8faa8ef5dd11abe119ad0c8ccd39f2064ca7ed0e
> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> Date:   Mon Jun 6 09:34:36 2022 -0400
> 
>     mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()
>     
>     Use a folio throughout.  migrate_page() will be converted to
>     migrate_folio() later.
> 
> 
> where we converted from page_has_private() to folio_test_private(). Maybe
> that's all sane, but it raises the question of whether migration (and maybe
> splitting) as a whole is now incompatible with PG_private_2.

The commit I blamed in my notes was 108ca835; I think that's the one
that changes "has" to "test" in the "expected" calculation; but yes,
8faa8ef5 is significant for skipping the call to release_folio.

Hugh
Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
Posted by Matthew Wilcox 3 weeks, 3 days ago
On Mon, Sep 08, 2025 at 03:27:47AM -0700, Hugh Dickins wrote:
> On Mon, 1 Sep 2025, David Hildenbrand wrote:
> > On 01.09.25 09:52, David Hildenbrand wrote:
> > > On 01.09.25 03:17, Hugh Dickins wrote:
> > >> On Mon, 1 Sep 2025, Matthew Wilcox wrote:
> > >>> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
> > >>>> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
> > >>>> which (like PG_private, but not in addition to PG_private) counts for
> > >>>> 1 more reference: it needs to be using folio_has_private() in place of
> > >>>> folio_test_private().
> > >>>
> > >>> No, it doesn't.  I know it used to, but no filesystem was actually doing
> > >>> that.  So I changed mm to match how filesystems actually worked.
> 
> I think Matthew may be remembering how he wanted it to behave (? but he
> wanted it to go away completely) rather than how it ended up behaving:
> we've both found that PG_private_2 always goes with refcount increment.

Let me explain that better.  No filesystem followed the documented rule
that the refcount must be incremented by one if either PG_private or
PG_private_2 was set.  And no surprise; that's a very complicated rule
for filesystems to follow.  Many of them weren't even following the rule
to increment the refcount by one when PG_private was set.

So some were incrementing the refcount by one if PG_private were set, but
not bumping the refcount by one if PG_private_2 were set (I think this is
how btrfs worked, and you seem to believe the same thing).  Others were
bumping the refcount by two if both PG_private and PG_private_2 were set
(I think this is how netfs works today).
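
A sketch of that "+2" pattern (illustrative only; folio and data stand
in for the real callers' variables):

/* Each flag takes its own reference, so a folio with both bits set
 * carries two extra references beyond the pagecache ones. */
folio_attach_private(folio, data);	/* sets PG_private, folio_get() */
folio_start_private_2(folio);		/* sets PG_private_2, folio_get() again */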

> > > Now, one problem would be if migration / splitting / ... code where we
> > > use folio_expected_ref_count() cannot deal with that additional
> > > reference properly, in which case this patch would indeed cause harm.
> 
> Yes, that appears to be why Matthew said NAK and "dangerously wrong".
> 
> So far as I could tell, there is no problem with nfs, it has, and has
> all along had, the appropriate release_folio and migrate_folio methods.
> 
> ceph used to have what's needed, but 6.0's changes from page_has_private()
> to folio_test_private() (the change from "has" either bit to "test" just
> the one bit really should have been highlighted) broke the migration of
> ceph's PG_private_2 folios.
> 
> (I think it may have got re-enabled in intervening releases: David
> Howells reinstated folio_has_private() inside fallback_migrate_folio()'s
> filemap_release_folio(), which may have been enough to get ceph's
> PG_private_2s migratable again; but then 6.15's ceph .migrate_folio =
> filemap_migrate_folio will have broken it again.)
> 
> Folio migration does not and never has copied over PG_private_2 from
> src to dst; so my 1/7 patch would have permitted migration of a ceph
> PG_private_2 src folio to a dst folio left with refcount 1 more than
> it should be (plus whatever the consequences of migrating such a
> folio which should have waited for the flag to be cleared first).

But that's another problem.  The current meaning of PG_fscache (and also
that has changed over the years!) is that the data in the folio is being
written to the fscache.  So we _shouldn't_ migrate the folio as some
piece of storage hardware is busy reading from the old folio.  And if
somebody else starts writing to the old folio, we'll have a corrupted
fscache.

So the current behaviour where we set private_2 and bump the refcount,
but don't take the private_2 status into account is the safe one,
because the elevated refcount means we'll skip the PG_fscache folio.
Maybe it'd be better to wait for it to clear.  But since Dave Howells
is busy killing it off, I'm just inclined to wait for that to happen.
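
A minimal sketch of why that is safe, simplified from the refcount
check pattern in mm/migrate.c (can_migrate() is a made-up name, not
the real function):

static bool can_migrate(struct folio *src)
{
	/* +1 for the reference migration itself holds on the folio. */
	int expected = folio_expected_ref_count(src) + 1;

	/* A PG_fscache folio carries one reference that
	 * folio_expected_ref_count() does not count, so this check
	 * fails and the folio is skipped (-EAGAIN in the real code). */
	return folio_ref_count(src) == expected;
}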

> I'm just going to drop this 1/7, and add a (briefer than this!)
> paragraph to 2/7 == 1/6's commit message in v2 later today.

Thank you!
Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
Posted by Hugh Dickins 3 weeks, 3 days ago
On Mon, 8 Sep 2025, Matthew Wilcox wrote:
> On Mon, Sep 08, 2025 at 03:27:47AM -0700, Hugh Dickins wrote:
> > On Mon, 1 Sep 2025, David Hildenbrand wrote:
> > > On 01.09.25 09:52, David Hildenbrand wrote:
> > > > On 01.09.25 03:17, Hugh Dickins wrote:
> > > >> On Mon, 1 Sep 2025, Matthew Wilcox wrote:
> > > >>> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
> > > >>>> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
> > > >>>> which (like PG_private, but not in addition to PG_private) counts for
> > > >>>> 1 more reference: it needs to be using folio_has_private() in place of
> > > >>>> folio_test_private().
> > > >>>
> > > >>> No, it doesn't.  I know it used to, but no filesystem was actually doing
> > > >>> that.  So I changed mm to match how filesystems actually worked.
> > 
> > I think Matthew may be remembering how he wanted it to behave (? but he
> > wanted it to go away completely) rather than how it ended up behaving:
> > we've both found that PG_private_2 always goes with refcount increment.
> 
> Let me explain that better.  No filesystem followed the documented rule
> that the refcount must be incremented by one if either PG_private or
> PG_private_2 was set.  And no surprise; that's a very complicated rule
> for filesystems to follow.  Many of them weren't even following the rule
> to increment the refcount by one when PG_private was set.

Thanks, yes, I hadn't realized that you were referring to the +1 (versus
+2) part of it: yes, a quite unnecessarily difficult rule to follow.

...

> So the current behaviour where we set private_2 and bump the refcount,
> but don't take the private_2 status into account is the safe one,
> because the elevated refcount means we'll skip the PG_fscache folio.
> Maybe it'd be better to wait for it to clear.  But since Dave Howells
> is busy killing it off, I'm just inclined to wait for that to happen.

Yes, that's where my internalized-Matthew brought me in the end;
though killing off PG_private_2 seems to have been just around
the corner for a very long time.

It's a pity that there isn't much incentive for the ceph folks to
get it fixed: the one who suffers is the compactor or pinner.

Hugh