[v3] Get rid of PG_reclaim and rename PG_dropbehind

[PATCHv3 06/11] mm/vmscan: Use PG_dropbehind instead of PG_reclaim

Posted by Kirill A. Shutemov 1 year ago

The recently introduced PG_dropbehind allows for freeing folios
immediately after writeback. Unlike PG_reclaim, it does not need vmscan
to be involved to get the folio freed.

Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
pageout().

It is safe to leave PG_dropbehind on the folio if, for some reason
(bug?), the folio is not in a writeback state after ->writepage().
In these cases, the kernel had to clear PG_reclaim as it shared a page
flag bit with PG_readahead.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/vmscan.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc1826020159..c97adb0fdaa4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
 		if (shmem_mapping(mapping) && folio_test_large(folio))
 			wbc.list = folio_list;
 
-		folio_set_reclaim(folio);
+		folio_set_dropbehind(folio);
+
 		res = mapping->a_ops->writepage(&folio->page, &wbc);
 		if (res < 0)
 			handle_write_error(mapping, folio, res);
 		if (res == AOP_WRITEPAGE_ACTIVATE) {
-			folio_clear_reclaim(folio);
+			folio_clear_dropbehind(folio);
 			return PAGE_ACTIVATE;
 		}
 
-		if (!folio_test_writeback(folio)) {
-			/* synchronous write or broken a_ops? */
-			folio_clear_reclaim(folio);
-		}
 		trace_mm_vmscan_write_folio(folio);
 		node_stat_add_folio(folio, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
-- 
2.47.2

Re: [PATCHv3 06/11] mm/vmscan: Use PG_dropbehind instead of PG_reclaim

Posted by Kairui Song 1 year ago

On Thu, Jan 30, 2025 at 6:02 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> The recently introduced PG_dropbehind allows for freeing folios
> immediately after writeback. Unlike PG_reclaim, it does not need vmscan
> to be involved to get the folio freed.
>
> Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
> pageout().
>
> It is safe to leave PG_dropbehind on the folio if, for some reason
> (bug?), the folio is not in a writeback state after ->writepage().
> In these cases, the kernel had to clear PG_reclaim as it shared a page
> flag bit with PG_readahead.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/vmscan.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc1826020159..c97adb0fdaa4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
>                 if (shmem_mapping(mapping) && folio_test_large(folio))
>                         wbc.list = folio_list;
>
> -               folio_set_reclaim(folio);
> +               folio_set_dropbehind(folio);
> +
>                 res = mapping->a_ops->writepage(&folio->page, &wbc);
>                 if (res < 0)
>                         handle_write_error(mapping, folio, res);
>                 if (res == AOP_WRITEPAGE_ACTIVATE) {
> -                       folio_clear_reclaim(folio);
> +                       folio_clear_dropbehind(folio);
>                         return PAGE_ACTIVATE;
>                 }
>
> -               if (!folio_test_writeback(folio)) {
> -                       /* synchronous write or broken a_ops? */
> -                       folio_clear_reclaim(folio);
> -               }
>                 trace_mm_vmscan_write_folio(folio);
>                 node_stat_add_folio(folio, NR_VMSCAN_WRITE);
>                 return PAGE_SUCCESS;
> --
> 2.47.2
>

Hi, I'm seeing following panic with SWAP after this commit:

[   29.672319] Oops: general protection fault, probably for
non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI
[   29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not
tainted 6.13.0.ptch-g1fe9ea48ec98 #917
[   29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[   29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0
[   29.679891] Code: 90 90 90 90 90 90 90 90 90 90 41 57 41 56 41 55
41 54 55 53 48 83 ec 30 8b 2d 10 ac f3 01 44 8b ac 24 88 00 00 00 85
ed 74 64 <48> 8b 07 49 89 ff 48 3d 20 1d bf 83 74 56 8b 1d 8c f5 b1 01
41 89
[   29.683852] RSP: 0018:ffffc9000bea3148 EFLAGS: 00010002
[   29.684980] RAX: ffff8890874b2940 RBX: 0000000000000200 RCX: 0000000000000000
[   29.686510] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00ffff88909a3be3
[   29.688031] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[   29.689561] R10: 0000000000000000 R11: 0000000000000020 R12: 00ffff88909a3be3
[   29.691087] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   29.692613] FS:  00007fa05c2824c0(0000) GS:ffff88a03fa80000(0000)
knlGS:0000000000000000
[   29.694339] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.695581] CR2: 000055f9abb7fc7d CR3: 00000010932f2002 CR4: 0000000000770eb0
[   29.697109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   29.698637] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   29.700161] PKRU: 55555554
[   29.700759] Call Trace:
[   29.701296]  <TASK>
[   29.701770]  ? __die_body+0x1e/0x60
[   29.702540]  ? die_addr+0x3c/0x60
[   29.703267]  ? exc_general_protection+0x18f/0x3c0
[   29.704290]  ? asm_exc_general_protection+0x26/0x30
[   29.705345]  ? __lock_acquire+0x20/0x15d0
[   29.706215]  ? lockdep_hardirqs_on_prepare+0xda/0x190
[   29.707304]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   29.708452]  lock_acquire+0xbf/0x2e0
[   29.709229]  ? folio_unmap_invalidate+0x12f/0x220
[   29.710257]  ? __folio_end_writeback+0x15d/0x430
[   29.711260]  ? __folio_end_writeback+0x116/0x430
[   29.712261]  _raw_spin_lock+0x30/0x40
[   29.713064]  ? folio_unmap_invalidate+0x12f/0x220
[   29.714076]  folio_unmap_invalidate+0x12f/0x220
[   29.715058]  folio_end_writeback+0xdf/0x190
[   29.715967]  swap_writepage_bdev_sync+0x1e0/0x450
[   29.716994]  ? __pfx_submit_bio_wait_endio+0x10/0x10
[   29.718074]  swap_writepage+0x46b/0x6b0
[   29.718917]  pageout+0x14b/0x360
[   29.719628]  shrink_folio_list+0x67d/0xec0
[   29.720519]  ? mark_held_locks+0x48/0x80
[   29.721375]  evict_folios+0x2a7/0x9e0
[   29.722179]  try_to_shrink_lruvec+0x19a/0x270
[   29.723130]  lru_gen_shrink_lruvec+0x70/0xc0
[   29.724060]  ? __lock_acquire+0x558/0x15d0
[   29.724954]  shrink_lruvec+0x57/0x780
[   29.725754]  ? find_held_lock+0x2d/0xa0
[   29.726588]  ? rcu_read_unlock+0x17/0x60
[   29.727449]  shrink_node+0x2ad/0x930
[   29.728229]  do_try_to_free_pages+0xbd/0x4e0
[   29.729160]  try_to_free_mem_cgroup_pages+0x123/0x2c0
[   29.730252]  try_charge_memcg+0x222/0x660
[   29.731128]  charge_memcg+0x3c/0x80
[   29.731888]  __mem_cgroup_charge+0x30/0x70
[   29.732776]  shmem_alloc_and_add_folio+0x1a5/0x480
[   29.733818]  ? filemap_get_entry+0x155/0x390
[   29.734748]  shmem_get_folio_gfp+0x28c/0x6c0
[   29.735680]  shmem_write_begin+0x5a/0xc0
[   29.736535]  generic_perform_write+0x12a/0x2e0
[   29.737503]  shmem_file_write_iter+0x86/0x90
[   29.738428]  vfs_write+0x364/0x530
[   29.739180]  ksys_write+0x6c/0xe0
[   29.739906]  do_syscall_64+0x66/0x140
[   29.740713]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   29.741800] RIP: 0033:0x7fa05c439984
[   29.742584] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 06 0e 00 00 74 13 b8 01 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20
48 89
[   29.746542] RSP: 002b:00007ffece7720f8 EFLAGS: 00000202 ORIG_RAX:
0000000000000001
[   29.748157] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007fa05c439984
[   29.749682] RDX: 0000000000002800 RSI: 000055f9cfa08000 RDI: 0000000000000004
[   29.751216] RBP: 00007ffece772140 R08: 0000000000002800 R09: 0000000000000007
[   29.752743] R10: 0000000000000180 R11: 0000000000000202 R12: 000055f9cfa08000
[   29.754262] R13: 0000000000000004 R14: 0000000000002800 R15: 00000000000009af
[   29.755797]  </TASK>
[   29.756285] Modules linked in: zram virtiofs

I'm testing with PROVE_LOCKING on. It seems folio_unmap_invalidate is
called for swapcache folio and it doesn't work well, following PATCH
on top of mm-unstable seems fix it well:

diff --git a/mm/filemap.c b/mm/filemap.c
index 4fe551037bf7..98493443d120 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1605,8 +1605,9 @@ static void folio_end_reclaim_write(struct folio *folio)
         * invalidation in that case.
         */
        if (in_task() && folio_trylock(folio)) {
-               if (folio->mapping)
-                       folio_unmap_invalidate(folio->mapping, folio, 0);
+               struct address_space *mapping = folio_mapping(folio);
+               if (mapping)
+                       folio_unmap_invalidate(mapping, folio, 0);
                folio_unlock(folio);
        }
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index e922ceb66c44..4f3e34c52d8b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -565,23 +565,29 @@ int folio_unmap_invalidate(struct address_space
*mapping, struct folio *folio,
        if (!filemap_release_folio(folio, gfp))
                return -EBUSY;

-       spin_lock(&mapping->host->i_lock);
+       if (!folio_test_swapcache(folio)) {
+               spin_lock(&mapping->host->i_lock);
+               BUG_ON(folio_has_private(folio));
+       }
+
        xa_lock_irq(&mapping->i_pages);
        if (folio_test_dirty(folio))
                goto failed;

-       BUG_ON(folio_has_private(folio));
        __filemap_remove_folio(folio, NULL);
        xa_unlock_irq(&mapping->i_pages);
        if (mapping_shrinkable(mapping))
                inode_add_lru(mapping->host);
-       spin_unlock(&mapping->host->i_lock);
+
+       if (!folio_test_swapcache(folio))
+               spin_unlock(&mapping->host->i_lock);

        filemap_free_folio(mapping, folio);
        return 1;
 failed:
        xa_unlock_irq(&mapping->i_pages);
-       spin_unlock(&mapping->host->i_lock);
+       if (!folio_test_swapcache(folio))
+               spin_unlock(&mapping->host->i_lock);
        return -EBUSY;
 }

Re: [PATCHv3 06/11] mm/vmscan: Use PG_dropbehind instead of PG_reclaim

Posted by Kirill A. Shutemov 1 year ago

On Sat, Feb 01, 2025 at 04:01:43PM +0800, Kairui Song wrote:
> On Thu, Jan 30, 2025 at 6:02 PM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > The recently introduced PG_dropbehind allows for freeing folios
> > immediately after writeback. Unlike PG_reclaim, it does not need vmscan
> > to be involved to get the folio freed.
> >
> > Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
> > pageout().
> >
> > It is safe to leave PG_dropbehind on the folio if, for some reason
> > (bug?), the folio is not in a writeback state after ->writepage().
> > In these cases, the kernel had to clear PG_reclaim as it shared a page
> > flag bit with PG_readahead.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > ---
> >  mm/vmscan.c | 9 +++------
> >  1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bc1826020159..c97adb0fdaa4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
> >                 if (shmem_mapping(mapping) && folio_test_large(folio))
> >                         wbc.list = folio_list;
> >
> > -               folio_set_reclaim(folio);
> > +               folio_set_dropbehind(folio);
> > +
> >                 res = mapping->a_ops->writepage(&folio->page, &wbc);
> >                 if (res < 0)
> >                         handle_write_error(mapping, folio, res);
> >                 if (res == AOP_WRITEPAGE_ACTIVATE) {
> > -                       folio_clear_reclaim(folio);
> > +                       folio_clear_dropbehind(folio);
> >                         return PAGE_ACTIVATE;
> >                 }
> >
> > -               if (!folio_test_writeback(folio)) {
> > -                       /* synchronous write or broken a_ops? */
> > -                       folio_clear_reclaim(folio);
> > -               }
> >                 trace_mm_vmscan_write_folio(folio);
> >                 node_stat_add_folio(folio, NR_VMSCAN_WRITE);
> >                 return PAGE_SUCCESS;
> > --
> > 2.47.2
> >
> 
> Hi, I'm seeing following panic with SWAP after this commit:
> 
> [   29.672319] Oops: general protection fault, probably for
> non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI
> [   29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not
> tainted 6.13.0.ptch-g1fe9ea48ec98 #917
> [   29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> [   29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0

Ouch.

I failed to trigger it my setup. Could you share your reproducer?

> I'm testing with PROVE_LOCKING on. It seems folio_unmap_invalidate is
> called for swapcache folio and it doesn't work well, following PATCH
> on top of mm-unstable seems fix it well:

Right. I don't understand swapping good enough. I missed this.

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4fe551037bf7..98493443d120 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1605,8 +1605,9 @@ static void folio_end_reclaim_write(struct folio *folio)
>          * invalidation in that case.
>          */
>         if (in_task() && folio_trylock(folio)) {
> -               if (folio->mapping)
> -                       folio_unmap_invalidate(folio->mapping, folio, 0);
> +               struct address_space *mapping = folio_mapping(folio);
> +               if (mapping)
> +                       folio_unmap_invalidate(mapping, folio, 0);
>                 folio_unlock(folio);
>         }
>  }

Once you do this, folio_unmap_invalidate() will never succeed for
swapcache as folio->mapping != mapping check will always be true and it
will fail with -EBUSY.

I guess we need to do something similar to what __remove_mapping() does
for swapcache folios.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCHv3 06/11] mm/vmscan: Use PG_dropbehind instead of PG_reclaim

Posted by Sergey Senozhatsky 1 year ago

On (25/02/03 10:39), Kirill A. Shutemov wrote:
> > Hi, I'm seeing following panic with SWAP after this commit:
> >
> > [   29.672319] Oops: general protection fault, probably for
> > non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI
> > [   29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not
> > tainted 6.13.0.ptch-g1fe9ea48ec98 #917
> > [   29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> > [   29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0
>
> Ouch.
>
> I failed to trigger it my setup. Could you share your reproducer?

I'm seeing this as well (backtraces below).

My repro is:

- 4GB VM with 2 zram devices
  - one is setup as swap
  - the other one has ext4 fs on it
	- I dd large files to it


---

xa_lock_irq(&mapping->i_pages):

[   94.609589][  T157] Oops: general protection fault, probably for non-canonical address 0xe01ffbf11020301a: 0000 [#1] PREEMPT SMP KASAN PTI
[   94.611881][  T157] KASAN: maybe wild-memory-access in range [0x00ffff88810180d0-0x00ffff88810180d7]
[   94.613567][  T157] CPU: 1 UID: 0 PID: 157 Comm: kswapd0 Not tainted 6.13.0+ #927
[   94.614947][  T157] RIP: 0010:__lock_acquire+0x6a/0x1ef0
[   94.615942][  T157] Code: 08 84 d2 0f 85 ed 13 00 00 44 8b 05 24 30 d5 02 45 85 c0 0f 84 bc 07 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 0f 85 eb 18 00 00 49 8b 04 24 48 3d a0 8b ac 84 0f 84
[   94.619668][  T157] RSP: 0018:ffff88810510eec0 EFLAGS: 00010002
[   94.620835][  T157] RAX: dffffc0000000000 RBX: 1ffff11020a21df5 RCX: 1ffffffff084c092
[   94.622329][  T157] RDX: 001ffff11020301a RSI: 0000000000000000 RDI: 00ffff88810180d1
[   94.623779][  T157] RBP: 00ffff88810180d1 R08: 0000000000000001 R09: 0000000000000000
[   94.625213][  T157] R10: ffffffff8425d0d7 R11: 0000000000000000 R12: 00ffff88810180d1
[   94.626656][  T157] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[   94.628086][  T157] FS:  0000000000000000(0000) GS:ffff88815aa80000(0000) knlGS:0000000000000000
[   94.629700][  T157] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   94.630894][  T157] CR2: 00007f757719c2b0 CR3: 0000000003c82005 CR4: 0000000000770ef0
[   94.632333][  T157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   94.633796][  T157] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   94.635265][  T157] PKRU: 55555554
[   94.635909][  T157] Call Trace:
[   94.636512][  T157]  <TASK>
[   94.637052][  T157]  ? show_trace_log_lvl+0x1a7/0x2e0
[   94.638005][  T157]  ? show_trace_log_lvl+0x1a7/0x2e0
[   94.638960][  T157]  ? lock_acquire.part.0+0xfa/0x310
[   94.639909][  T157]  ? __die_body.cold+0x8/0x12
[   94.640765][  T157]  ? die_addr+0x42/0x70
[   94.641530][  T157]  ? exc_general_protection+0x12e/0x210
[   94.642558][  T157]  ? asm_exc_general_protection+0x22/0x30
[   94.643610][  T157]  ? __lock_acquire+0x6a/0x1ef0
[   94.644506][  T157]  ? _raw_spin_unlock_irq+0x24/0x40
[   94.645468][  T157]  ? __wait_for_common+0x2f2/0x610
[   94.646412][  T157]  ? pci_mmcfg_reserved+0x120/0x120
[   94.647364][  T157]  ? submit_bio_noacct_nocheck+0x32e/0x3e0
[   94.648448][  T157]  ? lock_is_held_type+0x81/0xe0
[   94.649360][  T157]  lock_acquire.part.0+0xfa/0x310
[   94.650288][  T157]  ? folio_unmap_invalidate+0x286/0x550
[   94.651324][  T157]  ? __lock_acquire+0x1ef0/0x1ef0
[   94.652250][  T157]  ? submit_bio_wait+0x17c/0x200
[   94.653166][  T157]  ? submit_bio_wait_endio+0x40/0x40
[   94.654140][  T157]  ? lock_acquire+0x18a/0x1f0
[   94.655008][  T157]  _raw_spin_lock+0x2c/0x40
[   94.655853][  T157]  ? folio_unmap_invalidate+0x286/0x550
[   94.656879][  T157]  folio_unmap_invalidate+0x286/0x550
[   94.657866][  T157]  folio_end_writeback+0x146/0x190
[   94.658815][  T157]  swap_writepage_bdev_sync+0x312/0x410
[   94.659840][  T157]  ? swap_read_folio_bdev_sync+0x3c0/0x3c0
[   94.660917][  T157]  ? do_raw_spin_lock+0x12a/0x260
[   94.661845][  T157]  ? __rwlock_init+0x150/0x150
[   94.662726][  T157]  ? bio_kmalloc+0x20/0x20
[   94.663548][  T157]  ? swapcache_clear+0xd0/0xd0
[   94.664431][  T157]  swap_writepage+0x2a5/0x720
[   94.665298][  T157]  pageout+0x304/0x6a0
[   94.666052][  T157]  ? get_pte_pfn.isra.0+0x4d0/0x4d0
[   94.667025][  T157]  ? find_held_lock+0x2d/0x110
[   94.667912][  T157]  ? enable_swap_slots_cache+0x90/0x90
[   94.668925][  T157]  ? arch_tlbbatch_flush+0x1f6/0x370
[   94.669903][  T157]  shrink_folio_list+0x19b5/0x2600
[   94.670856][  T157]  ? pageout+0x6a0/0x6a0
[   94.671649][  T157]  ? isolate_folios+0x156/0x320
[   94.672544][  T157]  ? find_held_lock+0x2d/0x110
[   94.673428][  T157]  ? mark_lock+0xcc/0x12c0
[   94.674258][  T157]  ? mark_lock_irq+0x1cd0/0x1cd0
[   94.675174][  T157]  ? reacquire_held_locks+0x4d0/0x4d0
[   94.676166][  T157]  ? mark_held_locks+0x94/0xe0
[   94.677045][  T157]  evict_folios+0x4bb/0x1580
[   94.677890][  T157]  ? isolate_folios+0x320/0x320
[   94.678787][  T157]  ? __lock_acquire+0xc4c/0x1ef0
[   94.679695][  T157]  ? lock_is_held_type+0x81/0xe0
[   94.680607][  T157]  try_to_shrink_lruvec+0x41e/0x9e0
[   94.681564][  T157]  ? __lock_acquire+0xc4c/0x1ef0
[   94.682482][  T157]  ? evict_folios+0x1580/0x1580
[   94.683390][  T157]  ? lock_release+0x105/0x260
[   94.684255][  T157]  lru_gen_shrink_node+0x25d/0x660
[   94.685202][  T157]  ? balance_pgdat+0x5b5/0xf00
[   94.686083][  T157]  ? try_to_shrink_lruvec+0x9e0/0x9e0
[   94.687076][  T157]  ? pgdat_balanced+0xb8/0x110
[   94.687957][  T157]  balance_pgdat+0x532/0xf00
[   94.688803][  T157]  ? shrink_node.part.0+0xc30/0xc30
[   94.689758][  T157]  ? io_schedule_timeout+0x110/0x110
[   94.690741][  T157]  ? reacquire_held_locks+0x4d0/0x4d0
[   94.691723][  T157]  ? __lock_acquire+0x1ef0/0x1ef0
[   94.692643][  T157]  ? zone_watermark_ok_safe+0x32/0x290
[   94.693650][  T157]  ? inactive_is_low.isra.0+0xe0/0xe0
[   94.694639][  T157]  ? do_raw_spin_lock+0x12a/0x260
[   94.695567][  T157]  kswapd+0x2ef/0x4e0
[   94.696297][  T157]  ? balance_pgdat+0xf00/0xf00
[   94.697176][  T157]  ? __kthread_parkme+0xb1/0x1c0
[   94.698087][  T157]  ? balance_pgdat+0xf00/0xf00
[   94.698971][  T157]  kthread+0x38b/0x700
[   94.699721][  T157]  ? kthread_is_per_cpu+0xb0/0xb0
[   94.700648][  T157]  ? lock_acquire+0x18a/0x1f0
[   94.701516][  T157]  ? kthread_is_per_cpu+0xb0/0xb0
[   94.702438][  T157]  ret_from_fork+0x2d/0x70
[   94.703267][  T157]  ? kthread_is_per_cpu+0xb0/0xb0
[   94.704193][  T157]  ret_from_fork_asm+0x11/0x20
[   94.705074][  T157]  </TASK>


Also UAF in compactd

[   95.249096][  T146] ==================================================================
[   95.254091][  T146] BUG: KASAN: slab-use-after-free in kcompactd+0x9cd/0xa60
[   95.257959][  T146] Read of size 4 at addr ffff888105100018 by task kcompactd0/146
[   95.262100][  T146] 
[   95.263347][  T146] CPU: 11 UID: 0 PID: 146 Comm: kcompactd0 Tainted: G      D W          6.13.0+ #927
[   95.263363][  T146] Tainted: [D]=DIE, [W]=WARN
[   95.263367][  T146] Call Trace:
[   95.263379][  T146]  <TASK>
[   95.263386][  T146]  dump_stack_lvl+0x57/0x80
[   95.263403][  T146]  print_address_description.constprop.0+0x88/0x330
[   95.263416][  T146]  ? kcompactd+0x9cd/0xa60
[   95.263425][  T146]  print_report+0xe2/0x1cc
[   95.263433][  T146]  ? __virt_addr_valid+0x1d1/0x3b0
[   95.263442][  T146]  ? kcompactd+0x9cd/0xa60
[   95.263449][  T146]  ? kcompactd+0x9cd/0xa60
[   95.263456][  T146]  kasan_report+0xb9/0x180
[   95.263466][  T146]  ? kcompactd+0x9cd/0xa60
[   95.263476][  T146]  kcompactd+0x9cd/0xa60
[   95.263487][  T146]  ? kcompactd_do_work+0x710/0x710
[   95.263495][  T146]  ? prepare_to_swait_exclusive+0x260/0x260
[   95.263506][  T146]  ? __kthread_parkme+0xb1/0x1c0
[   95.263520][  T146]  ? kcompactd_do_work+0x710/0x710
[   95.263527][  T146]  kthread+0x38b/0x700
[   95.263535][  T146]  ? kthread_is_per_cpu+0xb0/0xb0
[   95.263542][  T146]  ? lock_acquire+0x18a/0x1f0
[   95.263552][  T146]  ? kthread_is_per_cpu+0xb0/0xb0
[   95.263559][  T146]  ret_from_fork+0x2d/0x70
[   95.263569][  T146]  ? kthread_is_per_cpu+0xb0/0xb0
[   95.263576][  T146]  ret_from_fork_asm+0x11/0x20
[   95.263589][  T146]  </TASK>
[   95.263592][  T146] 
[   95.293474][  T146] Allocated by task 2:
[   95.294209][  T146]  kasan_save_stack+0x1e/0x40
[   95.295111][  T146]  kasan_save_track+0x10/0x30
[   95.295978][  T146]  __kasan_slab_alloc+0x62/0x70
[   95.296860][  T146]  kmem_cache_alloc_node_noprof+0xdb/0x2a0
[   95.297915][  T146]  dup_task_struct+0x32/0x550
[   95.298797][  T146]  copy_process+0x309/0x45d0
[   95.299656][  T146]  kernel_clone+0xb7/0x600
[   95.300451][  T146]  kernel_thread+0xb0/0xe0
[   95.301253][  T146]  kthreadd+0x3b5/0x620
[   95.302019][  T146]  ret_from_fork+0x2d/0x70
[   95.302865][  T146]  ret_from_fork_asm+0x11/0x20
[   95.303724][  T146] 
[   95.304146][  T146] Freed by task 0:
[   95.304836][  T146]  kasan_save_stack+0x1e/0x40
[   95.305708][  T146]  kasan_save_track+0x10/0x30
[   95.306569][  T146]  kasan_save_free_info+0x37/0x50
[   95.307515][  T146]  __kasan_slab_free+0x33/0x40
[   95.308402][  T146]  kmem_cache_free+0xff/0x480
[   95.309256][  T146]  delayed_put_task_struct+0x15a/0x1d0
[   95.310258][  T146]  rcu_do_batch+0x2ee/0xb70
[   95.311113][  T146]  rcu_core+0x4a6/0xa10
[   95.311868][  T146]  handle_softirqs+0x191/0x650
[   95.312747][  T146]  __irq_exit_rcu+0xaf/0xe0
[   95.313643][  T146]  irq_exit_rcu+0xa/0x20
[   95.314536][  T146]  sysvec_apic_timer_interrupt+0x65/0x80
[   95.315616][  T146]  asm_sysvec_apic_timer_interrupt+0x16/0x20
[   95.316702][  T146] 
[   95.317127][  T146] Last potentially related work creation:
[   95.318155][  T146]  kasan_save_stack+0x1e/0x40
[   95.319006][  T146]  kasan_record_aux_stack+0x97/0xa0
[   95.319947][  T146]  __call_rcu_common.constprop.0+0x70/0x7b0
[   95.321014][  T146]  __schedule+0x75d/0x1720
[   95.321817][  T146]  schedule_idle+0x55/0x80
[   95.322624][  T146]  cpu_startup_entry+0x50/0x60
[   95.323490][  T146]  start_secondary+0x1b6/0x210
[   95.324354][  T146]  common_startup_64+0x12c/0x138
[   95.325248][  T146] 
[   95.325669][  T146] The buggy address belongs to the object at ffff888105100000
[   95.325669][  T146]  which belongs to the cache task_struct of size 8200
[   95.328215][  T146] The buggy address is located 24 bytes inside of
[   95.328215][  T146]  freed 8200-byte region [ffff888105100000, ffff888105102008)
[   95.330692][  T146] 
[   95.331116][  T146] The buggy address belongs to the physical page:
[   95.332275][  T146] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x105100
[   95.333862][  T146] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[   95.335399][  T146] flags: 0x8000000000000040(head|zone=2)
[   95.336425][  T146] page_type: f5(slab)
[   95.337155][  T146] raw: 8000000000000040 ffff888100a80c80 dead000000000122 0000000000000000
[   95.338716][  T146] raw: 0000000000000000 0000000000030003 00000000f5000000 0000000000000000
[   95.340273][  T146] head: 8000000000000040 ffff888100a80c80 dead000000000122 0000000000000000
[   95.341844][  T146] head: 0000000000000000 0000000000030003 00000000f5000000 0000000000000000
[   95.343418][  T146] head: 8000000000000003 ffffea0004144001 ffffffffffffffff 0000000000000000
[   95.344977][  T146] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[   95.346540][  T146] page dumped because: kasan: bad access detected
[   95.347701][  T146] 
[   95.348123][  T146] Memory state around the buggy address:
[   95.349139][  T146]  ffff8881050fff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   95.350598][  T146]  ffff8881050fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   95.352054][  T146] >ffff888105100000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   95.353510][  T146]                             ^
[   95.354389][  T146]  ffff888105100080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   95.355856][  T146]  ffff888105100100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   95.357315][  T146] ==================================================================

Re: [PATCHv3 06/11] mm/vmscan: Use PG_dropbehind instead of PG_reclaim

Posted by Andrew Morton 1 year ago

On Mon, 3 Feb 2025 10:39:58 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 4fe551037bf7..98493443d120 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1605,8 +1605,9 @@ static void folio_end_reclaim_write(struct folio *folio)
> >          * invalidation in that case.
> >          */
> >         if (in_task() && folio_trylock(folio)) {
> > -               if (folio->mapping)
> > -                       folio_unmap_invalidate(folio->mapping, folio, 0);
> > +               struct address_space *mapping = folio_mapping(folio);
> > +               if (mapping)
> > +                       folio_unmap_invalidate(mapping, folio, 0);
> >                 folio_unlock(folio);
> >         }
> >  }
> 
> Once you do this, folio_unmap_invalidate() will never succeed for
> swapcache as folio->mapping != mapping check will always be true and it
> will fail with -EBUSY.
> 
> I guess we need to do something similar to what __remove_mapping() does
> for swapcache folios.

Thanks, I'll drop the v3 series from mm.git.

Re: [PATCHv3 06/11] mm/vmscan: Use PG_dropbehind instead of PG_reclaim

Posted by Yu Zhao 1 year ago

On Sat, Feb 1, 2025 at 1:02 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Jan 30, 2025 at 6:02 PM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > The recently introduced PG_dropbehind allows for freeing folios
> > immediately after writeback. Unlike PG_reclaim, it does not need vmscan
> > to be involved to get the folio freed.
> >
> > Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
> > pageout().
> >
> > It is safe to leave PG_dropbehind on the folio if, for some reason
> > (bug?), the folio is not in a writeback state after ->writepage().
> > In these cases, the kernel had to clear PG_reclaim as it shared a page
> > flag bit with PG_readahead.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > ---
> >  mm/vmscan.c | 9 +++------
> >  1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bc1826020159..c97adb0fdaa4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
> >                 if (shmem_mapping(mapping) && folio_test_large(folio))
> >                         wbc.list = folio_list;
> >
> > -               folio_set_reclaim(folio);
> > +               folio_set_dropbehind(folio);
> > +
> >                 res = mapping->a_ops->writepage(&folio->page, &wbc);
> >                 if (res < 0)
> >                         handle_write_error(mapping, folio, res);
> >                 if (res == AOP_WRITEPAGE_ACTIVATE) {
> > -                       folio_clear_reclaim(folio);
> > +                       folio_clear_dropbehind(folio);
> >                         return PAGE_ACTIVATE;
> >                 }
> >
> > -               if (!folio_test_writeback(folio)) {
> > -                       /* synchronous write or broken a_ops? */
> > -                       folio_clear_reclaim(folio);
> > -               }
> >                 trace_mm_vmscan_write_folio(folio);
> >                 node_stat_add_folio(folio, NR_VMSCAN_WRITE);
> >                 return PAGE_SUCCESS;
> > --
> > 2.47.2
> >
>
> Hi, I'm seeing following panic with SWAP after this commit:
>
> [   29.672319] Oops: general protection fault, probably for
> non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI
> [   29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not
> tainted 6.13.0.ptch-g1fe9ea48ec98 #917
> [   29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> [   29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0
> [   29.679891] Code: 90 90 90 90 90 90 90 90 90 90 41 57 41 56 41 55
> 41 54 55 53 48 83 ec 30 8b 2d 10 ac f3 01 44 8b ac 24 88 00 00 00 85
> ed 74 64 <48> 8b 07 49 89 ff 48 3d 20 1d bf 83 74 56 8b 1d 8c f5 b1 01
> 41 89
> [   29.683852] RSP: 0018:ffffc9000bea3148 EFLAGS: 00010002
> [   29.684980] RAX: ffff8890874b2940 RBX: 0000000000000200 RCX: 0000000000000000
> [   29.686510] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00ffff88909a3be3
> [   29.688031] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> [   29.689561] R10: 0000000000000000 R11: 0000000000000020 R12: 00ffff88909a3be3
> [   29.691087] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [   29.692613] FS:  00007fa05c2824c0(0000) GS:ffff88a03fa80000(0000)
> knlGS:0000000000000000
> [   29.694339] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   29.695581] CR2: 000055f9abb7fc7d CR3: 00000010932f2002 CR4: 0000000000770eb0
> [   29.697109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   29.698637] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   29.700161] PKRU: 55555554
> [   29.700759] Call Trace:
> [   29.701296]  <TASK>
> [   29.701770]  ? __die_body+0x1e/0x60
> [   29.702540]  ? die_addr+0x3c/0x60
> [   29.703267]  ? exc_general_protection+0x18f/0x3c0
> [   29.704290]  ? asm_exc_general_protection+0x26/0x30
> [   29.705345]  ? __lock_acquire+0x20/0x15d0
> [   29.706215]  ? lockdep_hardirqs_on_prepare+0xda/0x190
> [   29.707304]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [   29.708452]  lock_acquire+0xbf/0x2e0
> [   29.709229]  ? folio_unmap_invalidate+0x12f/0x220
> [   29.710257]  ? __folio_end_writeback+0x15d/0x430
> [   29.711260]  ? __folio_end_writeback+0x116/0x430
> [   29.712261]  _raw_spin_lock+0x30/0x40
> [   29.713064]  ? folio_unmap_invalidate+0x12f/0x220
> [   29.714076]  folio_unmap_invalidate+0x12f/0x220
> [   29.715058]  folio_end_writeback+0xdf/0x190
> [   29.715967]  swap_writepage_bdev_sync+0x1e0/0x450
> [   29.716994]  ? __pfx_submit_bio_wait_endio+0x10/0x10
> [   29.718074]  swap_writepage+0x46b/0x6b0
> [   29.718917]  pageout+0x14b/0x360
> [   29.719628]  shrink_folio_list+0x67d/0xec0
> [   29.720519]  ? mark_held_locks+0x48/0x80
> [   29.721375]  evict_folios+0x2a7/0x9e0
> [   29.722179]  try_to_shrink_lruvec+0x19a/0x270
> [   29.723130]  lru_gen_shrink_lruvec+0x70/0xc0
> [   29.724060]  ? __lock_acquire+0x558/0x15d0
> [   29.724954]  shrink_lruvec+0x57/0x780
> [   29.725754]  ? find_held_lock+0x2d/0xa0
> [   29.726588]  ? rcu_read_unlock+0x17/0x60
> [   29.727449]  shrink_node+0x2ad/0x930
> [   29.728229]  do_try_to_free_pages+0xbd/0x4e0
> [   29.729160]  try_to_free_mem_cgroup_pages+0x123/0x2c0
> [   29.730252]  try_charge_memcg+0x222/0x660
> [   29.731128]  charge_memcg+0x3c/0x80
> [   29.731888]  __mem_cgroup_charge+0x30/0x70
> [   29.732776]  shmem_alloc_and_add_folio+0x1a5/0x480
> [   29.733818]  ? filemap_get_entry+0x155/0x390
> [   29.734748]  shmem_get_folio_gfp+0x28c/0x6c0
> [   29.735680]  shmem_write_begin+0x5a/0xc0
> [   29.736535]  generic_perform_write+0x12a/0x2e0
> [   29.737503]  shmem_file_write_iter+0x86/0x90
> [   29.738428]  vfs_write+0x364/0x530
> [   29.739180]  ksys_write+0x6c/0xe0
> [   29.739906]  do_syscall_64+0x66/0x140
> [   29.740713]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   29.741800] RIP: 0033:0x7fa05c439984
> [   29.742584] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
> 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 06 0e 00 00 74 13 b8 01 00 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20
> 48 89
> [   29.746542] RSP: 002b:00007ffece7720f8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000001
> [   29.748157] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007fa05c439984
> [   29.749682] RDX: 0000000000002800 RSI: 000055f9cfa08000 RDI: 0000000000000004
> [   29.751216] RBP: 00007ffece772140 R08: 0000000000002800 R09: 0000000000000007
> [   29.752743] R10: 0000000000000180 R11: 0000000000000202 R12: 000055f9cfa08000
> [   29.754262] R13: 0000000000000004 R14: 0000000000002800 R15: 00000000000009af
> [   29.755797]  </TASK>
> [   29.756285] Modules linked in: zram virtiofs
>
> I'm testing with PROVE_LOCKING on. It seems folio_unmap_invalidate is
> called for swapcache folio and it doesn't work well, following PATCH
> on top of mm-unstable seems fix it well:

I think there is a bigger problem here. folio_end_reclaim_write()
currently calls folio_unmap_invalidate() to remove the mapping, and
that's very different from what __remove_mapping() does in the reclaim
path: not only it breaks the swapcache case, the shadow entry is also
lost.

Re: [PATCHv3 06/11] mm/vmscan: Use PG_dropbehind instead of PG_reclaim

Posted by Shakeel Butt 1 year ago

On Thu, Jan 30, 2025 at 12:00:44PM +0200, Kirill A. Shutemov wrote:
> The recently introduced PG_dropbehind allows for freeing folios
> immediately after writeback. Unlike PG_reclaim, it does not need vmscan
> to be involved to get the folio freed.
> 
> Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
> pageout().
> 
> It is safe to leave PG_dropbehind on the folio if, for some reason
> (bug?), the folio is not in a writeback state after ->writepage().
> In these cases, the kernel had to clear PG_reclaim as it shared a page
> flag bit with PG_readahead.

Is it correct to say that leaving PG_dropbehind on folios which doesn't
have writeback state after ->writepage() (i.e. store to zswap) is fine
because PG_dropbehind is not in PAGE_FLAGS_CHECK_AT_FREE?

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/vmscan.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc1826020159..c97adb0fdaa4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
>  		if (shmem_mapping(mapping) && folio_test_large(folio))
>  			wbc.list = folio_list;
>  
> -		folio_set_reclaim(folio);
> +		folio_set_dropbehind(folio);
> +
>  		res = mapping->a_ops->writepage(&folio->page, &wbc);
>  		if (res < 0)
>  			handle_write_error(mapping, folio, res);
>  		if (res == AOP_WRITEPAGE_ACTIVATE) {
> -			folio_clear_reclaim(folio);
> +			folio_clear_dropbehind(folio);
>  			return PAGE_ACTIVATE;
>  		}
>  
> -		if (!folio_test_writeback(folio)) {
> -			/* synchronous write or broken a_ops? */
> -			folio_clear_reclaim(folio);
> -		}
>  		trace_mm_vmscan_write_folio(folio);
>  		node_stat_add_folio(folio, NR_VMSCAN_WRITE);
>  		return PAGE_SUCCESS;
> -- 
> 2.47.2
>