The recently introduced PG_dropbehind allows for freeing folios
immediately after writeback. Unlike PG_reclaim, it does not need vmscan
to be involved to get the folio freed.
Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
pageout().
It is safe to leave PG_dropbehind on the folio if, for some reason
(bug?), the folio is not in a writeback state after ->writepage().
In these cases, the kernel had to clear PG_reclaim as it shared a page
flag bit with PG_readahead.
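
For context, the post-writeback path that makes this work is sketched
below. This is a minimal sketch of folio_end_reclaim_write() as it looks
in mm-unstable at the time of writing (mm/filemap.c); treat the exact
names as illustrative:

	static void folio_end_reclaim_write(struct folio *folio)
	{
		/*
		 * The folio was marked dropbehind: try to lock it and
		 * invalidate it right at writeback completion, without
		 * waiting for another vmscan pass.
		 */
		if (in_task() && folio_trylock(folio)) {
			if (folio->mapping)
				folio_unmap_invalidate(folio->mapping, folio, 0);
			folio_unlock(folio);
		}
	}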
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
mm/vmscan.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc1826020159..c97adb0fdaa4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
 	if (shmem_mapping(mapping) && folio_test_large(folio))
 		wbc.list = folio_list;
 
-	folio_set_reclaim(folio);
+	folio_set_dropbehind(folio);
+
 	res = mapping->a_ops->writepage(&folio->page, &wbc);
 	if (res < 0)
 		handle_write_error(mapping, folio, res);
 	if (res == AOP_WRITEPAGE_ACTIVATE) {
-		folio_clear_reclaim(folio);
+		folio_clear_dropbehind(folio);
 		return PAGE_ACTIVATE;
 	}
 
-	if (!folio_test_writeback(folio)) {
-		/* synchronous write or broken a_ops? */
-		folio_clear_reclaim(folio);
-	}
 	trace_mm_vmscan_write_folio(folio);
 	node_stat_add_folio(folio, NR_VMSCAN_WRITE);
 	return PAGE_SUCCESS;
--
2.47.2
On Thu, Jan 30, 2025 at 6:02 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> The recently introduced PG_dropbehind allows for freeing folios
> immediately after writeback. Unlike PG_reclaim, it does not need vmscan
> to be involved to get the folio freed.
>
> Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
> pageout().
>
> It is safe to leave PG_dropbehind on the folio if, for some reason
> (bug?), the folio is not in a writeback state after ->writepage().
> In these cases, the kernel had to clear PG_reclaim as it shared a page
> flag bit with PG_readahead.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
> mm/vmscan.c | 9 +++------
> 1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc1826020159..c97adb0fdaa4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
>  	if (shmem_mapping(mapping) && folio_test_large(folio))
>  		wbc.list = folio_list;
> 
> -	folio_set_reclaim(folio);
> +	folio_set_dropbehind(folio);
> +
>  	res = mapping->a_ops->writepage(&folio->page, &wbc);
>  	if (res < 0)
>  		handle_write_error(mapping, folio, res);
>  	if (res == AOP_WRITEPAGE_ACTIVATE) {
> -		folio_clear_reclaim(folio);
> +		folio_clear_dropbehind(folio);
>  		return PAGE_ACTIVATE;
>  	}
> 
> -	if (!folio_test_writeback(folio)) {
> -		/* synchronous write or broken a_ops? */
> -		folio_clear_reclaim(folio);
> -	}
>  	trace_mm_vmscan_write_folio(folio);
>  	node_stat_add_folio(folio, NR_VMSCAN_WRITE);
>  	return PAGE_SUCCESS;
> --
> 2.47.2
>
Hi, I'm seeing the following panic with SWAP after this commit:
[ 29.672319] Oops: general protection fault, probably for
non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI
[ 29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not
tainted 6.13.0.ptch-g1fe9ea48ec98 #917
[ 29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[ 29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0
[ 29.679891] Code: 90 90 90 90 90 90 90 90 90 90 41 57 41 56 41 55
41 54 55 53 48 83 ec 30 8b 2d 10 ac f3 01 44 8b ac 24 88 00 00 00 85
ed 74 64 <48> 8b 07 49 89 ff 48 3d 20 1d bf 83 74 56 8b 1d 8c f5 b1 01
41 89
[ 29.683852] RSP: 0018:ffffc9000bea3148 EFLAGS: 00010002
[ 29.684980] RAX: ffff8890874b2940 RBX: 0000000000000200 RCX: 0000000000000000
[ 29.686510] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00ffff88909a3be3
[ 29.688031] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[ 29.689561] R10: 0000000000000000 R11: 0000000000000020 R12: 00ffff88909a3be3
[ 29.691087] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 29.692613] FS: 00007fa05c2824c0(0000) GS:ffff88a03fa80000(0000)
knlGS:0000000000000000
[ 29.694339] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 29.695581] CR2: 000055f9abb7fc7d CR3: 00000010932f2002 CR4: 0000000000770eb0
[ 29.697109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 29.698637] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 29.700161] PKRU: 55555554
[ 29.700759] Call Trace:
[ 29.701296] <TASK>
[ 29.701770] ? __die_body+0x1e/0x60
[ 29.702540] ? die_addr+0x3c/0x60
[ 29.703267] ? exc_general_protection+0x18f/0x3c0
[ 29.704290] ? asm_exc_general_protection+0x26/0x30
[ 29.705345] ? __lock_acquire+0x20/0x15d0
[ 29.706215] ? lockdep_hardirqs_on_prepare+0xda/0x190
[ 29.707304] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 29.708452] lock_acquire+0xbf/0x2e0
[ 29.709229] ? folio_unmap_invalidate+0x12f/0x220
[ 29.710257] ? __folio_end_writeback+0x15d/0x430
[ 29.711260] ? __folio_end_writeback+0x116/0x430
[ 29.712261] _raw_spin_lock+0x30/0x40
[ 29.713064] ? folio_unmap_invalidate+0x12f/0x220
[ 29.714076] folio_unmap_invalidate+0x12f/0x220
[ 29.715058] folio_end_writeback+0xdf/0x190
[ 29.715967] swap_writepage_bdev_sync+0x1e0/0x450
[ 29.716994] ? __pfx_submit_bio_wait_endio+0x10/0x10
[ 29.718074] swap_writepage+0x46b/0x6b0
[ 29.718917] pageout+0x14b/0x360
[ 29.719628] shrink_folio_list+0x67d/0xec0
[ 29.720519] ? mark_held_locks+0x48/0x80
[ 29.721375] evict_folios+0x2a7/0x9e0
[ 29.722179] try_to_shrink_lruvec+0x19a/0x270
[ 29.723130] lru_gen_shrink_lruvec+0x70/0xc0
[ 29.724060] ? __lock_acquire+0x558/0x15d0
[ 29.724954] shrink_lruvec+0x57/0x780
[ 29.725754] ? find_held_lock+0x2d/0xa0
[ 29.726588] ? rcu_read_unlock+0x17/0x60
[ 29.727449] shrink_node+0x2ad/0x930
[ 29.728229] do_try_to_free_pages+0xbd/0x4e0
[ 29.729160] try_to_free_mem_cgroup_pages+0x123/0x2c0
[ 29.730252] try_charge_memcg+0x222/0x660
[ 29.731128] charge_memcg+0x3c/0x80
[ 29.731888] __mem_cgroup_charge+0x30/0x70
[ 29.732776] shmem_alloc_and_add_folio+0x1a5/0x480
[ 29.733818] ? filemap_get_entry+0x155/0x390
[ 29.734748] shmem_get_folio_gfp+0x28c/0x6c0
[ 29.735680] shmem_write_begin+0x5a/0xc0
[ 29.736535] generic_perform_write+0x12a/0x2e0
[ 29.737503] shmem_file_write_iter+0x86/0x90
[ 29.738428] vfs_write+0x364/0x530
[ 29.739180] ksys_write+0x6c/0xe0
[ 29.739906] do_syscall_64+0x66/0x140
[ 29.740713] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 29.741800] RIP: 0033:0x7fa05c439984
[ 29.742584] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 06 0e 00 00 74 13 b8 01 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20
48 89
[ 29.746542] RSP: 002b:00007ffece7720f8 EFLAGS: 00000202 ORIG_RAX:
0000000000000001
[ 29.748157] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007fa05c439984
[ 29.749682] RDX: 0000000000002800 RSI: 000055f9cfa08000 RDI: 0000000000000004
[ 29.751216] RBP: 00007ffece772140 R08: 0000000000002800 R09: 0000000000000007
[ 29.752743] R10: 0000000000000180 R11: 0000000000000202 R12: 000055f9cfa08000
[ 29.754262] R13: 0000000000000004 R14: 0000000000002800 R15: 00000000000009af
[ 29.755797] </TASK>
[ 29.756285] Modules linked in: zram virtiofs
I'm testing with PROVE_LOCKING on. It seems folio_unmap_invalidate() is
called for a swapcache folio and doesn't work well there. The following
patch on top of mm-unstable seems to fix it:
diff --git a/mm/filemap.c b/mm/filemap.c
index 4fe551037bf7..98493443d120 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1605,8 +1605,9 @@ static void folio_end_reclaim_write(struct folio *folio)
 	 * invalidation in that case.
 	 */
 	if (in_task() && folio_trylock(folio)) {
-		if (folio->mapping)
-			folio_unmap_invalidate(folio->mapping, folio, 0);
+		struct address_space *mapping = folio_mapping(folio);
+		if (mapping)
+			folio_unmap_invalidate(mapping, folio, 0);
 		folio_unlock(folio);
 	}
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index e922ceb66c44..4f3e34c52d8b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -565,23 +565,29 @@ int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
 	if (!filemap_release_folio(folio, gfp))
 		return -EBUSY;
 
-	spin_lock(&mapping->host->i_lock);
+	if (!folio_test_swapcache(folio)) {
+		spin_lock(&mapping->host->i_lock);
+		BUG_ON(folio_has_private(folio));
+	}
+
 	xa_lock_irq(&mapping->i_pages);
 	if (folio_test_dirty(folio))
 		goto failed;
 
-	BUG_ON(folio_has_private(folio));
 	__filemap_remove_folio(folio, NULL);
 	xa_unlock_irq(&mapping->i_pages);
 	if (mapping_shrinkable(mapping))
 		inode_add_lru(mapping->host);
-	spin_unlock(&mapping->host->i_lock);
+
+	if (!folio_test_swapcache(folio))
+		spin_unlock(&mapping->host->i_lock);
 
 	filemap_free_folio(mapping, folio);
 	return 1;
 failed:
 	xa_unlock_irq(&mapping->i_pages);
-	spin_unlock(&mapping->host->i_lock);
+	if (!folio_test_swapcache(folio))
+		spin_unlock(&mapping->host->i_lock);
 	return -EBUSY;
 }
On Sat, Feb 01, 2025 at 04:01:43PM +0800, Kairui Song wrote:
> On Thu, Jan 30, 2025 at 6:02 PM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > The recently introduced PG_dropbehind allows for freeing folios
> > immediately after writeback. Unlike PG_reclaim, it does not need vmscan
> > to be involved to get the folio freed.
> >
> > Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
> > pageout().
> >
> > It is safe to leave PG_dropbehind on the folio if, for some reason
> > (bug?), the folio is not in a writeback state after ->writepage().
> > In these cases, the kernel had to clear PG_reclaim as it shared a page
> > flag bit with PG_readahead.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > ---
> > mm/vmscan.c | 9 +++------
> > 1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bc1826020159..c97adb0fdaa4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
> >  	if (shmem_mapping(mapping) && folio_test_large(folio))
> >  		wbc.list = folio_list;
> > 
> > -	folio_set_reclaim(folio);
> > +	folio_set_dropbehind(folio);
> > +
> >  	res = mapping->a_ops->writepage(&folio->page, &wbc);
> >  	if (res < 0)
> >  		handle_write_error(mapping, folio, res);
> >  	if (res == AOP_WRITEPAGE_ACTIVATE) {
> > -		folio_clear_reclaim(folio);
> > +		folio_clear_dropbehind(folio);
> >  		return PAGE_ACTIVATE;
> >  	}
> > 
> > -	if (!folio_test_writeback(folio)) {
> > -		/* synchronous write or broken a_ops? */
> > -		folio_clear_reclaim(folio);
> > -	}
> >  	trace_mm_vmscan_write_folio(folio);
> >  	node_stat_add_folio(folio, NR_VMSCAN_WRITE);
> >  	return PAGE_SUCCESS;
> > --
> > 2.47.2
> >
>
> Hi, I'm seeing the following panic with SWAP after this commit:
>
> [ 29.672319] Oops: general protection fault, probably for
> non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI
> [ 29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not
> tainted 6.13.0.ptch-g1fe9ea48ec98 #917
> [ 29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> [ 29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0
Ouch.
I failed to trigger it on my setup. Could you share your reproducer?
> I'm testing with PROVE_LOCKING on. It seems folio_unmap_invalidate() is
> called for a swapcache folio and doesn't work well there. The following
> patch on top of mm-unstable seems to fix it:
Right. I don't understand swapping well enough. I missed this.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4fe551037bf7..98493443d120 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1605,8 +1605,9 @@ static void folio_end_reclaim_write(struct folio *folio)
>  	 * invalidation in that case.
>  	 */
>  	if (in_task() && folio_trylock(folio)) {
> -		if (folio->mapping)
> -			folio_unmap_invalidate(folio->mapping, folio, 0);
> +		struct address_space *mapping = folio_mapping(folio);
> +		if (mapping)
> +			folio_unmap_invalidate(mapping, folio, 0);
>  		folio_unlock(folio);
>  	}
>  }
Once you do this, folio_unmap_invalidate() will never succeed for
swapcache, as the folio->mapping != mapping check will always be true
and it will fail with -EBUSY.
I guess we need to do something similar to what __remove_mapping() does
for swapcache folios.
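
For reference, the swapcache side of __remove_mapping() looks roughly
like this. This is a simplified sketch of mm/vmscan.c; the exact helpers
and their order may differ between versions:

	if (folio_test_swapcache(folio)) {
		swp_entry_t swap = folio->swap;

		/* Leave a shadow entry behind for workingset detection. */
		if (reclaimed && !mapping_exiting(mapping))
			shadow = workingset_eviction(folio, target_memcg);
		mem_cgroup_swapout(folio, swap);
		__delete_from_swap_cache(folio, swap, shadow);
		xa_unlock_irq(&mapping->i_pages);
		put_swap_folio(folio, swap);
	} else {
		/* Regular pagecache removal, under mapping->host->i_lock. */
		...
	}

Note that the swapcache branch never takes mapping->host->i_lock, which
appears to be exactly the lock folio_unmap_invalidate() trips over in
the report above.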
--
Kiryl Shutsemau / Kirill A. Shutemov
On (25/02/03 10:39), Kirill A. Shutemov wrote:
> > Hi, I'm seeing the following panic with SWAP after this commit:
> >
> > [ 29.672319] Oops: general protection fault, probably for
> > non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI
> > [ 29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not
> > tainted 6.13.0.ptch-g1fe9ea48ec98 #917
> > [ 29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> > [ 29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0
>
> Ouch.
>
> I failed to trigger it on my setup. Could you share your reproducer?

I'm seeing this as well (backtraces below). My repro is:
- 4GB VM with 2 zram devices
- one is setup as swap
- the other one has ext4 fs on it
- I dd large files to it

---

xa_lock_irq(&mapping->i_pages):

[ 94.609589][ T157] Oops: general protection fault, probably for non-canonical address 0xe01ffbf11020301a: 0000 [#1] PREEMPT SMP KASAN PTI
[ 94.611881][ T157] KASAN: maybe wild-memory-access in range [0x00ffff88810180d0-0x00ffff88810180d7]
[ 94.613567][ T157] CPU: 1 UID: 0 PID: 157 Comm: kswapd0 Not tainted 6.13.0+ #927
[ 94.614947][ T157] RIP: 0010:__lock_acquire+0x6a/0x1ef0
[ 94.615942][ T157] Code: 08 84 d2 0f 85 ed 13 00 00 44 8b 05 24 30 d5 02 45 85 c0 0f 84 bc 07 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 0f 85 eb 18 00 00 49 8b 04 24 48 3d a0 8b ac 84 0f 84
[ 94.619668][ T157] RSP: 0018:ffff88810510eec0 EFLAGS: 00010002
[ 94.620835][ T157] RAX: dffffc0000000000 RBX: 1ffff11020a21df5 RCX: 1ffffffff084c092
[ 94.622329][ T157] RDX: 001ffff11020301a RSI: 0000000000000000 RDI: 00ffff88810180d1
[ 94.623779][ T157] RBP: 00ffff88810180d1 R08: 0000000000000001 R09: 0000000000000000
[ 94.625213][ T157] R10: ffffffff8425d0d7 R11: 0000000000000000 R12: 00ffff88810180d1
[ 94.626656][ T157] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[ 94.628086][ T157] FS: 0000000000000000(0000) GS:ffff88815aa80000(0000) knlGS:0000000000000000
[ 94.629700][ T157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 94.630894][ T157] CR2: 00007f757719c2b0 CR3: 0000000003c82005 CR4: 0000000000770ef0
[ 94.632333][ T157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 94.633796][ T157] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 94.635265][ T157] PKRU: 55555554
[ 94.635909][ T157] Call Trace:
[ 94.636512][ T157] <TASK>
[ 94.637052][ T157] ? show_trace_log_lvl+0x1a7/0x2e0
[ 94.638005][ T157] ? show_trace_log_lvl+0x1a7/0x2e0
[ 94.638960][ T157] ? lock_acquire.part.0+0xfa/0x310
[ 94.639909][ T157] ? __die_body.cold+0x8/0x12
[ 94.640765][ T157] ? die_addr+0x42/0x70
[ 94.641530][ T157] ? exc_general_protection+0x12e/0x210
[ 94.642558][ T157] ? asm_exc_general_protection+0x22/0x30
[ 94.643610][ T157] ? __lock_acquire+0x6a/0x1ef0
[ 94.644506][ T157] ? _raw_spin_unlock_irq+0x24/0x40
[ 94.645468][ T157] ? __wait_for_common+0x2f2/0x610
[ 94.646412][ T157] ? pci_mmcfg_reserved+0x120/0x120
[ 94.647364][ T157] ? submit_bio_noacct_nocheck+0x32e/0x3e0
[ 94.648448][ T157] ? lock_is_held_type+0x81/0xe0
[ 94.649360][ T157] lock_acquire.part.0+0xfa/0x310
[ 94.650288][ T157] ? folio_unmap_invalidate+0x286/0x550
[ 94.651324][ T157] ? __lock_acquire+0x1ef0/0x1ef0
[ 94.652250][ T157] ? submit_bio_wait+0x17c/0x200
[ 94.653166][ T157] ? submit_bio_wait_endio+0x40/0x40
[ 94.654140][ T157] ? lock_acquire+0x18a/0x1f0
[ 94.655008][ T157] _raw_spin_lock+0x2c/0x40
[ 94.655853][ T157] ? folio_unmap_invalidate+0x286/0x550
[ 94.656879][ T157] folio_unmap_invalidate+0x286/0x550
[ 94.657866][ T157] folio_end_writeback+0x146/0x190
[ 94.658815][ T157] swap_writepage_bdev_sync+0x312/0x410
[ 94.659840][ T157] ? swap_read_folio_bdev_sync+0x3c0/0x3c0
[ 94.660917][ T157] ? do_raw_spin_lock+0x12a/0x260
[ 94.661845][ T157] ? __rwlock_init+0x150/0x150
[ 94.662726][ T157] ? bio_kmalloc+0x20/0x20
[ 94.663548][ T157] ? swapcache_clear+0xd0/0xd0
[ 94.664431][ T157] swap_writepage+0x2a5/0x720
[ 94.665298][ T157] pageout+0x304/0x6a0
[ 94.666052][ T157] ? get_pte_pfn.isra.0+0x4d0/0x4d0
[ 94.667025][ T157] ? find_held_lock+0x2d/0x110
[ 94.667912][ T157] ? enable_swap_slots_cache+0x90/0x90
[ 94.668925][ T157] ? arch_tlbbatch_flush+0x1f6/0x370
[ 94.669903][ T157] shrink_folio_list+0x19b5/0x2600
[ 94.670856][ T157] ? pageout+0x6a0/0x6a0
[ 94.671649][ T157] ? isolate_folios+0x156/0x320
[ 94.672544][ T157] ? find_held_lock+0x2d/0x110
[ 94.673428][ T157] ? mark_lock+0xcc/0x12c0
[ 94.674258][ T157] ? mark_lock_irq+0x1cd0/0x1cd0
[ 94.675174][ T157] ? reacquire_held_locks+0x4d0/0x4d0
[ 94.676166][ T157] ? mark_held_locks+0x94/0xe0
[ 94.677045][ T157] evict_folios+0x4bb/0x1580
[ 94.677890][ T157] ? isolate_folios+0x320/0x320
[ 94.678787][ T157] ? __lock_acquire+0xc4c/0x1ef0
[ 94.679695][ T157] ? lock_is_held_type+0x81/0xe0
[ 94.680607][ T157] try_to_shrink_lruvec+0x41e/0x9e0
[ 94.681564][ T157] ? __lock_acquire+0xc4c/0x1ef0
[ 94.682482][ T157] ? evict_folios+0x1580/0x1580
[ 94.683390][ T157] ? lock_release+0x105/0x260
[ 94.684255][ T157] lru_gen_shrink_node+0x25d/0x660
[ 94.685202][ T157] ? balance_pgdat+0x5b5/0xf00
[ 94.686083][ T157] ? try_to_shrink_lruvec+0x9e0/0x9e0
[ 94.687076][ T157] ? pgdat_balanced+0xb8/0x110
[ 94.687957][ T157] balance_pgdat+0x532/0xf00
[ 94.688803][ T157] ? shrink_node.part.0+0xc30/0xc30
[ 94.689758][ T157] ? io_schedule_timeout+0x110/0x110
[ 94.690741][ T157] ? reacquire_held_locks+0x4d0/0x4d0
[ 94.691723][ T157] ? __lock_acquire+0x1ef0/0x1ef0
[ 94.692643][ T157] ? zone_watermark_ok_safe+0x32/0x290
[ 94.693650][ T157] ? inactive_is_low.isra.0+0xe0/0xe0
[ 94.694639][ T157] ? do_raw_spin_lock+0x12a/0x260
[ 94.695567][ T157] kswapd+0x2ef/0x4e0
[ 94.696297][ T157] ? balance_pgdat+0xf00/0xf00
[ 94.697176][ T157] ? __kthread_parkme+0xb1/0x1c0
[ 94.698087][ T157] ? balance_pgdat+0xf00/0xf00
[ 94.698971][ T157] kthread+0x38b/0x700
[ 94.699721][ T157] ? kthread_is_per_cpu+0xb0/0xb0
[ 94.700648][ T157] ? lock_acquire+0x18a/0x1f0
[ 94.701516][ T157] ? kthread_is_per_cpu+0xb0/0xb0
[ 94.702438][ T157] ret_from_fork+0x2d/0x70
[ 94.703267][ T157] ? kthread_is_per_cpu+0xb0/0xb0
[ 94.704193][ T157] ret_from_fork_asm+0x11/0x20
[ 94.705074][ T157] </TASK>

Also UAF in compactd

[ 95.249096][ T146] ==================================================================
[ 95.254091][ T146] BUG: KASAN: slab-use-after-free in kcompactd+0x9cd/0xa60
[ 95.257959][ T146] Read of size 4 at addr ffff888105100018 by task kcompactd0/146
[ 95.262100][ T146]
[ 95.263347][ T146] CPU: 11 UID: 0 PID: 146 Comm: kcompactd0 Tainted: G D W 6.13.0+ #927
[ 95.263363][ T146] Tainted: [D]=DIE, [W]=WARN
[ 95.263367][ T146] Call Trace:
[ 95.263379][ T146] <TASK>
[ 95.263386][ T146] dump_stack_lvl+0x57/0x80
[ 95.263403][ T146] print_address_description.constprop.0+0x88/0x330
[ 95.263416][ T146] ? kcompactd+0x9cd/0xa60
[ 95.263425][ T146] print_report+0xe2/0x1cc
[ 95.263433][ T146] ? __virt_addr_valid+0x1d1/0x3b0
[ 95.263442][ T146] ? kcompactd+0x9cd/0xa60
[ 95.263449][ T146] ? kcompactd+0x9cd/0xa60
[ 95.263456][ T146] kasan_report+0xb9/0x180
[ 95.263466][ T146] ? kcompactd+0x9cd/0xa60
[ 95.263476][ T146] kcompactd+0x9cd/0xa60
[ 95.263487][ T146] ? kcompactd_do_work+0x710/0x710
[ 95.263495][ T146] ? prepare_to_swait_exclusive+0x260/0x260
[ 95.263506][ T146] ? __kthread_parkme+0xb1/0x1c0
[ 95.263520][ T146] ? kcompactd_do_work+0x710/0x710
[ 95.263527][ T146] kthread+0x38b/0x700
[ 95.263535][ T146] ? kthread_is_per_cpu+0xb0/0xb0
[ 95.263542][ T146] ? lock_acquire+0x18a/0x1f0
[ 95.263552][ T146] ? kthread_is_per_cpu+0xb0/0xb0
[ 95.263559][ T146] ret_from_fork+0x2d/0x70
[ 95.263569][ T146] ? kthread_is_per_cpu+0xb0/0xb0
[ 95.263576][ T146] ret_from_fork_asm+0x11/0x20
[ 95.263589][ T146] </TASK>
[ 95.263592][ T146]
[ 95.293474][ T146] Allocated by task 2:
[ 95.294209][ T146] kasan_save_stack+0x1e/0x40
[ 95.295111][ T146] kasan_save_track+0x10/0x30
[ 95.295978][ T146] __kasan_slab_alloc+0x62/0x70
[ 95.296860][ T146] kmem_cache_alloc_node_noprof+0xdb/0x2a0
[ 95.297915][ T146] dup_task_struct+0x32/0x550
[ 95.298797][ T146] copy_process+0x309/0x45d0
[ 95.299656][ T146] kernel_clone+0xb7/0x600
[ 95.300451][ T146] kernel_thread+0xb0/0xe0
[ 95.301253][ T146] kthreadd+0x3b5/0x620
[ 95.302019][ T146] ret_from_fork+0x2d/0x70
[ 95.302865][ T146] ret_from_fork_asm+0x11/0x20
[ 95.303724][ T146]
[ 95.304146][ T146] Freed by task 0:
[ 95.304836][ T146] kasan_save_stack+0x1e/0x40
[ 95.305708][ T146] kasan_save_track+0x10/0x30
[ 95.306569][ T146] kasan_save_free_info+0x37/0x50
[ 95.307515][ T146] __kasan_slab_free+0x33/0x40
[ 95.308402][ T146] kmem_cache_free+0xff/0x480
[ 95.309256][ T146] delayed_put_task_struct+0x15a/0x1d0
[ 95.310258][ T146] rcu_do_batch+0x2ee/0xb70
[ 95.311113][ T146] rcu_core+0x4a6/0xa10
[ 95.311868][ T146] handle_softirqs+0x191/0x650
[ 95.312747][ T146] __irq_exit_rcu+0xaf/0xe0
[ 95.313643][ T146] irq_exit_rcu+0xa/0x20
[ 95.314536][ T146] sysvec_apic_timer_interrupt+0x65/0x80
[ 95.315616][ T146] asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 95.316702][ T146]
[ 95.317127][ T146] Last potentially related work creation:
[ 95.318155][ T146] kasan_save_stack+0x1e/0x40
[ 95.319006][ T146] kasan_record_aux_stack+0x97/0xa0
[ 95.319947][ T146] __call_rcu_common.constprop.0+0x70/0x7b0
[ 95.321014][ T146] __schedule+0x75d/0x1720
[ 95.321817][ T146] schedule_idle+0x55/0x80
[ 95.322624][ T146] cpu_startup_entry+0x50/0x60
[ 95.323490][ T146] start_secondary+0x1b6/0x210
[ 95.324354][ T146] common_startup_64+0x12c/0x138
[ 95.325248][ T146]
[ 95.325669][ T146] The buggy address belongs to the object at ffff888105100000
[ 95.325669][ T146]  which belongs to the cache task_struct of size 8200
[ 95.328215][ T146] The buggy address is located 24 bytes inside of
[ 95.328215][ T146]  freed 8200-byte region [ffff888105100000, ffff888105102008)
[ 95.330692][ T146]
[ 95.331116][ T146] The buggy address belongs to the physical page:
[ 95.332275][ T146] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x105100
[ 95.333862][ T146] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 95.335399][ T146] flags: 0x8000000000000040(head|zone=2)
[ 95.336425][ T146] page_type: f5(slab)
[ 95.337155][ T146] raw: 8000000000000040 ffff888100a80c80 dead000000000122 0000000000000000
[ 95.338716][ T146] raw: 0000000000000000 0000000000030003 00000000f5000000 0000000000000000
[ 95.340273][ T146] head: 8000000000000040 ffff888100a80c80 dead000000000122 0000000000000000
[ 95.341844][ T146] head: 0000000000000000 0000000000030003 00000000f5000000 0000000000000000
[ 95.343418][ T146] head: 8000000000000003 ffffea0004144001 ffffffffffffffff 0000000000000000
[ 95.344977][ T146] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[ 95.346540][ T146] page dumped because: kasan: bad access detected
[ 95.347701][ T146]
[ 95.348123][ T146] Memory state around the buggy address:
[ 95.349139][ T146]  ffff8881050fff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 95.350598][ T146]  ffff8881050fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 95.352054][ T146] >ffff888105100000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 95.353510][ T146]                    ^
[ 95.354389][ T146]  ffff888105100080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 95.355856][ T146]  ffff888105100100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 95.357315][ T146] ==================================================================
On Mon, 3 Feb 2025 10:39:58 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 4fe551037bf7..98493443d120 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1605,8 +1605,9 @@ static void folio_end_reclaim_write(struct folio *folio)
> >  	 * invalidation in that case.
> >  	 */
> >  	if (in_task() && folio_trylock(folio)) {
> > -		if (folio->mapping)
> > -			folio_unmap_invalidate(folio->mapping, folio, 0);
> > +		struct address_space *mapping = folio_mapping(folio);
> > +		if (mapping)
> > +			folio_unmap_invalidate(mapping, folio, 0);
> >  		folio_unlock(folio);
> >  	}
> >  }
>
> Once you do this, folio_unmap_invalidate() will never succeed for
> swapcache, as the folio->mapping != mapping check will always be true
> and it will fail with -EBUSY.
>
> I guess we need to do something similar to what __remove_mapping() does
> for swapcache folios.
Thanks, I'll drop the v3 series from mm.git.
On Sat, Feb 1, 2025 at 1:02 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Jan 30, 2025 at 6:02 PM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > The recently introduced PG_dropbehind allows for freeing folios
> > immediately after writeback. Unlike PG_reclaim, it does not need vmscan
> > to be involved to get the folio freed.
> >
> > Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
> > pageout().
> >
> > It is safe to leave PG_dropbehind on the folio if, for some reason
> > (bug?), the folio is not in a writeback state after ->writepage().
> > In these cases, the kernel had to clear PG_reclaim as it shared a page
> > flag bit with PG_readahead.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > ---
> > mm/vmscan.c | 9 +++------
> > 1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bc1826020159..c97adb0fdaa4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
> >  	if (shmem_mapping(mapping) && folio_test_large(folio))
> >  		wbc.list = folio_list;
> > 
> > -	folio_set_reclaim(folio);
> > +	folio_set_dropbehind(folio);
> > +
> >  	res = mapping->a_ops->writepage(&folio->page, &wbc);
> >  	if (res < 0)
> >  		handle_write_error(mapping, folio, res);
> >  	if (res == AOP_WRITEPAGE_ACTIVATE) {
> > -		folio_clear_reclaim(folio);
> > +		folio_clear_dropbehind(folio);
> >  		return PAGE_ACTIVATE;
> >  	}
> > 
> > -	if (!folio_test_writeback(folio)) {
> > -		/* synchronous write or broken a_ops? */
> > -		folio_clear_reclaim(folio);
> > -	}
> >  	trace_mm_vmscan_write_folio(folio);
> >  	node_stat_add_folio(folio, NR_VMSCAN_WRITE);
> >  	return PAGE_SUCCESS;
> > --
> > 2.47.2
> >
>
> Hi, I'm seeing the following panic with SWAP after this commit:
>
> [ 29.672319] Oops: general protection fault, probably for
> non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI
> [ 29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not
> tainted 6.13.0.ptch-g1fe9ea48ec98 #917
> [ 29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> [ 29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0
> [ 29.679891] Code: 90 90 90 90 90 90 90 90 90 90 41 57 41 56 41 55
> 41 54 55 53 48 83 ec 30 8b 2d 10 ac f3 01 44 8b ac 24 88 00 00 00 85
> ed 74 64 <48> 8b 07 49 89 ff 48 3d 20 1d bf 83 74 56 8b 1d 8c f5 b1 01
> 41 89
> [ 29.683852] RSP: 0018:ffffc9000bea3148 EFLAGS: 00010002
> [ 29.684980] RAX: ffff8890874b2940 RBX: 0000000000000200 RCX: 0000000000000000
> [ 29.686510] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00ffff88909a3be3
> [ 29.688031] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> [ 29.689561] R10: 0000000000000000 R11: 0000000000000020 R12: 00ffff88909a3be3
> [ 29.691087] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 29.692613] FS: 00007fa05c2824c0(0000) GS:ffff88a03fa80000(0000)
> knlGS:0000000000000000
> [ 29.694339] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 29.695581] CR2: 000055f9abb7fc7d CR3: 00000010932f2002 CR4: 0000000000770eb0
> [ 29.697109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 29.698637] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 29.700161] PKRU: 55555554
> [ 29.700759] Call Trace:
> [ 29.701296] <TASK>
> [ 29.701770] ? __die_body+0x1e/0x60
> [ 29.702540] ? die_addr+0x3c/0x60
> [ 29.703267] ? exc_general_protection+0x18f/0x3c0
> [ 29.704290] ? asm_exc_general_protection+0x26/0x30
> [ 29.705345] ? __lock_acquire+0x20/0x15d0
> [ 29.706215] ? lockdep_hardirqs_on_prepare+0xda/0x190
> [ 29.707304] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 29.708452] lock_acquire+0xbf/0x2e0
> [ 29.709229] ? folio_unmap_invalidate+0x12f/0x220
> [ 29.710257] ? __folio_end_writeback+0x15d/0x430
> [ 29.711260] ? __folio_end_writeback+0x116/0x430
> [ 29.712261] _raw_spin_lock+0x30/0x40
> [ 29.713064] ? folio_unmap_invalidate+0x12f/0x220
> [ 29.714076] folio_unmap_invalidate+0x12f/0x220
> [ 29.715058] folio_end_writeback+0xdf/0x190
> [ 29.715967] swap_writepage_bdev_sync+0x1e0/0x450
> [ 29.716994] ? __pfx_submit_bio_wait_endio+0x10/0x10
> [ 29.718074] swap_writepage+0x46b/0x6b0
> [ 29.718917] pageout+0x14b/0x360
> [ 29.719628] shrink_folio_list+0x67d/0xec0
> [ 29.720519] ? mark_held_locks+0x48/0x80
> [ 29.721375] evict_folios+0x2a7/0x9e0
> [ 29.722179] try_to_shrink_lruvec+0x19a/0x270
> [ 29.723130] lru_gen_shrink_lruvec+0x70/0xc0
> [ 29.724060] ? __lock_acquire+0x558/0x15d0
> [ 29.724954] shrink_lruvec+0x57/0x780
> [ 29.725754] ? find_held_lock+0x2d/0xa0
> [ 29.726588] ? rcu_read_unlock+0x17/0x60
> [ 29.727449] shrink_node+0x2ad/0x930
> [ 29.728229] do_try_to_free_pages+0xbd/0x4e0
> [ 29.729160] try_to_free_mem_cgroup_pages+0x123/0x2c0
> [ 29.730252] try_charge_memcg+0x222/0x660
> [ 29.731128] charge_memcg+0x3c/0x80
> [ 29.731888] __mem_cgroup_charge+0x30/0x70
> [ 29.732776] shmem_alloc_and_add_folio+0x1a5/0x480
> [ 29.733818] ? filemap_get_entry+0x155/0x390
> [ 29.734748] shmem_get_folio_gfp+0x28c/0x6c0
> [ 29.735680] shmem_write_begin+0x5a/0xc0
> [ 29.736535] generic_perform_write+0x12a/0x2e0
> [ 29.737503] shmem_file_write_iter+0x86/0x90
> [ 29.738428] vfs_write+0x364/0x530
> [ 29.739180] ksys_write+0x6c/0xe0
> [ 29.739906] do_syscall_64+0x66/0x140
> [ 29.740713] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 29.741800] RIP: 0033:0x7fa05c439984
> [ 29.742584] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
> 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 06 0e 00 00 74 13 b8 01 00 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20
> 48 89
> [ 29.746542] RSP: 002b:00007ffece7720f8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000001
> [ 29.748157] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007fa05c439984
> [ 29.749682] RDX: 0000000000002800 RSI: 000055f9cfa08000 RDI: 0000000000000004
> [ 29.751216] RBP: 00007ffece772140 R08: 0000000000002800 R09: 0000000000000007
> [ 29.752743] R10: 0000000000000180 R11: 0000000000000202 R12: 000055f9cfa08000
> [ 29.754262] R13: 0000000000000004 R14: 0000000000002800 R15: 00000000000009af
> [ 29.755797] </TASK>
> [ 29.756285] Modules linked in: zram virtiofs
>
> I'm testing with PROVE_LOCKING on. It seems folio_unmap_invalidate() is
> called for a swapcache folio and doesn't work well there. The following
> patch on top of mm-unstable seems to fix it:
I think there is a bigger problem here. folio_end_reclaim_write()
currently calls folio_unmap_invalidate() to remove the mapping, and
that's very different from what __remove_mapping() does in the reclaim
path: not only does it break the swapcache case, the shadow entry is
also lost.
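
To make the shadow point concrete, the two paths part ways roughly like
this on the pagecache side (a paraphrase of mm/truncate.c vs mm/vmscan.c,
not the exact code):

	/* folio_unmap_invalidate(): no shadow entry is recorded */
	__filemap_remove_folio(folio, NULL);

	/* __remove_mapping(): a workingset shadow is preserved */
	shadow = workingset_eviction(folio, target_memcg);
	__filemap_remove_folio(folio, shadow);

So even for plain pagecache, dropping the folio at writeback completion
loses the refault information that the reclaim path would have kept.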
On Thu, Jan 30, 2025 at 12:00:44PM +0200, Kirill A. Shutemov wrote:
> The recently introduced PG_dropbehind allows for freeing folios
> immediately after writeback. Unlike PG_reclaim, it does not need vmscan
> to be involved to get the folio freed.
>
> Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
> pageout().
>
> It is safe to leave PG_dropbehind on the folio if, for some reason
> (bug?), the folio is not in a writeback state after ->writepage().
> In these cases, the kernel had to clear PG_reclaim as it shared a page
> flag bit with PG_readahead.
Is it correct to say that leaving PG_dropbehind on folios which don't
have writeback state after ->writepage() (i.e. a store to zswap) is fine
because PG_dropbehind is not in PAGE_FLAGS_CHECK_AT_FREE?
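
(The check in question is essentially the following, paraphrased from
the page freeing path in mm/page_alloc.c; the exact mask contents are
config- and version-dependent:

	/* free_pages_prepare() rejects pages with any of these set: */
	if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_FREE))
		bad_page(page, "PAGE_FLAGS_CHECK_AT_FREE flag(s) set");

PG_locked, PG_writeback, PG_lru and friends are in the mask; PG_dropbehind
is not, so a folio can legitimately reach the allocator with it still set.)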
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
> mm/vmscan.c | 9 +++------
> 1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc1826020159..c97adb0fdaa4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
>  	if (shmem_mapping(mapping) && folio_test_large(folio))
>  		wbc.list = folio_list;
> 
> -	folio_set_reclaim(folio);
> +	folio_set_dropbehind(folio);
> +
>  	res = mapping->a_ops->writepage(&folio->page, &wbc);
>  	if (res < 0)
>  		handle_write_error(mapping, folio, res);
>  	if (res == AOP_WRITEPAGE_ACTIVATE) {
> -		folio_clear_reclaim(folio);
> +		folio_clear_dropbehind(folio);
>  		return PAGE_ACTIVATE;
>  	}
> 
> -	if (!folio_test_writeback(folio)) {
> -		/* synchronous write or broken a_ops? */
> -		folio_clear_reclaim(folio);
> -	}
>  	trace_mm_vmscan_write_folio(folio);
>  	node_stat_add_folio(folio, NR_VMSCAN_WRITE);
>  	return PAGE_SUCCESS;
> --
> 2.47.2
>