From: Zhang Yi <yi.zhang@huawei.com>

Hello!

Currently, online defragmentation in ext4 is primarily implemented
through the move extent operation in the kernel. This operation works at
PAGE_SIZE granularity, iteratively swapping extents and moving data,
which is quite inefficient. Now that ext4 supports large folios,
iterating at PAGE_SIZE granularity is no longer practical and fails to
leverage the advantages of large folios. Additionally, the current
implementation is tightly coupled with buffer_head, so it cannot be
supported once the buffered I/O path is converted to the iomap
infrastructure.

This patch set (based on 6.17-rc7) optimizes the extent-moving process,
deprecates the old move_extent_per_page() interface, and introduces a
new mext_move_extent() interface. The new interface iterates over and
copies data based on the extents of the original file instead of
PAGE_SIZE, and supports large folios. The data processing logic in the
iteration remains largely consistent with the previous version, with no
additional optimizations or changes made.

Additionally, the primary objective of this set of patches is to prepare
for converting the buffered I/O process for regular files to the iomap
infrastructure. These patches decouple buffer_head from the main
extent-moving process, restricting its use to the helpers
mext_folio_mkwrite() and mext_folio_mkuptodate(), which handle updating
pages in the swapped page cache and marking them dirty. The overall
coding style of the extent-moving process aligns with the iomap
infrastructure, laying the foundation for supporting online
defragmentation once the iomap infrastructure is adopted.

Patch overview:

Patch 1: Fix an off-by-one issue.
Patch 2: Fix a minor issue related to validity checking.
Patch 3-5: Introduce a sequence counter for the mapping extent status
           tree; this also prepares for the iomap infrastructure.
Patch 6-8: Refactor the mext_check_arguments() helper function and the
           validity checking to improve code readability.
Patch 9-13: Drop move_extent_per_page() and switch to using the new
            mext_move_extent(). Additionally, add support for large
            folios.

With this patch set, the efficiency of online defragmentation for the
ext4 file system is also improved under general circumstances. Below is
a set of typical test results obtained using the fio e4defrag ioengine
in an environment with an Intel Xeon Gold 6240 CPU, 400G of memory and
an NVMe SSD device.

  [defrag]
  directory=/mnt
  filesize=400G
  buffered=1
  fadvise_hint=0
  ioengine=e4defrag
  bs=4k         # 4k,32k,128k
  donorname=test.def
  filename=test
  inplace=0
  rw=write
  overwrite=0   # 0 for unwritten extent and 1 for written extent
  numjobs=1
  iodepth=1
  runtime=30s

  [w/o]
   U 4k:    IOPS=225k,  BW=877MiB/s    # U: unwritten extent-moving
   U 32k:   IOPS=33.2k, BW=1037MiB/s
   U 128k:  IOPS=8510,  BW=1064MiB/s
   M 4k:    IOPS=19.8k, BW=77.2MiB/s   # M: written extent-moving
   M 32k:   IOPS=2502,  BW=78.2MiB/s
   M 128k:  IOPS=635,   BW=79.5MiB/s

  [w]
   U 4k:    IOPS=246k,  BW=963MiB/s
   U 32k:   IOPS=209k,  BW=6529MiB/s
   U 128k:  IOPS=146k,  BW=17.8GiB/s
   M 4k:    IOPS=19.5k, BW=76.2MiB/s
   M 32k:   IOPS=4091,  BW=128MiB/s
   M 128k:  IOPS=2814,  BW=352MiB/s

Best Regards,
Yi.
Zhang Yi (13):
  ext4: fix an off-by-one issue during moving extents
  ext4: correct the checking of quota files before moving extents
  ext4: introduce seq counter for the extent status entry
  ext4: make ext4_es_lookup_extent() pass out the extent seq counter
  ext4: pass out extent seq counter when mapping blocks
  ext4: use EXT4_B_TO_LBLK() in mext_check_arguments()
  ext4: add mext_check_validity() to do basic check
  ext4: refactor mext_check_arguments()
  ext4: rename mext_page_mkuptodate() to mext_folio_mkuptodate()
  ext4: introduce mext_move_extent()
  ext4: switch to using the new extent movement method
  ext4: add large folios support for moving extents
  ext4: add two trace points for moving extents

 fs/ext4/ext4.h              |   3 +
 fs/ext4/extents.c           |   2 +-
 fs/ext4/extents_status.c    |  27 +-
 fs/ext4/extents_status.h    |   2 +-
 fs/ext4/inode.c             |  28 +-
 fs/ext4/ioctl.c             |  10 -
 fs/ext4/move_extent.c       | 773 ++++++++++++++++--------------------
 fs/ext4/super.c             |   1 +
 include/trace/events/ext4.h |  97 ++++-
 9 files changed, 486 insertions(+), 457 deletions(-)

-- 
2.46.1
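To make the granularity argument in the cover letter concrete, here is a minimal user-space sketch that counts loop iterations for the two strategies. It is not code from the series: struct demo_extent, steps_by_page() and steps_by_extent() are hypothetical names, and the model ignores folio-size limits, locking and anything else the real mext_move_extent() has to handle. It assumes the extents are sorted and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent: logical blocks [lblk, lblk + len). */
struct demo_extent {
	uint64_t lblk;
	uint64_t len;
};

/* Iterations needed when every step stops at the next page boundary. */
static uint64_t steps_by_page(uint64_t start, uint64_t count,
			      uint64_t blocks_per_page)
{
	uint64_t end = start + count;
	uint64_t steps = 0;

	assert(blocks_per_page > 0);
	while (start < end) {
		/* Advance to the next page boundary or the end of the range. */
		uint64_t next = (start / blocks_per_page + 1) * blocks_per_page;

		start = next < end ? next : end;
		steps++;
	}
	return steps;
}

/*
 * Iterations needed when every step covers as much of the current
 * extent as the requested range allows. Holes between extents are
 * skipped without costing a step.
 */
static uint64_t steps_by_extent(const struct demo_extent *ext, size_t nr,
				uint64_t start, uint64_t count)
{
	uint64_t end = start + count;
	uint64_t steps = 0;

	for (size_t i = 0; i < nr && start < end; i++) {
		uint64_t ext_end = ext[i].lblk + ext[i].len;

		if (ext_end <= start)
			continue;	/* extent lies before the cursor */
		if (ext[i].lblk >= end)
			break;		/* extent lies after the range */
		start = ext_end < end ? ext_end : end;
		steps++;
	}
	return steps;
}
```

With 4k blocks and 4k pages (one block per page), a 128k request spans 32 blocks: steps_by_page(0, 32, 1) takes 32 iterations, while steps_by_extent() over a single 32-block extent takes one. How much of that difference shows up as throughput depends on the per-iteration cost, which this sketch does not model.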
syzbot ci has tested the following series

[v1] ext4: optimize online defragment
https://lore.kernel.org/all/20250923012724.2378858-1-yi.zhang@huaweicloud.com
* [PATCH 01/13] ext4: fix an off-by-one issue during moving extents
* [PATCH 02/13] ext4: correct the checking of quota files before moving extents
* [PATCH 03/13] ext4: introduce seq counter for the extent status entry
* [PATCH 04/13] ext4: make ext4_es_lookup_extent() pass out the extent seq counter
* [PATCH 05/13] ext4: pass out extent seq counter when mapping blocks
* [PATCH 06/13] ext4: use EXT4_B_TO_LBLK() in mext_check_arguments()
* [PATCH 07/13] ext4: add mext_check_validity() to do basic check
* [PATCH 08/13] ext4: refactor mext_check_arguments()
* [PATCH 09/13] ext4: rename mext_page_mkuptodate() to mext_folio_mkuptodate()
* [PATCH 10/13] ext4: introduce mext_move_extent()
* [PATCH 11/13] ext4: switch to using the new extent movement method
* [PATCH 12/13] ext4: add large folios support for moving extents
* [PATCH 13/13] ext4: add two trace points for moving extents

and found the following issues:
* KASAN: slab-out-of-bounds Read in ext4_inode_journal_mode
* general protection fault in ext4_inode_journal_mode

Full report is available here:
https://ci.syzbot.org/series/89adca9b-1e59-47cd-8ba6-0a57d76309c9

***

KASAN: slab-out-of-bounds Read in ext4_inode_journal_mode

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      07e27ad16399afcd693be20211b0dfae63e0615f
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/17d2b187-99c8-4493-9c72-e8fcf7741d20/config
C repro:   https://ci.syzbot.org/findings/b98c412d-c481-4663-b80b-a50550db3406/c_repro
syz repro: https://ci.syzbot.org/findings/b98c412d-c481-4663-b80b-a50550db3406/syz_repro

EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: writeback.
ext4 filesystem being mounted at /0/bus supports timestamps until 2038-01-19 (0x7fffffff)
==================================================================
BUG: KASAN: slab-out-of-bounds in ext4_inode_journal_mode+0x7b/0x480 fs/ext4/ext4_jbd2.c:12
Read of size 8 at addr ffff88801cefc378 by task syz.0.17/5984

CPU: 0 UID: 0 PID: 5984 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xca/0x240 mm/kasan/report.c:482
 kasan_report+0x118/0x150 mm/kasan/report.c:595
 ext4_inode_journal_mode+0x7b/0x480 fs/ext4/ext4_jbd2.c:12
 ext4_should_journal_data fs/ext4/ext4_jbd2.h:381 [inline]
 mext_check_validity fs/ext4/move_extent.c:426 [inline]
 ext4_move_extents+0x2bb/0x3630 fs/ext4/move_extent.c:579
 __ext4_ioctl fs/ext4/ioctl.c:1356 [inline]
 ext4_ioctl+0x26a7/0x33c0 fs/ext4/ioctl.c:1616
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:598 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:584
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f6a6678ec29
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffea3688b38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f6a669d5fa0 RCX: 00007f6a6678ec29
RDX: 0000200000000040 RSI: 00000000c028660f RDI: 0000000000000004
RBP: 00007f6a66811e41 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f6a669d5fa0 R14: 00007f6a669d5fa0 R15: 0000000000000003
 </TASK>

Allocated by task 1:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:388 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:405
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4407
 kmalloc_noprof include/linux/slab.h:905 [inline]
 kzalloc_noprof include/linux/slab.h:1039 [inline]
 shmem_fill_super+0xc8/0x1190 mm/shmem.c:5059
 vfs_get_super fs/super.c:1325 [inline]
 get_tree_nodev+0xbb/0x150 fs/super.c:1344
 vfs_get_tree+0x92/0x2b0 fs/super.c:1815
 fc_mount fs/namespace.c:1247 [inline]
 vfs_kern_mount+0xbe/0x160 fs/namespace.c:1286
 devtmpfs_init+0x98/0x330 drivers/base/devtmpfs.c:484
 driver_init+0x15/0x60 drivers/base/init.c:25
 do_basic_setup+0xf/0x70 init/main.c:1363
 kernel_init_freeable+0x334/0x4b0 init/main.c:1579
 kernel_init+0x1d/0x1d0 init/main.c:1469
 ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

The buggy address belongs to the object at ffff88801cefc000
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 544 bytes to the right of
 allocated 344-byte region [ffff88801cefc000, ffff88801cefc158)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1cefc
head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000040 ffff88801a441c80 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000100010 00000000f5000000 0000000000000000
head: 00fff00000000040 ffff88801a441c80 dead000000000122 0000000000000000
head: 0000000000000000 0000000000100010 00000000f5000000 0000000000000000
head: 00fff00000000002 ffffea000073bf01 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000004
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 1, tgid 1 (swapper/0), ts 1877776345, free_ts 0
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851
 prep_new_page mm/page_alloc.c:1859 [inline]
 get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858
 __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5148
 alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416
 alloc_slab_page mm/slub.c:2492 [inline]
 allocate_slab+0x8a/0x370 mm/slub.c:2660
 new_slab mm/slub.c:2714 [inline]
 ___slab_alloc+0xbeb/0x1420 mm/slub.c:3901
 __slab_alloc mm/slub.c:3992 [inline]
 __slab_alloc_node mm/slub.c:4067 [inline]
 slab_alloc_node mm/slub.c:4228 [inline]
 __kmalloc_cache_noprof+0x296/0x3d0 mm/slub.c:4402
 kmalloc_noprof include/linux/slab.h:905 [inline]
 kzalloc_noprof include/linux/slab.h:1039 [inline]
 shmem_fill_super+0xc8/0x1190 mm/shmem.c:5059
 vfs_get_super fs/super.c:1325 [inline]
 get_tree_nodev+0xbb/0x150 fs/super.c:1344
 vfs_get_tree+0x92/0x2b0 fs/super.c:1815
 fc_mount fs/namespace.c:1247 [inline]
 vfs_kern_mount+0xbe/0x160 fs/namespace.c:1286
 devtmpfs_init+0x98/0x330 drivers/base/devtmpfs.c:484
 driver_init+0x15/0x60 drivers/base/init.c:25
 do_basic_setup+0xf/0x70 init/main.c:1363
 kernel_init_freeable+0x334/0x4b0 init/main.c:1579
 kernel_init+0x1d/0x1d0 init/main.c:1469
page_owner free stack trace missing

Memory state around the buggy address:
 ffff88801cefc200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88801cefc280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88801cefc300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                                ^
 ffff88801cefc380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88801cefc400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

***

general protection fault in ext4_inode_journal_mode

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      07e27ad16399afcd693be20211b0dfae63e0615f
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/17d2b187-99c8-4493-9c72-e8fcf7741d20/config
syz repro: https://ci.syzbot.org/findings/9f9fdff9-ee39-4921-9a7a-35ab05cc081b/syz_repro

EXT4-fs (loop1): mounted filesystem 76b65be2-f6da-4727-8c75-0525a5b65a09 r/w without journal. Quota mode: none.
ext4 filesystem being mounted at /0/mnt supports timestamps until 2038-01-19 (0x7fffffff)
Oops: general protection fault, probably for non-canonical address 0xdffffc000000006f: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000378-0x000000000000037f]
CPU: 0 UID: 0 PID: 6013 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:ext4_inode_journal_mode+0x6d/0x480 fs/ext4/ext4_jbd2.c:12
Code: 00 4d 03 7d 00 4c 89 f8 48 c1 e8 03 80 3c 28 00 74 08 4c 89 ff e8 03 9e b6 ff 41 bc 78 03 00 00 4d 03 27 4c 89 e0 48 c1 e8 03 <80> 3c 28 00 74 08 4c 89 e7 e8 e5 9d b6 ff 49 83 3c 24 00 0f 84 01
RSP: 0018:ffffc90002d6f638 EFLAGS: 00010206
RAX: 000000000000006f RBX: ffff88811249ad48 RCX: ffff888021ed9cc0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88811249ad48
RBP: dffffc0000000000 R08: ffff88811249ae2f R09: 1ffff110224935c5
R10: dffffc0000000000 R11: ffffed10224935c6 R12: 0000000000000378
R13: ffff88811249ad70 R14: 1ffff110224935ae R15: ffff88801bfa6640
FS:  00007fa922e456c0(0000) GS:ffff8880b8612000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000000040 CR3: 000000010f5f6000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 ext4_should_journal_data fs/ext4/ext4_jbd2.h:381 [inline]
 mext_check_validity fs/ext4/move_extent.c:426 [inline]
 ext4_move_extents+0x2bb/0x3630 fs/ext4/move_extent.c:579
 __ext4_ioctl fs/ext4/ioctl.c:1356 [inline]
 ext4_ioctl+0x26a7/0x33c0 fs/ext4/ioctl.c:1616
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:598 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:584
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa921f8ec29
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fa922e45038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fa9221d5fa0 RCX: 00007fa921f8ec29
RDX: 0000200000000040 RSI: 00000000c028660f RDI: 0000000000000004
RBP: 00007fa922011e41 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fa9221d6038 R14: 00007fa9221d5fa0 R15: 00007ffcaaedca68
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:ext4_inode_journal_mode+0x6d/0x480 fs/ext4/ext4_jbd2.c:12
Code: 00 4d 03 7d 00 4c 89 f8 48 c1 e8 03 80 3c 28 00 74 08 4c 89 ff e8 03 9e b6 ff 41 bc 78 03 00 00 4d 03 27 4c 89 e0 48 c1 e8 03 <80> 3c 28 00 74 08 4c 89 e7 e8 e5 9d b6 ff 49 83 3c 24 00 0f 84 01
RSP: 0018:ffffc90002d6f638 EFLAGS: 00010206
RAX: 000000000000006f RBX: ffff88811249ad48 RCX: ffff888021ed9cc0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88811249ad48
RBP: dffffc0000000000 R08: ffff88811249ae2f R09: 1ffff110224935c5
R10: dffffc0000000000 R11: ffffed10224935c6 R12: 0000000000000378
R13: ffff88811249ad70 R14: 1ffff110224935ae R15: ffff88801bfa6640
FS:  00007fa922e456c0(0000) GS:ffff8881a3c12000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002000000012c0 CR3: 000000010f5f6000 CR4: 00000000000006f0
----------------
Code disassembly (best guess):
   0:   00 4d 03                add    %cl,0x3(%rbp)
   3:   7d 00                   jge    0x5
   5:   4c 89 f8                mov    %r15,%rax
   8:   48 c1 e8 03             shr    $0x3,%rax
   c:   80 3c 28 00             cmpb   $0x0,(%rax,%rbp,1)
  10:   74 08                   je     0x1a
  12:   4c 89 ff                mov    %r15,%rdi
  15:   e8 03 9e b6 ff          call   0xffb69e1d
  1a:   41 bc 78 03 00 00       mov    $0x378,%r12d
  20:   4d 03 27                add    (%r15),%r12
  23:   4c 89 e0                mov    %r12,%rax
  26:   48 c1 e8 03             shr    $0x3,%rax
* 2a:   80 3c 28 00             cmpb   $0x0,(%rax,%rbp,1) <-- trapping instruction
  2e:   74 08                   je     0x38
  30:   4c 89 e7                mov    %r12,%rdi
  33:   e8 e5 9d b6 ff          call   0xffb69e1d
  38:   49 83 3c 24 00          cmpq   $0x0,(%r12)
  3d:   0f                      .byte 0xf
  3e:   84 01                   test   %al,(%rcx)

***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
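Both traces in the report reach ext4_inode_journal_mode() through ext4_should_journal_data(), inlined into mext_check_validity(), and the object in the KASAN report was allocated by shmem_fill_super(). One reading, offered here as an assumption rather than a confirmed diagnosis, is that a filesystem-specific helper ran on an inode that had not yet been shown to belong to the same superblock as the original file. The user-space sketch below models only the ordering of such checks. Every name in it (demo_sb, demo_inode, demo_fs_info, demo_should_journal_data, demo_check_validity) is hypothetical, and the error codes are illustrative; none of it is taken from the ext4 sources.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a superblock with private per-fs data. */
struct demo_sb {
	void *fs_info;		/* layout depends on the owning filesystem */
};

/* Hypothetical stand-in for an inode. */
struct demo_inode {
	struct demo_sb *sb;
};

/* Hypothetical private data of the "demo" filesystem. */
struct demo_fs_info {
	int journal_data;
};

/* Only safe when the inode is known to belong to the demo filesystem. */
static int demo_should_journal_data(const struct demo_inode *inode)
{
	const struct demo_fs_info *info = inode->sb->fs_info;

	return info->journal_data;
}

/* Validate the pair before running any fs-specific helper on donor. */
static int demo_check_validity(const struct demo_inode *orig,
			       const struct demo_inode *donor)
{
	assert(orig && donor);

	/* Same superblock first: this is what makes fs_info trustworthy. */
	if (orig->sb != donor->sb)
		return -EINVAL;

	if (demo_should_journal_data(orig) || demo_should_journal_data(donor))
		return -EOPNOTSUPP;

	return 0;
}
```

With the superblock comparison placed first, a donor from a foreign superblock (including one whose private pointer is NULL) is rejected before its private data is ever dereferenced. Swapping the two checks would reproduce the class of fault shown in the traces, under the assumption stated above.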