[PATCH 00/13] ext4: optimize online defragment

Zhang Yi posted 13 patches 1 week, 2 days ago
fs/ext4/ext4.h              |   3 +
fs/ext4/extents.c           |   2 +-
fs/ext4/extents_status.c    |  27 +-
fs/ext4/extents_status.h    |   2 +-
fs/ext4/inode.c             |  28 +-
fs/ext4/ioctl.c             |  10 -
fs/ext4/move_extent.c       | 773 ++++++++++++++++--------------------
fs/ext4/super.c             |   1 +
include/trace/events/ext4.h |  97 ++++-
9 files changed, 486 insertions(+), 457 deletions(-)
Posted by Zhang Yi 1 week, 2 days ago
From: Zhang Yi <yi.zhang@huawei.com>

Hello!

Currently, online defragmentation in ext4 is primarily implemented
through the move extent operation in the kernel. This operation works
at PAGE_SIZE granularity, iteratively swapping extents and moving data,
which is quite inefficient. Since ext4 now supports large folios,
iterating at PAGE_SIZE granularity is no longer practical and fails to
leverage their advantages. Additionally, the current implementation is
tightly coupled with buffer_head, so it cannot be supported once the
buffered I/O path is converted to the iomap infrastructure.

This patch set (based on 6.17-rc7) optimizes the extent-moving process,
deprecates the old move_extent_per_page() interface, and introduces a
new mext_move_extent() interface. The new interface iterates over and
copies data based on the extents of the original file instead of
PAGE_SIZE, and supports large folios. The data processing logic within
each iteration remains largely consistent with the previous
implementation, with no additional optimizations or changes.
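
To illustrate why the iteration unit matters, here is a minimal
userspace sketch that counts loop iterations under the two schemes. It
is not the kernel code: struct extent_span, steps_per_page() and
steps_per_extent() are hypothetical names, and the per-extent variant
simply assumes each step may cover up to max_step bytes (for example,
the largest folio size) starting at the extent's first byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one contiguous extent, in bytes. */
struct extent_span {
	uint64_t start;	/* byte offset in the file */
	uint64_t len;	/* length in bytes, must be > 0 */
};

/* Iterations needed when each step handles at most one page. */
uint64_t steps_per_page(const struct extent_span *ext, size_t n,
			uint64_t page_size)
{
	uint64_t steps = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t first = ext[i].start / page_size;
		uint64_t last = (ext[i].start + ext[i].len - 1) / page_size;

		steps += last - first + 1;
	}
	return steps;
}

/*
 * Iterations needed when each step handles a whole extent, split only
 * when the extent is longer than max_step bytes (an assumed cap).
 */
uint64_t steps_per_extent(const struct extent_span *ext, size_t n,
			  uint64_t max_step)
{
	uint64_t steps = 0;

	for (size_t i = 0; i < n; i++)
		steps += (ext[i].len + max_step - 1) / max_step;
	return steps;
}
```

For a file made of a 1 MiB extent followed by a 64 KiB extent, the
per-page scheme with 4 KiB pages needs 272 iterations, while the
per-extent scheme with a 2 MiB cap needs 2.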

Additionally, the primary objective of this set of patches is to prepare
for converting the buffered I/O process for regular files to the iomap
infrastructure. These patches decouple the buffer_head from the main
extent-moving process, restricting its use to only the helpers
mext_folio_mkwrite() and mext_folio_mkuptodate(), which handle updating
and marking pages in the swapped page cache as dirty. The overall coding
style of the extent-moving process aligns with the iomap infrastructure,
laying the foundation for supporting online defragmentation once the
iomap infrastructure is adopted.

Patch overview:

Patch 1:    Fix an off-by-one issue.
Patch 2:    Fix a minor issue related to validity checking.
Patch 3-5:  Introduce a sequence counter for the mapping extent status
            tree; this also prepares for the iomap infrastructure.
Patch 6-8:  Refactor the mext_check_arguments() helper function and the
            validity checking to improve code readability.
Patch 9-13: Drop move_extent_per_page() and switch to using the new
            mext_move_extent(). Additionally, add support for large
            folios.
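
The sequence counter mentioned for patches 3-5 follows a common
pattern: a lookup returns the counter value it observed, and a later
consumer compares that value against the current one to detect that
the cached mapping went stale in between. The following is a
single-threaded userspace sketch of that general pattern only. All
names (extent_cache, cache_lookup(), lookup_still_valid(), ...) are
hypothetical, locking is omitted, and the actual patches may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached mapping: one logical-to-physical block range. */
struct cached_extent {
	uint64_t lblk;
	uint64_t pblk;
	uint64_t len;
};

/* Hypothetical per-inode cache with a modification counter. */
struct extent_cache {
	struct cached_extent ext;
	uint64_t seq;	/* bumped on every change to the cache */
};

/* What a lookup hands back: the mapping plus the counter it saw. */
struct lookup_result {
	struct cached_extent ext;
	uint64_t seq;
};

/* Any modification of the cache advances the counter. */
void cache_update(struct extent_cache *c, struct cached_extent ext)
{
	c->ext = ext;
	c->seq++;
}

/* Pass out the mapping together with the current counter value. */
struct lookup_result cache_lookup(const struct extent_cache *c)
{
	struct lookup_result r = { .ext = c->ext, .seq = c->seq };

	return r;
}

/* A result is still usable only if nothing changed since the lookup. */
bool lookup_still_valid(const struct extent_cache *c,
			const struct lookup_result *r)
{
	return c->seq == r->seq;
}
```

A caller would look up a mapping, do work that may sleep (for example
locking folios), then call lookup_still_valid() and retry the lookup if
it returns false.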

With this patch set, the efficiency of online defragmentation on ext4
is also improved in the general case. Below is a set of typical test
results obtained using the fio e4defrag ioengine on a machine with an
Intel Xeon Gold 6240 CPU, 400G of memory and an NVMe SSD device.

  [defrag]
  directory=/mnt
  filesize=400G
  buffered=1
  fadvise_hint=0
  ioengine=e4defrag
  bs=4k         # 4k,32k,128k
  donorname=test.def
  filename=test
  inplace=0
  rw=write
  overwrite=0   # 0 for unwritten extent and 1 for written extent
  numjobs=1
  iodepth=1
  runtime=30s

  [w/o]
   U 4k:    IOPS=225k,  BW=877MiB/s      # U: unwritten extent-moving
   U 32k:   IOPS=33.2k, BW=1037MiB/s
   U 128k:  IOPS=8510,  BW=1064MiB/s
   M 4k:    IOPS=19.8k, BW=77.2MiB/s     # M: written extent-moving
   M 32k:   IOPS=2502,  BW=78.2MiB/s
   M 128k:  IOPS=635,   BW=79.5MiB/s

  [w]
   U 4k:    IOPS=246k,  BW=963MiB/s
   U 32k:   IOPS=209k,  BW=6529MiB/s
   U 128k:  IOPS=146k,  BW=17.8GiB/s
   M 4k:    IOPS=19.5k, BW=76.2MiB/s
   M 32k:   IOPS=4091,  BW=128MiB/s
   M 128k:  IOPS=2814,  BW=352MiB/s 
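
As a reading aid for the numbers above, bandwidth is IOPS multiplied by
the block size, and the gain is the ratio of the [w] figure to the
[w/o] figure. A small sketch of that arithmetic (the helper names are
made up for illustration):

```c
/* Bandwidth in MiB/s from IOPS and a block size given in KiB. */
double bw_mib_per_s(double iops, double block_kib)
{
	return iops * block_kib / 1024.0;
}

/* Ratio of the patched result to the unpatched one. */
double speedup(double without_patches, double with_patches)
{
	return with_patches / without_patches;
}
```

For example, bw_mib_per_s(8510, 128) gives 1063.75, matching the
reported 1064MiB/s for "U 128k" without the patches. Taking the IOPS
columns, speedup(8510, 146000) is about 17x for unwritten extents at
128k, and speedup(635, 2814) is about 4.4x for written extents at 128k.
The 4k cases are essentially unchanged.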


Best Regards,
Yi.


Zhang Yi (13):
  ext4: fix an off-by-one issue during moving extents
  ext4: correct the checking of quota files before moving extents
  ext4: introduce seq counter for the extent status entry
  ext4: make ext4_es_lookup_extent() pass out the extent seq counter
  ext4: pass out extent seq counter when mapping blocks
  ext4: use EXT4_B_TO_LBLK() in mext_check_arguments()
  ext4: add mext_check_validity() to do basic check
  ext4: refactor mext_check_arguments()
  ext4: rename mext_page_mkuptodate() to mext_folio_mkuptodate()
  ext4: introduce mext_move_extent()
  ext4: switch to using the new extent movement method
  ext4: add large folios support for moving extents
  ext4: add two trace points for moving extents

-- 
2.46.1
[syzbot ci] Re: ext4: optimize online defragment
Posted by syzbot ci 1 week, 1 day ago
syzbot ci has tested the following series

[v1] ext4: optimize online defragment
https://lore.kernel.org/all/20250923012724.2378858-1-yi.zhang@huaweicloud.com
* [PATCH 01/13] ext4: fix an off-by-one issue during moving extents
* [PATCH 02/13] ext4: correct the checking of quota files before moving extents
* [PATCH 03/13] ext4: introduce seq counter for the extent status entry
* [PATCH 04/13] ext4: make ext4_es_lookup_extent() pass out the extent seq counter
* [PATCH 05/13] ext4: pass out extent seq counter when mapping blocks
* [PATCH 06/13] ext4: use EXT4_B_TO_LBLK() in mext_check_arguments()
* [PATCH 07/13] ext4: add mext_check_validity() to do basic check
* [PATCH 08/13] ext4: refactor mext_check_arguments()
* [PATCH 09/13] ext4: rename mext_page_mkuptodate() to mext_folio_mkuptodate()
* [PATCH 10/13] ext4: introduce mext_move_extent()
* [PATCH 11/13] ext4: switch to using the new extent movement method
* [PATCH 12/13] ext4: add large folios support for moving extents
* [PATCH 13/13] ext4: add two trace points for moving extents

and found the following issues:
* KASAN: slab-out-of-bounds Read in ext4_inode_journal_mode
* general protection fault in ext4_inode_journal_mode

Full report is available here:
https://ci.syzbot.org/series/89adca9b-1e59-47cd-8ba6-0a57d76309c9

***

KASAN: slab-out-of-bounds Read in ext4_inode_journal_mode

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      07e27ad16399afcd693be20211b0dfae63e0615f
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/17d2b187-99c8-4493-9c72-e8fcf7741d20/config
C repro:   https://ci.syzbot.org/findings/b98c412d-c481-4663-b80b-a50550db3406/c_repro
syz repro: https://ci.syzbot.org/findings/b98c412d-c481-4663-b80b-a50550db3406/syz_repro

EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: writeback.
ext4 filesystem being mounted at /0/bus supports timestamps until 2038-01-19 (0x7fffffff)
==================================================================
BUG: KASAN: slab-out-of-bounds in ext4_inode_journal_mode+0x7b/0x480 fs/ext4/ext4_jbd2.c:12
Read of size 8 at addr ffff88801cefc378 by task syz.0.17/5984

CPU: 0 UID: 0 PID: 5984 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xca/0x240 mm/kasan/report.c:482
 kasan_report+0x118/0x150 mm/kasan/report.c:595
 ext4_inode_journal_mode+0x7b/0x480 fs/ext4/ext4_jbd2.c:12
 ext4_should_journal_data fs/ext4/ext4_jbd2.h:381 [inline]
 mext_check_validity fs/ext4/move_extent.c:426 [inline]
 ext4_move_extents+0x2bb/0x3630 fs/ext4/move_extent.c:579
 __ext4_ioctl fs/ext4/ioctl.c:1356 [inline]
 ext4_ioctl+0x26a7/0x33c0 fs/ext4/ioctl.c:1616
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:598 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:584
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f6a6678ec29
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffea3688b38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f6a669d5fa0 RCX: 00007f6a6678ec29
RDX: 0000200000000040 RSI: 00000000c028660f RDI: 0000000000000004
RBP: 00007f6a66811e41 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f6a669d5fa0 R14: 00007f6a669d5fa0 R15: 0000000000000003
 </TASK>

Allocated by task 1:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:388 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:405
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4407
 kmalloc_noprof include/linux/slab.h:905 [inline]
 kzalloc_noprof include/linux/slab.h:1039 [inline]
 shmem_fill_super+0xc8/0x1190 mm/shmem.c:5059
 vfs_get_super fs/super.c:1325 [inline]
 get_tree_nodev+0xbb/0x150 fs/super.c:1344
 vfs_get_tree+0x92/0x2b0 fs/super.c:1815
 fc_mount fs/namespace.c:1247 [inline]
 vfs_kern_mount+0xbe/0x160 fs/namespace.c:1286
 devtmpfs_init+0x98/0x330 drivers/base/devtmpfs.c:484
 driver_init+0x15/0x60 drivers/base/init.c:25
 do_basic_setup+0xf/0x70 init/main.c:1363
 kernel_init_freeable+0x334/0x4b0 init/main.c:1579
 kernel_init+0x1d/0x1d0 init/main.c:1469
 ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

The buggy address belongs to the object at ffff88801cefc000
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 544 bytes to the right of
 allocated 344-byte region [ffff88801cefc000, ffff88801cefc158)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1cefc
head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000040 ffff88801a441c80 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000100010 00000000f5000000 0000000000000000
head: 00fff00000000040 ffff88801a441c80 dead000000000122 0000000000000000
head: 0000000000000000 0000000000100010 00000000f5000000 0000000000000000
head: 00fff00000000002 ffffea000073bf01 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000004
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 1, tgid 1 (swapper/0), ts 1877776345, free_ts 0
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851
 prep_new_page mm/page_alloc.c:1859 [inline]
 get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858
 __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5148
 alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416
 alloc_slab_page mm/slub.c:2492 [inline]
 allocate_slab+0x8a/0x370 mm/slub.c:2660
 new_slab mm/slub.c:2714 [inline]
 ___slab_alloc+0xbeb/0x1420 mm/slub.c:3901
 __slab_alloc mm/slub.c:3992 [inline]
 __slab_alloc_node mm/slub.c:4067 [inline]
 slab_alloc_node mm/slub.c:4228 [inline]
 __kmalloc_cache_noprof+0x296/0x3d0 mm/slub.c:4402
 kmalloc_noprof include/linux/slab.h:905 [inline]
 kzalloc_noprof include/linux/slab.h:1039 [inline]
 shmem_fill_super+0xc8/0x1190 mm/shmem.c:5059
 vfs_get_super fs/super.c:1325 [inline]
 get_tree_nodev+0xbb/0x150 fs/super.c:1344
 vfs_get_tree+0x92/0x2b0 fs/super.c:1815
 fc_mount fs/namespace.c:1247 [inline]
 vfs_kern_mount+0xbe/0x160 fs/namespace.c:1286
 devtmpfs_init+0x98/0x330 drivers/base/devtmpfs.c:484
 driver_init+0x15/0x60 drivers/base/init.c:25
 do_basic_setup+0xf/0x70 init/main.c:1363
 kernel_init_freeable+0x334/0x4b0 init/main.c:1579
 kernel_init+0x1d/0x1d0 init/main.c:1469
page_owner free stack trace missing

Memory state around the buggy address:
 ffff88801cefc200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88801cefc280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88801cefc300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                                ^
 ffff88801cefc380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88801cefc400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
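
For readers following the trace: RSI at syscall entry holds the ioctl
request number, 0xc028660f. The sketch below splits it into its fields,
assuming the generic Linux ioctl encoding used on x86 (8-bit number,
8-bit type, 14-bit size, 2-bit direction). The struct and function
names are hypothetical.

```c
#include <stdint.h>

/* Fields packed into a Linux ioctl request number (generic layout). */
struct ioctl_fields {
	unsigned int dir;	/* 1 = write, 2 = read, 3 = read and write */
	unsigned int size;	/* size of the argument structure in bytes */
	unsigned int type;	/* "magic" character of the subsystem */
	unsigned int nr;	/* command number within that type */
};

struct ioctl_fields decode_ioctl(uint32_t cmd)
{
	struct ioctl_fields f;

	f.nr = cmd & 0xff;
	f.type = (cmd >> 8) & 0xff;
	f.size = (cmd >> 16) & 0x3fff;
	f.dir = (cmd >> 30) & 0x3;
	return f;
}
```

decode_ioctl(0xc028660f) yields dir 3, size 40, type 'f' and nr 15,
which is consistent with EXT4_IOC_MOVE_EXT, i.e. _IOWR('f', 15, struct
move_extent) with a 40-byte argument, and with the ext4_move_extents()
frame in the call trace.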


***

general protection fault in ext4_inode_journal_mode

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      07e27ad16399afcd693be20211b0dfae63e0615f
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/17d2b187-99c8-4493-9c72-e8fcf7741d20/config
syz repro: https://ci.syzbot.org/findings/9f9fdff9-ee39-4921-9a7a-35ab05cc081b/syz_repro

EXT4-fs (loop1): mounted filesystem 76b65be2-f6da-4727-8c75-0525a5b65a09 r/w without journal. Quota mode: none.
ext4 filesystem being mounted at /0/mnt supports timestamps until 2038-01-19 (0x7fffffff)
Oops: general protection fault, probably for non-canonical address 0xdffffc000000006f: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000378-0x000000000000037f]
CPU: 0 UID: 0 PID: 6013 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:ext4_inode_journal_mode+0x6d/0x480 fs/ext4/ext4_jbd2.c:12
Code: 00 4d 03 7d 00 4c 89 f8 48 c1 e8 03 80 3c 28 00 74 08 4c 89 ff e8 03 9e b6 ff 41 bc 78 03 00 00 4d 03 27 4c 89 e0 48 c1 e8 03 <80> 3c 28 00 74 08 4c 89 e7 e8 e5 9d b6 ff 49 83 3c 24 00 0f 84 01
RSP: 0018:ffffc90002d6f638 EFLAGS: 00010206
RAX: 000000000000006f RBX: ffff88811249ad48 RCX: ffff888021ed9cc0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88811249ad48
RBP: dffffc0000000000 R08: ffff88811249ae2f R09: 1ffff110224935c5
R10: dffffc0000000000 R11: ffffed10224935c6 R12: 0000000000000378
R13: ffff88811249ad70 R14: 1ffff110224935ae R15: ffff88801bfa6640
FS:  00007fa922e456c0(0000) GS:ffff8880b8612000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000000040 CR3: 000000010f5f6000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 ext4_should_journal_data fs/ext4/ext4_jbd2.h:381 [inline]
 mext_check_validity fs/ext4/move_extent.c:426 [inline]
 ext4_move_extents+0x2bb/0x3630 fs/ext4/move_extent.c:579
 __ext4_ioctl fs/ext4/ioctl.c:1356 [inline]
 ext4_ioctl+0x26a7/0x33c0 fs/ext4/ioctl.c:1616
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:598 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:584
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa921f8ec29
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fa922e45038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fa9221d5fa0 RCX: 00007fa921f8ec29
RDX: 0000200000000040 RSI: 00000000c028660f RDI: 0000000000000004
RBP: 00007fa922011e41 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fa9221d6038 R14: 00007fa9221d5fa0 R15: 00007ffcaaedca68
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:ext4_inode_journal_mode+0x6d/0x480 fs/ext4/ext4_jbd2.c:12
Code: 00 4d 03 7d 00 4c 89 f8 48 c1 e8 03 80 3c 28 00 74 08 4c 89 ff e8 03 9e b6 ff 41 bc 78 03 00 00 4d 03 27 4c 89 e0 48 c1 e8 03 <80> 3c 28 00 74 08 4c 89 e7 e8 e5 9d b6 ff 49 83 3c 24 00 0f 84 01
RSP: 0018:ffffc90002d6f638 EFLAGS: 00010206
RAX: 000000000000006f RBX: ffff88811249ad48 RCX: ffff888021ed9cc0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88811249ad48
RBP: dffffc0000000000 R08: ffff88811249ae2f R09: 1ffff110224935c5
R10: dffffc0000000000 R11: ffffed10224935c6 R12: 0000000000000378
R13: ffff88811249ad70 R14: 1ffff110224935ae R15: ffff88801bfa6640
FS:  00007fa922e456c0(0000) GS:ffff8881a3c12000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002000000012c0 CR3: 000000010f5f6000 CR4: 00000000000006f0
----------------
Code disassembly (best guess):
   0:	00 4d 03             	add    %cl,0x3(%rbp)
   3:	7d 00                	jge    0x5
   5:	4c 89 f8             	mov    %r15,%rax
   8:	48 c1 e8 03          	shr    $0x3,%rax
   c:	80 3c 28 00          	cmpb   $0x0,(%rax,%rbp,1)
  10:	74 08                	je     0x1a
  12:	4c 89 ff             	mov    %r15,%rdi
  15:	e8 03 9e b6 ff       	call   0xffb69e1d
  1a:	41 bc 78 03 00 00    	mov    $0x378,%r12d
  20:	4d 03 27             	add    (%r15),%r12
  23:	4c 89 e0             	mov    %r12,%rax
  26:	48 c1 e8 03          	shr    $0x3,%rax
* 2a:	80 3c 28 00          	cmpb   $0x0,(%rax,%rbp,1) <-- trapping instruction
  2e:	74 08                	je     0x38
  30:	4c 89 e7             	mov    %r12,%rdi
  33:	e8 e5 9d b6 ff       	call   0xffb69e1d
  38:	49 83 3c 24 00       	cmpq   $0x0,(%r12)
  3d:	0f                   	.byte 0xf
  3e:	84 01                	test   %al,(%rcx)
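
The two reports can be related with a little address arithmetic. The
sketch below assumes the x86-64 KASAN shadow mapping, shadow = (addr >>
3) + 0xdffffc0000000000, where the constant is the value visible in RBP
in the register dump. The helper names are hypothetical.

```c
#include <stdint.h>

/* x86-64 KASAN shadow parameters (assumed; offset matches RBP above). */
#define KASAN_SHADOW_OFFSET	0xdffffc0000000000ULL
#define KASAN_SHADOW_SHIFT	3

/* Shadow byte address that KASAN checks before accessing addr. */
uint64_t kasan_shadow_addr(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* Offset of an access relative to the start of its slab object. */
uint64_t offset_in_object(uint64_t access, uint64_t base)
{
	return access - base;
}
```

kasan_shadow_addr(0x378) is 0xdffffc000000006f, the non-canonical
address in the general protection fault, so the second report is a read
at offset 0x378 from a NULL pointer. In the first report,
offset_in_object(0xffff88801cefc378, 0xffff88801cefc000) is also 0x378,
which is 544 bytes past the end of the 344-byte object allocated in
shmem_fill_super(). One plausible reading is that both reports hit the
same field access in ext4_inode_journal_mode(), once through a NULL
pointer and once through a pointer to an object that is not the
structure ext4 expects.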


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.