[PATCH v3 00/22] ext4: use iomap for regular file's buffered I/O path

Zhang Yi posted 22 patches 1 month, 3 weeks ago
There is a newer version of this series
From: Zhang Yi <yi.zhang@huawei.com>

This series adds iomap buffered I/O path support for regular files,
based on the latest upstream kernel. It implements the core iomap APIs
on ext4 and introduces the 'buffered_iomap' mount option to enable the
iomap buffered I/O path. It supports the default features, the default
mount options, and the bigalloc feature. It does not yet support online
defragmentation, inline data, fsverity, fscrypt, non-extent inodes, or
data=journal mode; when any of these features or options is in use, it
falls back to the buffer_head I/O path automatically.

This iomap buffered I/O path is not enabled by default because the
preceding features are not supported. Users can explicitly enable or
disable it via 'buffered_iomap' and 'nobuffered_iomap' mount options.
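
To make the fallback rule above concrete, here is a minimal userspace
sketch of the eligibility check. The struct and function names are
hypothetical and are not taken from the patches; the real check operates
on the ext4 inode and superblock state. Online defragmentation is not
modeled here.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the mount and inode state. */
struct inode_state {
	bool mount_buffered_iomap; /* 'buffered_iomap' mount option given */
	bool is_regular_file;
	bool uses_extents;         /* false for non-extent inodes */
	bool has_inline_data;
	bool is_verity;            /* fsverity enabled on the inode */
	bool is_encrypted;         /* fscrypt enabled on the inode */
	bool data_journal;         /* data=journal mode in effect */
};

/*
 * Return true if the inode may use the iomap buffered I/O path, false
 * if it has to fall back to the buffer_head path.
 */
static bool can_use_buffered_iomap(const struct inode_state *s)
{
	if (!s->mount_buffered_iomap || !s->is_regular_file)
		return false;
	if (!s->uses_extents || s->has_inline_data)
		return false;
	if (s->is_verity || s->is_encrypted || s->data_journal)
		return false;
	return true;
}
```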

Key notes
=========

1. Lock ordering difference

   The lock ordering of folio lock and transaction start in the iomap
   path is the opposite of that in the buffer_head path.

2. data=ordered mode is not used

   Two main reasons:
   a) The lock ordering of folio lock and transaction start for
      data=ordered mode is opposite to the iomap path, which would cause
      a deadlock.
   b) The iomap writeback path does not support partial folio submission
      (required by data=ordered mode when block size < folio size, and
      it is currently handled by ext4_bio_write_folio()), which would
      also cause a deadlock.

   To replace data=ordered mode functionality:

   - For append write: Always allocate unwritten extents (dioread_nolock
     behavior) to prevent stale data exposure.

   - For post-EOF partial block zeroing: Issue zeroing I/O immediately
     and wait for completion before updating i_disksize. On ordered I/O
     completion, set i_disksize = i_size to avoid lost updates in the
     truncate-up case (suggested by Jan).

   - For online defragmentation: Not supported yet, needs further
     consideration.
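
   As an illustration of the post-EOF partial block case above, the
   following userspace sketch computes how many bytes lie between the
   file size and the end of the block that contains it. This is the
   range I assume has to be zeroed and written out before i_disksize
   moves. The helper name is hypothetical and does not appear in the
   patches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of bytes between 'size' and the end of the block containing
 * it. Returns 0 when 'size' is block aligned, i.e. there is no partial
 * block to zero. 'block_size' must be a non-zero power of two.
 */
static uint64_t post_eof_zero_len(uint64_t size, uint32_t block_size)
{
	uint64_t offset;

	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

	/* Offset of 'size' within its block. */
	offset = size & (uint64_t)(block_size - 1);
	return offset ? block_size - offset : 0;
}
```

   For example, with a 4096-byte block size and a file size of 5000
   bytes, the range to zero starts at byte 5000 and runs to the end of
   the second block.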

3. Always enable dioread_nolock

   Two main reasons:
   a) Since data=ordered mode cannot be used, allocating written blocks
      directly would expose stale data.
   b) To optimize writeback, we should allocate blocks based on writeback
      length rather than per-folio mapping. Direct written allocation
      would over-allocate blocks.

   dioread_nolock has been the default mount option for many years, and
   Jan pointed out that we may no longer need the ability to disable it,
   so this mount option can be gradually removed in the future.

Series structure
================

 - Patch 01-03: Simplify truncate operations and prepare for conversion.
 - Patch 04-18: Implement core iomap buffered read/write, writeback,
                mmap, and partial block zeroing paths.
 - Patch 19-22: Handle ordered I/O for zeroing post-EOF partial block.

Testing and Performance
=======================

Tested with xfstests-bld using -g auto, fast_commit, and 64k
configurations. No new regressions were observed.

For the special case of zeroing the post-EOF partial block, I added a
new test, generic/790, to cover this scenario.

  https://lore.kernel.org/fstests/20260422015246.4132376-1-yi.zhang@huaweicloud.com/

Performance was tested with fio on a 150 GB memory-backed virtual machine
(not much difference compared to v2, so the numbers are not updated):

 Buffered write (MiB/s)
 ===

  bs       write cache    uncached write
           bh     iomap   bh      iomap
  1k       423    403     36.3    57
  4k       1067   1093    58.4    61
  64k      4321   6488    869     1206
  1M       4640   7378    3158    4818
  
 Buffered read (MiB/s)
 ===

  bs       read hole        read pre-cache     read ondisk data
           bh     iomap     bh     iomap       bh      iomap
  1k       635    643       661    653         605     602 
  4k       1987   2075      2128   2159        1761    1716
  64k      6068   6267      9472   9545        4475    4451 
  1M       5471   6072      8657   9191        4405    4467 

Large I/O write performance improved by approximately 30% to 50%.
Read performance showed no significant difference.
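
For anyone who wants to recompute the relative change for a given row,
a small helper along these lines is enough. It is only a convenience
sketch; the function name is made up for illustration.

```c
/*
 * Relative change from the buffer_head (bh) figure to the iomap
 * figure, in percent. Positive values mean iomap is faster.
 */
static double pct_change(double bh, double iomap)
{
	return (iomap - bh) / bh * 100.0;
}
```

For instance, pct_change(4321, 6488) evaluates the 64k cached-write row
of the buffered write table above.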

Changes since v2:
 - Rebased on the latest upstream kernel (7.1-rc1).
 - Added patches 01-03 to simplify truncate operations.
 - Added patch 13 to fix incorrect did_zero parameter in
   iomap_zero_range().
 - Added patches 19-22 to handle ordered I/O for zeroing post-EOF
   partial block.
 - Minor code and comment optimizations.

Changes since v1:
 - Rebase this series on linux-next 20260122.
 - Refactor partial block zero range, stop passing handle to
   ext4_block_truncate_page() and ext4_zero_partial_blocks(), and move
   partial block zeroing operation outside an active journal transaction
   to prevent potential deadlocks because of the lock ordering of folio
   and transaction start.
 - Clarify the lock ordering of folio lock and transaction start, update
   the comments accordingly.
 - Fix some issues related to fast commit and to polluting the post-EOF
   folio.
 - Some minor code and comment optimizations.

v2:     https://lore.kernel.org/linux-ext4/20260203062523.3869120-1-yi.zhang@huawei.com/
v1:     https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/
RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/

Comments and suggestions are welcome!

Thanks,
Yi.


Zhang Yi (22):
  ext4: simplify size updating in ext4_setattr()
  ext4: factor out ext4_truncate_[up|down]()
  ext4: simplify error handling in ext4_setattr()
  ext4: add iomap address space operations for buffered I/O
  ext4: implement buffered read path using iomap
  ext4: pass out extent seq counter when mapping da blocks
  ext4: do not use data=ordered mode for inodes using buffered iomap
    path
  ext4: implement buffered write path using iomap
  ext4: implement writeback path using iomap
  ext4: implement mmap path using iomap
  iomap: correct the range of a partial dirty clear
  iomap: support invalidating partial folios
  iomap: fix incorrect did_zero setting in iomap_zero_iter()
  ext4: implement partial block zero range path using iomap
  ext4: add block mapping tracepoints for iomap buffered I/O path
  ext4: disable online defrag when inode using iomap buffered I/O path
  ext4: partially enable iomap for the buffered I/O path of regular
    files
  ext4: introduce a mount option for iomap buffered I/O path
  ext4: submit zeroed post-EOF data immediately in the iomap buffered
    I/O path
  ext4: wait for ordered I/O in the iomap buffered I/O path
  ext4: update i_disksize to i_size on ordered I/O completion
  ext4: add tracepoints for ordered I/O in the iomap buffered I/O path

 fs/ext4/ext4.h              |  73 ++-
 fs/ext4/ext4_jbd2.c         |   1 +
 fs/ext4/ext4_jbd2.h         |   7 +-
 fs/ext4/extents.c           |   9 +-
 fs/ext4/file.c              |  20 +-
 fs/ext4/ialloc.c            |   1 +
 fs/ext4/inode.c             | 911 +++++++++++++++++++++++++++++++-----
 fs/ext4/move_extent.c       |  11 +
 fs/ext4/page-io.c           | 203 ++++++++
 fs/ext4/super.c             |  55 ++-
 fs/iomap/buffered-io.c      |  20 +-
 include/trace/events/ext4.h | 142 ++++++
 12 files changed, 1313 insertions(+), 140 deletions(-)

-- 
2.52.0