[PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Zhang Yi 4 days, 6 hours ago
From: Zhang Yi <yi.zhang@huaweicloud.com>

Changes since V1:
 - Rebase this series on linux-next 20260122.
 - Refactor partial block zero range, stop passing handle to
   ext4_block_truncate_page() and ext4_zero_partial_blocks(), and move
   partial block zeroing operation outside an active journal transaction
   to prevent potential deadlocks because of the lock ordering of folio
   and transaction start.
 - Clarify the lock ordering of folio lock and transaction start, update
   the comments accordingly.
 - Fix some issues related to fast commit and polluted post-EOF folios.
 - Some minor code and comments optimizations.

v1:     https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/
RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/

Original Cover (Updated):

This series adds iomap buffered I/O path support for regular files.
It implements the core iomap APIs in ext4 and introduces two mount
options, 'buffered_iomap' and 'nobuffered_iomap', to enable and
disable the iomap buffered I/O path. This series supports the default
features, the default mount options, and the bigalloc feature of
ext4. Online defragmentation, inline data, fsverity, fscrypt,
non-extent files, and data=journal mode are not yet supported; ext4
falls back to the buffer_head I/O path automatically when these
features or options are used.
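
For example (the device and mount point below are placeholders):

    # enable the iomap buffered I/O path
    mount -o buffered_iomap /dev/vdb /mnt/test
    # switch back to the buffer_head path
    mount -o nobuffered_iomap /dev/vdb /mnt/test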

Key notes on the iomap implementation in this series:
 - Don't use ordered data mode, to avoid exposing stale data when
   performing append writes or truncating down.
 - Override the dioread_nolock mount option and always allocate
   unwritten extents for new blocks.
 - When performing writeback, don't use a reserved journal handle,
   and postpone updating i_disksize until the I/O is done.
 - The lock ordering of the folio lock and transaction start is the
   opposite of that in the buffer_head buffered write path.

Series details:

Patch 01-08: Refactor the partial block zeroing operation, move it out
             of an active running journal transaction, and handle
             post-EOF partial block zeroing properly.
Patch 09-21: Implement the core iomap buffered read path, write path,
             dirty folio writeback path, mmap path, and partial block
             zeroing path for ext4 regular files.
Patch 22:    Introduce the 'buffered_iomap' and 'nobuffered_iomap'
             mount options to enable and disable the iomap buffered
             I/O path.

Tests and Performance:

I tested this series using xfstests-bld with auto configurations, as
well as fast_commit and 64k configurations. No new regressions were
observed.

I used fio to test this series on a virtual machine with a 150 GB
memory-backed disk and found an improvement of approximately 30% to
50% in large I/O write performance, while read performance showed no
significant difference.
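
The exact fio job files are not included here; a representative
sequential buffered write job would look something like the following
(file path and size are placeholders; 'bs' corresponds to the tables
below):

    [seq-write]
    rw=write
    ioengine=psync
    bs=64k
    size=16G
    filename=/mnt/test/fio-file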

 buffered write
 ==============

  buffer_head:
  bs      write cache    uncached write
  1k       423  MiB/s      36.3 MiB/s
  4k       1067 MiB/s      58.4 MiB/s
  64k      4321 MiB/s      869  MiB/s
  1M       4640 MiB/s      3158 MiB/s
  
  iomap:
  bs      write cache    uncached write
  1k       403  MiB/s      57   MiB/s
  4k       1093 MiB/s      61   MiB/s
  64k      6488 MiB/s      1206 MiB/s
  1M       7378 MiB/s      4818 MiB/s

 buffered read
 =============

  buffer_head:
  bs      read hole   read cache      read data
  1k       635  MiB/s    661  MiB/s    605  MiB/s
  4k       1987 MiB/s    2128 MiB/s    1761 MiB/s
  64k      6068 MiB/s    9472 MiB/s    4475 MiB/s
  1M       5471 MiB/s    8657 MiB/s    4405 MiB/s

  iomap:
  bs      read hole   read cache      read data
  1k       643  MiB/s    653  MiB/s    602  MiB/s
  4k       2075 MiB/s    2159 MiB/s    1716 MiB/s
  64k      6267 MiB/s    9545 MiB/s    4451 MiB/s
  1M       6072 MiB/s    9191 MiB/s    4467 MiB/s

Comments and suggestions are welcome!

Thanks,
Yi.


Zhang Yi (22):
  ext4: make ext4_block_zero_page_range() pass out did_zero
  ext4: make ext4_block_truncate_page() return zeroed length
  ext4: only order data when partially block truncating down
  ext4: factor out journalled block zeroing range
  ext4: stop passing handle to ext4_journalled_block_zero_range()
  ext4: don't zero partial block under an active handle when truncating
    down
  ext4: move ext4_block_zero_page_range() out of an active handle
  ext4: zero post EOF partial block before appending write
  ext4: add a new iomap aops for regular file's buffered IO path
  ext4: implement buffered read iomap path
  ext4: pass out extent seq counter when mapping da blocks
  ext4: implement buffered write iomap path
  ext4: implement writeback iomap path
  ext4: implement mmap iomap path
  iomap: correct the range of a partial dirty clear
  iomap: support invalidating partial folios
  ext4: implement partial block zero range iomap path
  ext4: do not order data for inodes using buffered iomap path
  ext4: add block mapping tracepoints for iomap buffered I/O path
  ext4: disable online defrag when inode using iomap buffered I/O path
  ext4: partially enable iomap for the buffered I/O path of regular
    files
  ext4: introduce a mount option for iomap buffered I/O path

 fs/ext4/ext4.h              |  21 +-
 fs/ext4/ext4_jbd2.c         |   1 +
 fs/ext4/ext4_jbd2.h         |   7 +-
 fs/ext4/extents.c           |  31 +-
 fs/ext4/file.c              |  40 +-
 fs/ext4/ialloc.c            |   1 +
 fs/ext4/inode.c             | 822 ++++++++++++++++++++++++++++++++----
 fs/ext4/move_extent.c       |  11 +
 fs/ext4/page-io.c           | 119 ++++++
 fs/ext4/super.c             |  32 +-
 fs/iomap/buffered-io.c      |  12 +-
 include/trace/events/ext4.h |  45 ++
 12 files changed, 1033 insertions(+), 109 deletions(-)

-- 
2.52.0
Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Christoph Hellwig 4 days, 6 hours ago
> Original Cover (Updated):

This really should always be first.  The updates are rather minor
compared to the overview that the cover letter provides.

> Key notes on the iomap implementations in this series.
>  - Don't use ordered data mode to prevent exposing stale data when
>    performing append write and truncating down.

I can't parse this.

>  - Override dioread_nolock mount option, always allocate unwritten
>    extents for new blocks.

Why do you override it?

>  - When performing write back, don't use reserved journal handle and
>    postponing updating i_disksize until I/O is done.

Again missing the why and the implications.

>  buffered write
>  ==============
> 
>   buffer_head:
>   bs      write cache    uncached write
>   1k       423  MiB/s      36.3 MiB/s
>   4k       1067 MiB/s      58.4 MiB/s
>   64k      4321 MiB/s      869  MiB/s
>   1M       4640 MiB/s      3158 MiB/s
>   
>   iomap:
>   bs      write cache    uncached write
>   1k       403  MiB/s      57   MiB/s
>   4k       1093 MiB/s      61   MiB/s
>   64k      6488 MiB/s      1206 MiB/s
>   1M       7378 MiB/s      4818 MiB/s

This would read better if you actually compared buffer_head
vs iomap side by side.

What is the bs?  The read unit size?  I guess not the file system
block size as some of the values are too large for that.

Looks like iomap is faster, often much faster except for the
1k cached case, where it is slightly slower.  Do you have
any idea why?

>  buffered read
>  =============
> 
>   buffer_head:
>   bs      read hole   read cache      read data
>   1k       635  MiB/s    661  MiB/s    605  MiB/s
>   4k       1987 MiB/s    2128 MiB/s    1761 MiB/s
>   64k      6068 MiB/s    9472 MiB/s    4475 MiB/s
>   1M       5471 MiB/s    8657 MiB/s    4405 MiB/s
> 
>   iomap:
>   bs      read hole   read cache       read data
>   1k       643  MiB/s    653  MiB/s    602  MiB/s
>   4k       2075 MiB/s    2159 MiB/s    1716 MiB/s
>   64k      6267 MiB/s    9545MiB/s     4451 MiB/s
>   1M       6072 MiB/s    9191MiB/s     4467 MiB/s

What is read cache vs read data here?

Otherwise same comments as for the write case.
Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Zhang Yi 4 days, 3 hours ago
Hi, Christoph!

On 2/3/2026 2:43 PM, Christoph Hellwig wrote:
>> Original Cover (Updated):
> 
> This really should always be first.  The updates are rather minor
> compared to the overview that the cover letter provides.
> 
>> Key notes on the iomap implementations in this series.
>>  - Don't use ordered data mode to prevent exposing stale data when
>>    performing append write and truncating down.
> 
> I can't parse this.

Thank you for looking into this series, and sorry for the lack of
clarity. The reasons behind these key notes are described in
detail in patches 12-13.

This means that the ordered journal mode is no longer used in ext4
under the iomap infrastructure.  The main reason is that iomap
processes each folio one by one during writeback. It first holds the
folio lock and then starts a transaction to create the block mapping.
If we still used the ordered mode, we would need to perform data
writeback during the journal commit process, which may require
initiating a new transaction, potentially leading to deadlocks. In
addition, the ordered journal mode indeed has many synchronization
dependencies, which increase the risk of deadlocks, and I believe
this is one of the reasons why ext4_do_writepages() is implemented in
such a complicated manner. Therefore, I think we need to give up
using the ordered data mode.
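
In rough pseudocode, the per-folio ordering looks like this (a
simplified sketch, not the actual patch code; the helper names and
arguments are only illustrative):

    /* simplified sketch of the iomap writeback ordering, per folio */
    folio_lock(folio);          /* iomap takes the folio lock first */
    handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
    ext4_map_blocks(handle, inode, &map, flags); /* create the mapping */
    ext4_journal_stop(handle);
    /* the data I/O is submitted after the handle is stopped */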

Currently, there are three scenarios where the ordered mode is used:
1) append write,
2) partial block truncate down, and
3) online defragmentation.

For append write, we can always allocate unwritten blocks to avoid
using the ordered journal mode. For partial block truncate down, we
can explicitly perform a write-back. The third case is the only one
that will be somewhat more complex. It needs to use the ordered mode
to ensure the atomicity of data copying and extents exchange when
exchanging extents and copying data between two files, preventing
data loss. Considering performance, we cannot explicitly perform a
writeback for each extent exchange. I have not yet thought of a
simple way to handle this. This will require consideration of other
solutions when supporting online defragmentation in the future.

> 
>>  - Override dioread_nolock mount option, always allocate unwritten
>>    extents for new blocks.
> 
> Why do you override it?

There are two reasons:

The first is the previously mentioned reason for not using the
ordered journal mode. To prevent exposing stale data in the event of
a power failure during append writes, unwritten extents are always
requested for newly allocated blocks.

The second is writeback performance. When doing writeback, we should
allocate as many blocks as possible on the first call to
->writeback_range(), based on the writeback length, rather than
mapping each folio individually. Therefore, to avoid the situation
where more blocks are allocated than actually written (which could
cause fsck to complain), we cannot directly allocate written blocks
before performing writeback.
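
A minimal sketch of this flow (simplified, not the actual patch
code):

    /* at mapping time: allocate new blocks as unwritten */
    ext4_map_blocks(handle, inode, &map,
                    EXT4_GET_BLOCKS_IO_CREATE_EXT);

    /* ... the data I/O completes ... */

    /* convert only the range that was actually written */
    ext4_convert_unwritten_extents(handle, inode, offset, written);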

> 
>>  - When performing write back, don't use reserved journal handle and
>>    postponing updating i_disksize until I/O is done.
> 
> Again missing the why and the implications.

The reserved journal handle is used to solve deadlock issues in
transaction dependencies when writeback occurs in ordered journal
mode. This mechanism is no longer necessary if the ordered mode is
not used.

> 
>>  buffered write
>>  ==============
>>
>>   buffer_head:
>>   bs      write cache    uncached write
>>   1k       423  MiB/s      36.3 MiB/s
>>   4k       1067 MiB/s      58.4 MiB/s
>>   64k      4321 MiB/s      869  MiB/s
>>   1M       4640 MiB/s      3158 MiB/s
>>   
>>   iomap:
>>   bs      write cache    uncached write
>>   1k       403  MiB/s      57   MiB/s
>>   4k       1093 MiB/s      61   MiB/s
>>   64k      6488 MiB/s      1206 MiB/s
>>   1M       7378 MiB/s      4818 MiB/s
> 
> This would read better if you actually compated buffered_head
> vs iomap side by side.
> 
> What is the bs?  The read unit size?  I guess not the file system
> block size as some of the values are too large for that.

The 'bs' is the read/write unit size, and the fs block size is the
default 4KB.

> 
> Looks like iomap is faster, often much faster except for the
> 1k cached case, where it is slightly slower.  Do you have
> any idea why?

I looked at the on-CPU flame graph. I think the main reason is that
the buffer_head path caches the folio and buffer_head state: it saves
the uptodate flag in the buffer_head structure on the first 1KB write
to each 4KB folio, so it doesn't need to get blocks for the remaining
three writes.  However, the iomap infrastructure always calls
->iomap_begin() to acquire the mapping info for each 1KB write.
Although the first call to ->iomap_begin() has already allocated the
block extent, subsequent calls still incur some overhead from
synchronization operations such as locking. The smaller the unit
size, the greater the impact, and this affects pure cached writes
more than uncached writes.
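
This corresponds to the shape of the iomap buffered write loop
(heavily simplified from fs/iomap/buffered-io.c, not verbatim):

    /* heavily simplified shape of iomap_file_buffered_write() */
    while (iomap_iter(&iter, ops) > 0) {
            /*
             * ->iomap_begin() runs inside iomap_iter() on every
             * iteration, re-taking the mapping locks even when a
             * previous 1KB write already allocated the extent.
             */
            iter.status = iomap_write_iter(&iter, i);
    }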

> 
>>  buffered read
>>  =============
>>
>>   buffer_head:
>>   bs      read hole   read cache      read data
>>   1k       635  MiB/s    661  MiB/s    605  MiB/s
>>   4k       1987 MiB/s    2128 MiB/s    1761 MiB/s
>>   64k      6068 MiB/s    9472 MiB/s    4475 MiB/s
>>   1M       5471 MiB/s    8657 MiB/s    4405 MiB/s
>>
>>   iomap:
>>   bs      read hole   read cache       read data
>>   1k       643  MiB/s    653  MiB/s    602  MiB/s
>>   4k       2075 MiB/s    2159 MiB/s    1716 MiB/s
>>   64k      6267 MiB/s    9545MiB/s     4451 MiB/s
>>   1M       6072 MiB/s    9191MiB/s     4467 MiB/s
> 
> What is read cache vs read data here?
> 

The 'read cache' case means that pre_read was set to 1 in the fio
tests, so fio reads data that is already cached. In contrast, in the
'read data' case pre_read was set to 0, so fio always reads the data
directly from the disk.
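
For example, a 'read cache' job would look something like this
(illustrative only; path and size are placeholders):

    [read-cache]
    rw=read
    ioengine=psync
    bs=64k
    size=16G
    ; pre_read faults the file into the page cache before the timed
    ; reads; the 'read data' case uses pre_read=0
    pre_read=1
    filename=/mnt/test/fio-file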

Thanks,
Yi.


> Otherwise same comments as for the write case.
>
Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Theodore Tso 3 days, 23 hours ago
On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
> This means that the ordered journal mode is no longer in ext4 used
> under the iomap infrastructure.  The main reason is that iomap
> processes each folio one by one during writeback. It first holds the
> folio lock and then starts a transaction to create the block mapping.
> If we still use the ordered mode, we need to perform writeback in
> the logging process, which may require initiating a new transaction,
> potentially leading to deadlock issues. In addition, ordered journal
> mode indeed has many synchronization dependencies, which increase
> the risk of deadlocks, and I believe this is one of the reasons why
> ext4_do_writepages() is implemented in such a complicated manner.
> Therefore, I think we need to give up using the ordered data mode.
> 
> Currently, there are three scenarios where the ordered mode is used:
> 1) append write,
> 2) partial block truncate down, and
> 3) online defragmentation.
> 
> For append write, we can always allocate unwritten blocks to avoid
> using the ordered journal mode.

This is going to be a pretty severe performance regression, since it
means that we will be doubling the journal load for append writes.
What we really need to do here is to first write out the data blocks,
and then only start the transaction handle to map the data blocks
*after* the data blocks have been written (to heretofore unused
blocks that were just allocated).  It means inverting the order in
which we write data blocks for the append write case, and in fact it
will improve fsync() performance since we won't be gating writing the
commit block on the data blocks getting written out in the append
write case.
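
In outline (a sketch only, not actual code):

    /*
     * Inverted append-write ordering, in outline:
     *
     *  1. pick heretofore-unused blocks for the append;
     *  2. write the data into those blocks and wait for completion;
     *  3. only then start a handle to add the blocks to the extent
     *     tree and update i_disksize.
     *
     * A crash before step 3 leaves no metadata pointing at the new
     * blocks, so no stale data is exposed and no unwritten-extent
     * conversion is needed.
     */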

Cheers,

					- Ted
Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Baokun Li 3 days, 10 hours ago
On 2026-02-03 21:14, Theodore Tso wrote:
> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>> This means that the ordered journal mode is no longer in ext4 used
>> under the iomap infrastructure.  The main reason is that iomap
>> processes each folio one by one during writeback. It first holds the
>> folio lock and then starts a transaction to create the block mapping.
>> If we still use the ordered mode, we need to perform writeback in
>> the logging process, which may require initiating a new transaction,
>> potentially leading to deadlock issues. In addition, ordered journal
>> mode indeed has many synchronization dependencies, which increase
>> the risk of deadlocks, and I believe this is one of the reasons why
>> ext4_do_writepages() is implemented in such a complicated manner.
>> Therefore, I think we need to give up using the ordered data mode.
>>
>> Currently, there are three scenarios where the ordered mode is used:
>> 1) append write,
>> 2) partial block truncate down, and
>> 3) online defragmentation.
>>
>> For append write, we can always allocate unwritten blocks to avoid
>> using the ordered journal mode.
> This is going to be a pretty severe performance regression, since it
> means that we will be doubling the journal load for append writes.
> What we really need to do here is to first write out the data blocks,
> and then only start the transaction handle to modify the data blocks
> *after* the data blocks have been written (to heretofore, unused
> blocks that were just allocated).  It means inverting the order in
> which we write data blocks for the append write case, and in fact it
> will improve fsync() performance since we won't be gating writing the
> commit block on the date blocks getting written out in the append
> write case.

I have some local demo patches doing something similar, and I think this
work could be decoupled from Yi's patch set.

Since inode preallocation (PA) maintains physical block occupancy with a
logical-to-physical mapping, and ensures on-disk data consistency after
power failure, it is an excellent location for recording temporary
occupancy. Furthermore, since inode PA often allocates more blocks than
requested, it can also help reduce file fragmentation.

The specific approach is as follows:

1. Allocate only the PA during block allocation without inserting it into
   the extent status tree. Return the PA to the caller and increment its
   refcount to prevent it from being discarded.

2. Issue IOs to the blocks within the inode PA. If IO fails, release the
   refcount and return -EIO. If successful, proceed to the next step.

3. Start a handle upon successful IO completion to convert the inode PA to
   extents. Release the refcount and update the extent tree.

4. If a corresponding extent already exists, we’ll need to punch holes to
   release the old extent before inserting the new one.
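
In rough pseudocode (every name below is hypothetical):

    /* rough pseudocode of the delay-map flow; all names hypothetical */
    pa = ext4_alloc_inode_pa(inode, lblk, len);  /* step 1: PA only */
    pa_get(pa);                                  /* pin against discard */

    err = write_data_to_pa_blocks(inode, pa);    /* step 2: issue the IO */
    if (err) {
            pa_put(pa);
            return -EIO;
    }

    /* step 3: on IO success, publish the mapping under a handle */
    handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
    convert_pa_to_written_extent(handle, inode, pa);
    pa_put(pa);
    ext4_journal_stop(handle);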

This ensures data atomicity, while jbd2—being a COW-like implementation
itself—ensures metadata atomicity. By leveraging this "delay map"
mechanism, we can achieve several benefits:

 * Lightweight, high-performance COW.
 * High-performance software atomic writes (hardware-independent).
 * Replacing dioread_nolock, which might otherwise read unexpected zeros.
 * Replacing ordered data and data journal modes.
 * Reduced handle hold time, as it's only held during extent tree updates.
 * Paving the way for snapshot support.

Of course, COW itself can lead to severe file fragmentation, especially
in small-scale overwrite scenarios.

Perhaps I’ve overlooked something. What are your thoughts?


Regards,
Baokun

Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Jan Kara 2 days, 22 hours ago
On Wed 04-02-26 09:59:36, Baokun Li wrote:
> On 2026-02-03 21:14, Theodore Tso wrote:
> > On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
> >> This means that the ordered journal mode is no longer in ext4 used
> >> under the iomap infrastructure.  The main reason is that iomap
> >> processes each folio one by one during writeback. It first holds the
> >> folio lock and then starts a transaction to create the block mapping.
> >> If we still use the ordered mode, we need to perform writeback in
> >> the logging process, which may require initiating a new transaction,
> >> potentially leading to deadlock issues. In addition, ordered journal
> >> mode indeed has many synchronization dependencies, which increase
> >> the risk of deadlocks, and I believe this is one of the reasons why
> >> ext4_do_writepages() is implemented in such a complicated manner.
> >> Therefore, I think we need to give up using the ordered data mode.
> >>
> >> Currently, there are three scenarios where the ordered mode is used:
> >> 1) append write,
> >> 2) partial block truncate down, and
> >> 3) online defragmentation.
> >>
> >> For append write, we can always allocate unwritten blocks to avoid
> >> using the ordered journal mode.
> > This is going to be a pretty severe performance regression, since it
> > means that we will be doubling the journal load for append writes.
> > What we really need to do here is to first write out the data blocks,
> > and then only start the transaction handle to modify the data blocks
> > *after* the data blocks have been written (to heretofore, unused
> > blocks that were just allocated).  It means inverting the order in
> > which we write data blocks for the append write case, and in fact it
> > will improve fsync() performance since we won't be gating writing the
> > commit block on the date blocks getting written out in the append
> > write case.
> 
> I have some local demo patches doing something similar, and I think this
> work could be decoupled from Yi's patch set.
> 
> Since inode preallocation (PA) maintains physical block occupancy with a
> logical-to-physical mapping, and ensures on-disk data consistency after
> power failure, it is an excellent location for recording temporary
> occupancy. Furthermore, since inode PA often allocates more blocks than
> requested, it can also help reduce file fragmentation.
> 
> The specific approach is as follows:
> 
> 1. Allocate only the PA during block allocation without inserting it into
>    the extent status tree. Return the PA to the caller and increment its
>    refcount to prevent it from being discarded.
> 
> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>    refcount and return -EIO. If successful, proceed to the next step.
> 
> 3. Start a handle upon successful IO completion to convert the inode PA to
>    extents. Release the refcount and update the extent tree.
> 
> 4. If a corresponding extent already exists, we’ll need to punch holes to
>    release the old extent before inserting the new one.

Sounds good. Just if I understand correctly case 4 would happen only if you
really try to do something like COW with this? Normally you'd just use the
already present blocks and write contents into them?

> This ensures data atomicity, while jbd2—being a COW-like implementation
> itself—ensures metadata atomicity. By leveraging this "delay map"
> mechanism, we can achieve several benefits:
> 
>  * Lightweight, high-performance COW.
>  * High-performance software atomic writes (hardware-independent).
>  * Replacing dio_readnolock, which might otherwise read unexpected zeros.
>  * Replacing ordered data and data journal modes.
>  * Reduced handle hold time, as it's only held during extent tree updates.
>  * Paving the way for snapshot support.
> 
> Of course, COW itself can lead to severe file fragmentation, especially
> in small-scale overwrite scenarios.

I agree the feature can provide very interesting benefits and we were
pondering about something like that for a long time, just never got to
implementing it. I'd say the immediate benefits are you can completely get
rid of dioread_nolock as well as the legacy dioread_lock modes so overall
code complexity should not increase much. We could also mostly get rid of
data=ordered mode use (although not completely - see my discussion with
Zhang over patch 3) which would be also welcome simplification. These
benefits alone are IMO a good enough reason to have the functionality :).
Even without COW, atomic writes and other fancy stuff.

I don't see how you want to get rid of data=journal mode - perhaps that's
related to the COW functionality?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Baokun Li 2 days, 9 hours ago
On 2026-02-04 22:23, Jan Kara wrote:
> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>> On 2026-02-03 21:14, Theodore Tso wrote:
>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>> This means that the ordered journal mode is no longer in ext4 used
>>>> under the iomap infrastructure.  The main reason is that iomap
>>>> processes each folio one by one during writeback. It first holds the
>>>> folio lock and then starts a transaction to create the block mapping.
>>>> If we still use the ordered mode, we need to perform writeback in
>>>> the logging process, which may require initiating a new transaction,
>>>> potentially leading to deadlock issues. In addition, ordered journal
>>>> mode indeed has many synchronization dependencies, which increase
>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>
>>>> Currently, there are three scenarios where the ordered mode is used:
>>>> 1) append write,
>>>> 2) partial block truncate down, and
>>>> 3) online defragmentation.
>>>>
>>>> For append write, we can always allocate unwritten blocks to avoid
>>>> using the ordered journal mode.
>>> This is going to be a pretty severe performance regression, since it
>>> means that we will be doubling the journal load for append writes.
>>> What we really need to do here is to first write out the data blocks,
>>> and then only start the transaction handle to modify the data blocks
>>> *after* the data blocks have been written (to heretofore, unused
>>> blocks that were just allocated).  It means inverting the order in
>>> which we write data blocks for the append write case, and in fact it
>>> will improve fsync() performance since we won't be gating writing the
>>> commit block on the date blocks getting written out in the append
>>> write case.
>> I have some local demo patches doing something similar, and I think this
>> work could be decoupled from Yi's patch set.
>>
>> Since inode preallocation (PA) maintains physical block occupancy with a
>> logical-to-physical mapping, and ensures on-disk data consistency after
>> power failure, it is an excellent location for recording temporary
>> occupancy. Furthermore, since inode PA often allocates more blocks than
>> requested, it can also help reduce file fragmentation.
>>
>> The specific approach is as follows:
>>
>> 1. Allocate only the PA during block allocation without inserting it into
>>    the extent status tree. Return the PA to the caller and increment its
>>    refcount to prevent it from being discarded.
>>
>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>>    refcount and return -EIO. If successful, proceed to the next step.
>>
>> 3. Start a handle upon successful IO completion to convert the inode PA to
>>    extents. Release the refcount and update the extent tree.
>>
>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>>    release the old extent before inserting the new one.
> Sounds good. Just if I understand correctly case 4 would happen only if you
> really try to do something like COW with this? Normally you'd just use the
> already present blocks and write contents into them?

Yes, case 4 only needs to be considered when implementing COW.

>
>> This ensures data atomicity, while jbd2—being a COW-like implementation
>> itself—ensures metadata atomicity. By leveraging this "delay map"
>> mechanism, we can achieve several benefits:
>>
>>  * Lightweight, high-performance COW.
>>  * High-performance software atomic writes (hardware-independent).
>>  * Replacing dio_readnolock, which might otherwise read unexpected zeros.
>>  * Replacing ordered data and data journal modes.
>>  * Reduced handle hold time, as it's only held during extent tree updates.
>>  * Paving the way for snapshot support.
>>
>> Of course, COW itself can lead to severe file fragmentation, especially
>> in small-scale overwrite scenarios.
> I agree the feature can provide very interesting benefits and we were
> pondering about something like that for a long time, just never got to
> implementing it. I'd say the immediate benefits are you can completely get
> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
> code complexity should not increase much. We could also mostly get rid of
> data=ordered mode use (although not completely - see my discussion with
> Zhang over patch 3) which would be also welcome simplification. These
> benefits alone are IMO a good enough reason to have the functionality :).
> Even without COW, atomic writes and other fancy stuff.

Glad you liked the 'delay map' concept (naming suggestions are welcome!).

With delay-map in place, implementing COW only requires handling overwrite
scenarios, and software atomic writes can be achieved by enabling atomic
delay-maps across multiple PAs.

I expect to send out a minimal RFC version for discussion in a few weeks.

I will share some additional thoughts regarding EOF blocks and
data=ordered mode in patch 3.

Thanks for your feedback!

>
> I don't see how you want to get rid of data=journal mode - perhaps that's
> related to the COW functionality?
>
> 								Honza

Yes. The only real advantage of data=journal mode over data=ordered is
its guarantee of data atomicity for overwrites.

If we can achieve this through COW-based software atomic writes, we can
move away from the performance-heavy data=journal mode.


Cheers,
Baokun

Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Jan Kara 2 days ago
On Thu 05-02-26 10:55:59, Baokun Li wrote:
> > I don't see how you want to get rid of data=journal mode - perhaps that's
> > related to the COW functionality?
> 
> Yes. The only real advantage of data=journal mode over data=ordered is
> its guarantee of data atomicity for overwrites.
> 
> If we can achieve this through COW-based software atomic writes, we can
> move away from the performance-heavy data=journal mode.

Hum, I don't think data=journal actually currently offers overwrite
atomicity - even in data=journal mode you can observe only part of the
write completed after a crash. But what it does offer and why people tend
to use it is that it offers strict linear ordering guarantees between all
data and metadata operations happening in the system. Thus if you can prove
that operation A completed before operation B started, then you are
guaranteed that even after crash you will not see B done and A not done.
This is a very strong consistency guarantee that makes life simpler for the
applications so people that can afford the performance cost and care a lot
about crash safety like it.

I think you are correct that with COW and a bit of care it could be
possible to achieve these guarantees even without journalling data. But I'd
need to think more about it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Zhang Yi 2 days, 10 hours ago
On 2/4/2026 10:23 PM, Jan Kara wrote:
> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>> On 2026-02-03 21:14, Theodore Tso wrote:
>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>> This means that the ordered journal mode is no longer in ext4 used
>>>> under the iomap infrastructure.  The main reason is that iomap
>>>> processes each folio one by one during writeback. It first holds the
>>>> folio lock and then starts a transaction to create the block mapping.
>>>> If we still use the ordered mode, we need to perform writeback in
>>>> the logging process, which may require initiating a new transaction,
>>>> potentially leading to deadlock issues. In addition, ordered journal
>>>> mode indeed has many synchronization dependencies, which increase
>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>
>>>> Currently, there are three scenarios where the ordered mode is used:
>>>> 1) append write,
>>>> 2) partial block truncate down, and
>>>> 3) online defragmentation.
>>>>
>>>> For append write, we can always allocate unwritten blocks to avoid
>>>> using the ordered journal mode.
>>> This is going to be a pretty severe performance regression, since it
>>> means that we will be doubling the journal load for append writes.
>>> What we really need to do here is to first write out the data blocks,
>>> and then only start the transaction handle to modify the data blocks
>>> *after* the data blocks have been written (to heretofore, unused
>>> blocks that were just allocated).  It means inverting the order in
>>> which we write data blocks for the append write case, and in fact it
>>> will improve fsync() performance since we won't be gating writing the
>>> commit block on the date blocks getting written out in the append
>>> write case.
>>
>> I have some local demo patches doing something similar, and I think this
>> work could be decoupled from Yi's patch set.
>>
>> Since inode preallocation (PA) maintains physical block occupancy with a
>> logical-to-physical mapping, and ensures on-disk data consistency after
>> power failure, it is an excellent location for recording temporary
>> occupancy. Furthermore, since inode PA often allocates more blocks than
>> requested, it can also help reduce file fragmentation.
>>
>> The specific approach is as follows:
>>
>> 1. Allocate only the PA during block allocation without inserting it into
>>    the extent status tree. Return the PA to the caller and increment its
>>    refcount to prevent it from being discarded.
>>
>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>>    refcount and return -EIO. If successful, proceed to the next step.
>>
>> 3. Start a handle upon successful IO completion to convert the inode PA to
>>    extents. Release the refcount and update the extent tree.
>>
>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>>    release the old extent before inserting the new one.
> 
> Sounds good. Just if I understand correctly case 4 would happen only if you
> really try to do something like COW with this? Normally you'd just use the
> already present blocks and write contents into them?
> 
>> This ensures data atomicity, while jbd2—being a COW-like implementation
>> itself—ensures metadata atomicity. By leveraging this "delay map"
>> mechanism, we can achieve several benefits:
>>
>>  * Lightweight, high-performance COW.
>>  * High-performance software atomic writes (hardware-independent).
>>  * Replacing dio_readnolock, which might otherwise read unexpected zeros.
>>  * Replacing ordered data and data journal modes.
>>  * Reduced handle hold time, as it's only held during extent tree updates.
>>  * Paving the way for snapshot support.
>>
>> Of course, COW itself can lead to severe file fragmentation, especially
>> in small-scale overwrite scenarios.
> 
> I agree the feature can provide very interesting benefits and we were
> pondering about something like that for a long time, just never got to
> implementing it. I'd say the immediate benefits are you can completely get
> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
> code complexity should not increase much. We could also mostly get rid of
> data=ordered mode use (although not completely - see my discussion with
> Zhang over patch 3) which would be also welcome simplification. These

I suppose this feature can also be used to get rid of the data=ordered
mode use in online defragmentation. With this feature, perhaps we can
develop a new method of online defragmentation that eliminates the need
to pre-allocate a donor file.  Instead, we can attempt to allocate as
many contiguous blocks as possible through PA. If the allocated length
is longer than the original extent, we can perform the swap and copy
the data. Once the copy is complete, we can atomically construct a new
extent, then release the original blocks synchronously or
asynchronously, similar to a regular copy-on-write (COW) operation.
How does this sound?

Regards,
Yi.

> benefits alone are IMO a good enough reason to have the functionality :).
> Even without COW, atomic writes and other fancy stuff.
> 
> I don't see how you want to get rid of data=journal mode - perhaps that's
> related to the COW functionality?
> 
> 								Honza

Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Jan Kara 1 day, 23 hours ago
On Thu 05-02-26 10:06:11, Zhang Yi wrote:
> On 2/4/2026 10:23 PM, Jan Kara wrote:
> > On Wed 04-02-26 09:59:36, Baokun Li wrote:
> >> On 2026-02-03 21:14, Theodore Tso wrote:
> >>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
> >>>> This means that the ordered journal mode is no longer in ext4 used
> >>>> under the iomap infrastructure.  The main reason is that iomap
> >>>> processes each folio one by one during writeback. It first holds the
> >>>> folio lock and then starts a transaction to create the block mapping.
> >>>> If we still use the ordered mode, we need to perform writeback in
> >>>> the logging process, which may require initiating a new transaction,
> >>>> potentially leading to deadlock issues. In addition, ordered journal
> >>>> mode indeed has many synchronization dependencies, which increase
> >>>> the risk of deadlocks, and I believe this is one of the reasons why
> >>>> ext4_do_writepages() is implemented in such a complicated manner.
> >>>> Therefore, I think we need to give up using the ordered data mode.
> >>>>
> >>>> Currently, there are three scenarios where the ordered mode is used:
> >>>> 1) append write,
> >>>> 2) partial block truncate down, and
> >>>> 3) online defragmentation.
> >>>>
> >>>> For append write, we can always allocate unwritten blocks to avoid
> >>>> using the ordered journal mode.
> >>> This is going to be a pretty severe performance regression, since it
> >>> means that we will be doubling the journal load for append writes.
> >>> What we really need to do here is to first write out the data blocks,
> >>> and then only start the transaction handle to modify the data blocks
> >>> *after* the data blocks have been written (to heretofore, unused
> >>> blocks that were just allocated).  It means inverting the order in
> >>> which we write data blocks for the append write case, and in fact it
> >>> will improve fsync() performance since we won't be gating writing the
> >>> commit block on the date blocks getting written out in the append
> >>> write case.
> >>
> >> I have some local demo patches doing something similar, and I think this
> >> work could be decoupled from Yi's patch set.
> >>
> >> Since inode preallocation (PA) maintains physical block occupancy with a
> >> logical-to-physical mapping, and ensures on-disk data consistency after
> >> power failure, it is an excellent location for recording temporary
> >> occupancy. Furthermore, since inode PA often allocates more blocks than
> >> requested, it can also help reduce file fragmentation.
> >>
> >> The specific approach is as follows:
> >>
> >> 1. Allocate only the PA during block allocation without inserting it into
> >>    the extent status tree. Return the PA to the caller and increment its
> >>    refcount to prevent it from being discarded.
> >>
> >> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
> >>    refcount and return -EIO. If successful, proceed to the next step.
> >>
> >> 3. Start a handle upon successful IO completion to convert the inode PA to
> >>    extents. Release the refcount and update the extent tree.
> >>
> >> 4. If a corresponding extent already exists, we’ll need to punch holes to
> >>    release the old extent before inserting the new one.
> > 
> > Sounds good. Just if I understand correctly case 4 would happen only if you
> > really try to do something like COW with this? Normally you'd just use the
> > already present blocks and write contents into them?
> > 
> >> This ensures data atomicity, while jbd2—being a COW-like implementation
> >> itself—ensures metadata atomicity. By leveraging this "delay map"
> >> mechanism, we can achieve several benefits:
> >>
> >>  * Lightweight, high-performance COW.
> >>  * High-performance software atomic writes (hardware-independent).
> >>  * Replacing dio_readnolock, which might otherwise read unexpected zeros.
> >>  * Replacing ordered data and data journal modes.
> >>  * Reduced handle hold time, as it's only held during extent tree updates.
> >>  * Paving the way for snapshot support.
> >>
> >> Of course, COW itself can lead to severe file fragmentation, especially
> >> in small-scale overwrite scenarios.
> > 
> > I agree the feature can provide very interesting benefits and we were
> > pondering about something like that for a long time, just never got to
> > implementing it. I'd say the immediate benefits are you can completely get
> > rid of dioread_nolock as well as the legacy dioread_lock modes so overall
> > code complexity should not increase much. We could also mostly get rid of
> > data=ordered mode use (although not completely - see my discussion with
> > Zhang over patch 3) which would be also welcome simplification. These
> 
> I suppose this feature can also be used to get rid of the data=ordered mode
> use in online defragmentation. With this feature, perhaps we can develop a
> new method of online defragmentation that eliminates the need to pre-allocate
> a donor file.  Instead, we can attempt to allocate as many contiguous blocks
> as possible through PA. If the allocated length is longer than the original
> extent, we can perform the swap and copy the data. Once the copy is complete,
> we can atomically construct a new extent, then releases the original blocks
> synchronously or asynchronously, similar to a regular copy-on-write (COW)
> operation. What does this sounds?

Well, the reason why defragmentation uses the donor file is that there can
be a lot of policy in where and how the file is exactly placed (e.g. you
might want to place multiple files together). It was decided it is too
complex to implement these policies in the kernel so we've offloaded the
decision where the file is placed to userspace. Back at those times we were
also considering adding interface to guide allocation of blocks for a file
so the userspace defragmenter could prepare donor file with desired blocks.
But then the interest in defragmentation dropped (particularly due to
advances in flash storage) and so these ideas never materialized.

We might rethink the online defragmentation interface but at this point
I'm not sure we are ready to completely replace the idea of guiding the
block placement using a donor file...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Zhang Yi 1 day, 10 hours ago
On 2/5/2026 8:58 PM, Jan Kara wrote:
> On Thu 05-02-26 10:06:11, Zhang Yi wrote:
>> On 2/4/2026 10:23 PM, Jan Kara wrote:
>>> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>>>> On 2026-02-03 21:14, Theodore Tso wrote:
>>>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>>>> This means that the ordered journal mode is no longer in ext4 used
>>>>>> under the iomap infrastructure.  The main reason is that iomap
>>>>>> processes each folio one by one during writeback. It first holds the
>>>>>> folio lock and then starts a transaction to create the block mapping.
>>>>>> If we still use the ordered mode, we need to perform writeback in
>>>>>> the logging process, which may require initiating a new transaction,
>>>>>> potentially leading to deadlock issues. In addition, ordered journal
>>>>>> mode indeed has many synchronization dependencies, which increase
>>>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>>>
>>>>>> Currently, there are three scenarios where the ordered mode is used:
>>>>>> 1) append write,
>>>>>> 2) partial block truncate down, and
>>>>>> 3) online defragmentation.
>>>>>>
>>>>>> For append write, we can always allocate unwritten blocks to avoid
>>>>>> using the ordered journal mode.
>>>>> This is going to be a pretty severe performance regression, since it
>>>>> means that we will be doubling the journal load for append writes.
>>>>> What we really need to do here is to first write out the data blocks,
>>>>> and then only start the transaction handle to modify the data blocks
>>>>> *after* the data blocks have been written (to heretofore, unused
>>>>> blocks that were just allocated).  It means inverting the order in
>>>>> which we write data blocks for the append write case, and in fact it
>>>>> will improve fsync() performance since we won't be gating writing the
>>>>> commit block on the date blocks getting written out in the append
>>>>> write case.
>>>>
>>>> I have some local demo patches doing something similar, and I think this
>>>> work could be decoupled from Yi's patch set.
>>>>
>>>> Since inode preallocation (PA) maintains physical block occupancy with a
>>>> logical-to-physical mapping, and ensures on-disk data consistency after
>>>> power failure, it is an excellent location for recording temporary
>>>> occupancy. Furthermore, since inode PA often allocates more blocks than
>>>> requested, it can also help reduce file fragmentation.
>>>>
>>>> The specific approach is as follows:
>>>>
>>>> 1. Allocate only the PA during block allocation without inserting it into
>>>>    the extent status tree. Return the PA to the caller and increment its
>>>>    refcount to prevent it from being discarded.
>>>>
>>>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>>>>    refcount and return -EIO. If successful, proceed to the next step.
>>>>
>>>> 3. Start a handle upon successful IO completion to convert the inode PA to
>>>>    extents. Release the refcount and update the extent tree.
>>>>
>>>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>>>>    release the old extent before inserting the new one.
>>>
>>> Sounds good. Just if I understand correctly case 4 would happen only if you
>>> really try to do something like COW with this? Normally you'd just use the
>>> already present blocks and write contents into them?
>>>
>>>> This ensures data atomicity, while jbd2—being a COW-like implementation
>>>> itself—ensures metadata atomicity. By leveraging this "delay map"
>>>> mechanism, we can achieve several benefits:
>>>>
>>>>  * Lightweight, high-performance COW.
>>>>  * High-performance software atomic writes (hardware-independent).
>>>>  * Replacing dio_readnolock, which might otherwise read unexpected zeros.
>>>>  * Replacing ordered data and data journal modes.
>>>>  * Reduced handle hold time, as it's only held during extent tree updates.
>>>>  * Paving the way for snapshot support.
>>>>
>>>> Of course, COW itself can lead to severe file fragmentation, especially
>>>> in small-scale overwrite scenarios.
>>>
>>> I agree the feature can provide very interesting benefits and we were
>>> pondering about something like that for a long time, just never got to
>>> implementing it. I'd say the immediate benefits are you can completely get
>>> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
>>> code complexity should not increase much. We could also mostly get rid of
>>> data=ordered mode use (although not completely - see my discussion with
>>> Zhang over patch 3) which would be also welcome simplification. These
>>
>> I suppose this feature can also be used to get rid of the data=ordered mode
>> use in online defragmentation. With this feature, perhaps we can develop a
>> new method of online defragmentation that eliminates the need to pre-allocate
>> a donor file.  Instead, we can attempt to allocate as many contiguous blocks
>> as possible through PA. If the allocated length is longer than the original
>> extent, we can perform the swap and copy the data. Once the copy is complete,
>> we can atomically construct a new extent, then releases the original blocks
>> synchronously or asynchronously, similar to a regular copy-on-write (COW)
>> operation. What does this sounds?
> 
> Well, the reason why defragmentation uses the donor file is that there can
> be a lot of policy in where and how the file is exactly placed (e.g. you
> might want to place multiple files together). It was decided it is too
> complex to implement these policies in the kernel so we've offloaded the
> decision where the file is placed to userspace. Back at those times we were
> also considering adding interface to guide allocation of blocks for a file
> so the userspace defragmenter could prepare donor file with desired blocks.

Indeed, it is easier to implement different strategies through donor files.

> But then the interest in defragmentation dropped (particularly due to
> advances in flash storage) and so these ideas never materialized.

As I understand it, defragmentation offers two primary benefits:

1. It improves the contiguity of file blocks, thereby enhancing read/write
   performance;
2. It reduces the overhead on the block allocator and the management of
   metadata.

As for the first point, indeed, this role has gradually diminished with
the development of flash storage devices. However, I believe the second
point is still very useful. For example, some of our customers have
scenarios involving large-capacity storage, where data is continuously
written in a cyclic manner. This keeps disk space usage at a high level
for a long time, with a large number of both big and small files. Over
time, as fragmentation increases, the CPU usage of the mb allocator
rises significantly. Although this issue can be alleviated to some
extent through optimizations of the mb allocator algorithm and the use
of other preallocation techniques, we still find online defragmentation
to be very necessary.

> 
> We might rethink the online defragmentation interface but at this point
> I'm not sure we are ready to completely replace the idea of guiding the
> block placement using a donor file...
> 
> 								Honza

Yeah, we can rethink it when supporting online defragmentation for the iomap
path.

Cheers,
Yi.

Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Baokun Li 2 days, 9 hours ago
On 2026-02-05 10:06, Zhang Yi wrote:
> On 2/4/2026 10:23 PM, Jan Kara wrote:
>> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>>> On 2026-02-03 21:14, Theodore Tso wrote:
>>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>>> This means that the ordered journal mode is no longer in ext4 used
>>>>> under the iomap infrastructure.  The main reason is that iomap
>>>>> processes each folio one by one during writeback. It first holds the
>>>>> folio lock and then starts a transaction to create the block mapping.
>>>>> If we still use the ordered mode, we need to perform writeback in
>>>>> the logging process, which may require initiating a new transaction,
>>>>> potentially leading to deadlock issues. In addition, ordered journal
>>>>> mode indeed has many synchronization dependencies, which increase
>>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>>
>>>>> Currently, there are three scenarios where the ordered mode is used:
>>>>> 1) append write,
>>>>> 2) partial block truncate down, and
>>>>> 3) online defragmentation.
>>>>>
>>>>> For append write, we can always allocate unwritten blocks to avoid
>>>>> using the ordered journal mode.
>>>> This is going to be a pretty severe performance regression, since it
>>>> means that we will be doubling the journal load for append writes.
>>>> What we really need to do here is to first write out the data blocks,
>>>> and then only start the transaction handle to modify the data blocks
>>>> *after* the data blocks have been written (to heretofore, unused
>>>> blocks that were just allocated).  It means inverting the order in
>>>> which we write data blocks for the append write case, and in fact it
>>>> will improve fsync() performance since we won't be gating writing the
>>>> commit block on the date blocks getting written out in the append
>>>> write case.
>>> I have some local demo patches doing something similar, and I think this
>>> work could be decoupled from Yi's patch set.
>>>
>>> Since inode preallocation (PA) maintains physical block occupancy with a
>>> logical-to-physical mapping, and ensures on-disk data consistency after
>>> power failure, it is an excellent location for recording temporary
>>> occupancy. Furthermore, since inode PA often allocates more blocks than
>>> requested, it can also help reduce file fragmentation.
>>>
>>> The specific approach is as follows:
>>>
>>> 1. Allocate only the PA during block allocation without inserting it into
>>>    the extent status tree. Return the PA to the caller and increment its
>>>    refcount to prevent it from being discarded.
>>>
>>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>>>    refcount and return -EIO. If successful, proceed to the next step.
>>>
>>> 3. Start a handle upon successful IO completion to convert the inode PA to
>>>    extents. Release the refcount and update the extent tree.
>>>
>>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>>>    release the old extent before inserting the new one.
>> Sounds good. Just if I understand correctly case 4 would happen only if you
>> really try to do something like COW with this? Normally you'd just use the
>> already present blocks and write contents into them?
>>
>>> This ensures data atomicity, while jbd2—being a COW-like implementation
>>> itself—ensures metadata atomicity. By leveraging this "delay map"
>>> mechanism, we can achieve several benefits:
>>>
>>>  * Lightweight, high-performance COW.
>>>  * High-performance software atomic writes (hardware-independent).
>>>  * Replacing dio_readnolock, which might otherwise read unexpected zeros.
>>>  * Replacing ordered data and data journal modes.
>>>  * Reduced handle hold time, as it's only held during extent tree updates.
>>>  * Paving the way for snapshot support.
>>>
>>> Of course, COW itself can lead to severe file fragmentation, especially
>>> in small-scale overwrite scenarios.
>> I agree the feature can provide very interesting benefits and we were
>> pondering about something like that for a long time, just never got to
>> implementing it. I'd say the immediate benefits are you can completely get
>> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
>> code complexity should not increase much. We could also mostly get rid of
>> data=ordered mode use (although not completely - see my discussion with
>> Zhang over patch 3) which would be also welcome simplification. These
> I suppose this feature can also be used to get rid of the data=ordered mode
> use in online defragmentation. With this feature, perhaps we can develop a
> new method of online defragmentation that eliminates the need to pre-allocate
> a donor file.  Instead, we can attempt to allocate as many contiguous blocks
> as possible through PA. If the allocated length is longer than the original
> extent, we can perform the swap and copy the data. Once the copy is complete,
> we can atomically construct a new extent, then releases the original blocks
> synchronously or asynchronously, similar to a regular copy-on-write (COW)
> operation. What does this sounds?
>
> Regards,
> Yi.

Good idea! This is much more efficient than allocating files first and
then swapping them. While COW can exacerbate fragmentation, it can also
be leveraged for defragmentation.

We could monitor the average extent length of files within the kernel and
add those that fall below a certain threshold to a "pending defrag" list.
Defragmentation could then be triggered at an appropriate time. To ensure
the effectiveness of the defrag process, we could also set a minimum
length requirement for inode PAs.


Cheers,
Baokun

>> benefits alone are IMO a good enough reason to have the functionality :).
>> Even without COW, atomic writes and other fancy stuff.
>>
>> I don't see how you want to get rid of data=journal mode - perhaps that's
>> related to the COW functionality?
>>
>> 								Honza


Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
Posted by Zhang Yi 3 days, 11 hours ago
Hi, Ted.

On 2/3/2026 9:14 PM, Theodore Tso wrote:
> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>> This means that the ordered journal mode is no longer in ext4 used
>> under the iomap infrastructure.  The main reason is that iomap
>> processes each folio one by one during writeback. It first holds the
>> folio lock and then starts a transaction to create the block mapping.
>> If we still use the ordered mode, we need to perform writeback in
>> the logging process, which may require initiating a new transaction,
>> potentially leading to deadlock issues. In addition, ordered journal
>> mode indeed has many synchronization dependencies, which increase
>> the risk of deadlocks, and I believe this is one of the reasons why
>> ext4_do_writepages() is implemented in such a complicated manner.
>> Therefore, I think we need to give up using the ordered data mode.
>>
>> Currently, there are three scenarios where the ordered mode is used:
>> 1) append write,
>> 2) partial block truncate down, and
>> 3) online defragmentation.
>>
>> For append write, we can always allocate unwritten blocks to avoid
>> using the ordered journal mode.
> 
> This is going to be a pretty severe performance regression, since it
> means that we will be doubling the journal load for append writes.

Although this will double the journal load compared to directly
allocating written blocks, I don't think it will result in a
significant performance regression compared to the current append
write process, since it is consistent with the current behavior,
where dioread_nolock is enabled by default.

> What we really need to do here is to first write out the data blocks,
> and then only start the transaction handle to modify the data blocks
> *after* the data blocks have been written (to heretofore, unused
> blocks that were just allocated).  It means inverting the order in
> which we write data blocks for the append write case, and in fact it
> will improve fsync() performance since we won't be gating writing the
> commit block on the date blocks getting written out in the append
> write case.
> 

Yeah, thank you for the suggestion. I agree with you. We are planning
to implement this next, and Baokun is currently developing a POC. Our
current idea is to use inode PA to pre-allocate blocks before doing
writeback (the benefit of using PA is that it avoids changes to
on-disk metadata, and the preallocation operation stays contained
within the mb allocator), and then map the extents as written after
the data is written, which avoids the journal overhead of unwritten
allocations. At the same time, this could also lay the foundation for
future support of COW writes for reflink.

> Cheers,
> 
> 					- Ted

Thanks,
Yi.