From: Zhang Yi <yi.zhang@huaweicloud.com>
Changes since V1:
- Rebase this series on linux-next 20260122.
- Refactor the partial block zero range code: stop passing a handle to
  ext4_block_truncate_page() and ext4_zero_partial_blocks(), and move
  the partial block zeroing operation outside of an active journal
  transaction to prevent potential deadlocks caused by the lock
  ordering of the folio lock and the transaction start.
- Clarify the lock ordering of the folio lock and the transaction
  start, and update the comments accordingly.
- Fix some issues related to fast commit and to polluted post-EOF
  folios.
- Some minor code and comment cleanups.
v1: https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/
RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
Original Cover (Updated):
This series adds iomap buffered I/O path support for regular files.
It implements the core iomap APIs in ext4 and introduces two mount
options, 'buffered_iomap' and 'nobuffered_iomap', to enable and
disable the iomap buffered I/O path. This series supports the default
features, the default mount options and the bigalloc feature of ext4.
Online defragmentation, inline data, fs-verity, fscrypt, non-extent
based files and data=journal mode are not yet supported; ext4 falls
back to the buffer_head I/O path automatically when these features or
options are used.
Key notes on the iomap implementation in this series:
- Stop using the ordered data mode, which the buffer_head path relies
  on to avoid exposing stale data on append writes and partial block
  truncate down; the reasons are described in detail in patches 12-13.
- Override the dioread_nolock mount option and always allocate
  unwritten extents for new blocks, so a power failure during an
  append write cannot expose stale data.
- When performing writeback, do not use the reserved journal handle
  (it is only needed to resolve transaction deadlocks with ordered
  data writeback) and postpone updating i_disksize until the I/O is
  done.
- The lock ordering of the folio lock and the transaction start is
  the opposite of that in the buffer_head buffered write path (see
  the sketch after this list).
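
To make the last point concrete, here is a minimal sketch of the two
orderings. It is illustrative only and not code from this series: the
two functions and the 'needed_blocks' parameter are made up for the
example, while ext4_journal_start(), __filemap_get_folio(),
EXT4_HT_WRITE_PAGE and FGP_WRITEBEGIN are the existing kernel APIs.

/*
 * Illustrative sketch only: which lock is taken first on each path.
 * needed_blocks and the omitted error/cleanup handling are
 * assumptions for brevity.
 */

/* buffer_head write path: start the handle first, then lock the folio. */
static struct folio *bh_order_sketch(struct inode *inode, pgoff_t index,
                                     int needed_blocks)
{
        handle_t *handle;

        handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
        if (IS_ERR(handle))
                return ERR_CAST(handle);

        /* the folio lock is taken under the running handle; the handle
         * stays running until the matching ->write_end() */
        return __filemap_get_folio(inode->i_mapping, index, FGP_WRITEBEGIN,
                                   mapping_gfp_mask(inode->i_mapping));
}

/* iomap writeback path: the iomap core already holds the folio lock
 * when it calls back into ext4, which may then start a handle to
 * allocate blocks, i.e. folio lock first, transaction second. */
static handle_t *iomap_order_sketch(struct inode *inode,
                                    struct folio *locked_folio,
                                    int needed_blocks)
{
        WARN_ON_ONCE(!folio_test_locked(locked_folio));

        return ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
}

Because of this inversion, starting a nested transaction (for example
for ordered-mode data writeback) while the iomap core holds the folio
lock would risk deadlock, which is why this series drops the ordered
data mode on the iomap path.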
Series details:
Patch 01-08: Refactor the partial block zeroing operation, move it out of
             the active running journal transaction, and handle post-EOF
             partial block zeroing properly.
Patch 09-21: Implement the core iomap buffered read path, buffered write
             path, dirty folio writeback path, mmap path and partial block
             zeroing path for ext4 regular files.
Patch 22:    Introduce the 'buffered_iomap' and 'nobuffered_iomap' mount
             options to enable and disable the iomap buffered I/O path.
Tests and Performance:
I tested this series using xfstests-bld with the auto configuration, as
well as the fast_commit and 64k configurations, and observed no new
regressions.
I also ran fio on a virtual machine backed by a 150 GB memory-based
disk and saw an improvement of approximately 30% to 50% in large
buffered write performance, while read performance showed no
significant difference. In the tables below, 'bs' is the fio read/write
unit size (the filesystem block size is the default 4 KB); 'read cache'
means fio runs with preread=1 and reads cached data, while 'read data'
runs with preread=0 and always reads from the disk.
buffered write
==============
buffer_head:
bs write cache uncached write
1k 423 MiB/s 36.3 MiB/s
4k 1067 MiB/s 58.4 MiB/s
64k 4321 MiB/s 869 MiB/s
1M 4640 MiB/s 3158 MiB/s
iomap:
bs write cache uncached write
1k 403 MiB/s 57 MiB/s
4k 1093 MiB/s 61 MiB/s
64k 6488 MiB/s 1206 MiB/s
1M 7378 MiB/s 4818 MiB/s
buffered read
=============
buffer_head:
bs read hole read cache read data
1k 635 MiB/s 661 MiB/s 605 MiB/s
4k 1987 MiB/s 2128 MiB/s 1761 MiB/s
64k 6068 MiB/s 9472 MiB/s 4475 MiB/s
1M 5471 MiB/s 8657 MiB/s 4405 MiB/s
iomap:
bs read hole read cache read data
1k 643 MiB/s 653 MiB/s 602 MiB/s
4k 2075 MiB/s 2159 MiB/s 1716 MiB/s
64k 6267 MiB/s 9545 MiB/s 4451 MiB/s
1M 6072 MiB/s 9191 MiB/s 4467 MiB/s
Comments and suggestions are welcome!
Thanks,
Yi.
Zhang Yi (22):
ext4: make ext4_block_zero_page_range() pass out did_zero
ext4: make ext4_block_truncate_page() return zeroed length
ext4: only order data when partially block truncating down
ext4: factor out journalled block zeroing range
ext4: stop passing handle to ext4_journalled_block_zero_range()
ext4: don't zero partial block under an active handle when truncating
down
ext4: move ext4_block_zero_page_range() out of an active handle
ext4: zero post EOF partial block before appending write
ext4: add a new iomap aops for regular file's buffered IO path
ext4: implement buffered read iomap path
ext4: pass out extent seq counter when mapping da blocks
ext4: implement buffered write iomap path
ext4: implement writeback iomap path
ext4: implement mmap iomap path
iomap: correct the range of a partial dirty clear
iomap: support invalidating partial folios
ext4: implement partial block zero range iomap path
ext4: do not order data for inodes using buffered iomap path
ext4: add block mapping tracepoints for iomap buffered I/O path
ext4: disable online defrag when inode using iomap buffered I/O path
ext4: partially enable iomap for the buffered I/O path of regular
files
ext4: introduce a mount option for iomap buffered I/O path
fs/ext4/ext4.h | 21 +-
fs/ext4/ext4_jbd2.c | 1 +
fs/ext4/ext4_jbd2.h | 7 +-
fs/ext4/extents.c | 31 +-
fs/ext4/file.c | 40 +-
fs/ext4/ialloc.c | 1 +
fs/ext4/inode.c | 822 ++++++++++++++++++++++++++++++++----
fs/ext4/move_extent.c | 11 +
fs/ext4/page-io.c | 119 ++++++
fs/ext4/super.c | 32 +-
fs/iomap/buffered-io.c | 12 +-
include/trace/events/ext4.h | 45 ++
12 files changed, 1033 insertions(+), 109 deletions(-)
--
2.52.0
> Original Cover (Updated):

This really should always be first.  The updates are rather minor
compared to the overview that the cover letter provides.

> Key notes on the iomap implementations in this series.
> - Don't use ordered data mode to prevent exposing stale data when
>   performing append write and truncating down.

I can't parse this.

> - Override dioread_nolock mount option, always allocate unwritten
>   extents for new blocks.

Why do you override it?

> - When performing write back, don't use reserved journal handle and
>   postponing updating i_disksize until I/O is done.

Again missing the why and the implications.

> buffered write
> ==============
[... buffer_head and iomap write tables quoted from the cover letter ...]

This would read better if you actually compared buffer_head
vs iomap side by side.

What is the bs?  The read unit size?  I guess not the file system
block size as some of the values are too large for that.

Looks like iomap is faster, often much faster except for the
1k cached case, where it is slightly slower.  Do you have
any idea why?

> buffered read
> =============
[... buffer_head and iomap read tables quoted from the cover letter ...]

What is read cache vs read data here?

Otherwise same comments as for the write case.
Hi, Christoph!

On 2/3/2026 2:43 PM, Christoph Hellwig wrote:
>> Original Cover (Updated):
>
> This really should always be first.  The updates are rather minor
> compared to the overview that the cover letter provides.
>
>> Key notes on the iomap implementations in this series.
>> - Don't use ordered data mode to prevent exposing stale data when
>>   performing append write and truncating down.
>
> I can't parse this.

Thank you for looking into this series, and sorry for the lack of
clarity. The reasons behind these key notes are described in detail in
patches 12-13.

This note means that the ordered journal mode is no longer used in ext4
under the iomap infrastructure. The main reason is that iomap processes
each folio one by one during writeback: it first holds the folio lock
and then starts a transaction to create the block mapping. If we kept
using the ordered mode, we would need to perform data writeback from
the journal commit process, which may require starting a new
transaction and can potentially lead to deadlocks. In addition, the
ordered journal mode has many synchronization dependencies, which
increase the risk of deadlocks, and I believe this is one of the
reasons why ext4_do_writepages() is implemented in such a complicated
manner. Therefore, I think we need to give up the ordered data mode.

Currently, there are three scenarios where the ordered mode is used:
 1) append write,
 2) partial block truncate down, and
 3) online defragmentation.

For append writes, we can always allocate unwritten blocks to avoid
using the ordered journal mode. For a partial block truncate down, we
can explicitly perform a writeback. The third case is the only one that
is somewhat more complex: it needs the ordered mode to ensure the
atomicity of data copying and extent exchange when exchanging extents
and copying data between two files, preventing data loss. Considering
performance, we cannot explicitly perform a writeback for each extent
exchange, and I have not yet thought of a simple way to handle this.
It will require other solutions when we support online defragmentation
in the future.

>
>> - Override dioread_nolock mount option, always allocate unwritten
>>   extents for new blocks.
>
> Why do you override it?

There are two reasons. The first is the point above about not using
the ordered journal mode: to prevent exposing stale data after a power
failure that occurs while performing append writes, unwritten extents
are always requested for newly allocated blocks. The second is
writeback performance: when doing writeback, we should allocate blocks
for as long a range as possible on the first call to
->writeback_range() based on the writeback length, rather than mapping
each folio individually. To avoid the situation where more blocks are
allocated than are actually written (which could make fsck complain),
we cannot directly allocate written blocks before performing the
writeback.

>
>> - When performing write back, don't use reserved journal handle and
>>   postponing updating i_disksize until I/O is done.
>
> Again missing the why and the implications.

The reserved journal handle is used to resolve deadlocks caused by
transaction dependencies when writeback occurs in the ordered journal
mode. This mechanism is no longer necessary once the ordered mode is
not used.

>
>> buffered write
>> ==============
[... write tables snipped ...]
>
> This would read better if you actually compared buffer_head
> vs iomap side by side.
>
> What is the bs?  The read unit size?  I guess not the file system
> block size as some of the values are too large for that.

The 'bs' is the read/write unit size, and the fs block size is the
default 4KB.

>
> Looks like iomap is faster, often much faster except for the
> 1k cached case, where it is slightly slower.  Do you have
> any idea why?

I looked at the on-CPU flame graph. I think the main reason is that
the buffer_head write loop checks the folio and buffer_head state: the
first 1KB write of each 4KB folio saves the uptodate flag in the
buffer_head structure, so the remaining three writes do not need to
get blocks again. The iomap infrastructure, however, always calls
->iomap_begin() to acquire the mapping information for every 1KB
write. Although the first call to ->iomap_begin() has already
allocated the block extent, the subsequent calls still carry some
overhead due to synchronization operations such as locking. The
smaller the unit size, the greater the impact, and the impact is
larger for pure cached writes than for uncached writes.

>
>> buffered read
>> =============
[... read tables snipped ...]
>
> What is read cache vs read data here?

The 'read cache' case means that preread is set to 1 during the fio
tests, so it reads cached data. In contrast, in the 'read data' case
preread is set to 0, so it always reads data directly from the disk.

Thanks,
Yi.

> Otherwise same comments as for the write case.
On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
> [...]
> Currently, there are three scenarios where the ordered mode is used:
>  1) append write,
>  2) partial block truncate down, and
>  3) online defragmentation.
>
> For append writes, we can always allocate unwritten blocks to avoid
> using the ordered journal mode.

This is going to be a pretty severe performance regression, since it
means that we will be doubling the journal load for append writes.
What we really need to do here is to first write out the data blocks,
and then only start the transaction handle to modify the metadata
*after* the data blocks have been written (to heretofore unused blocks
that were just allocated).  It means inverting the order in which we
write data blocks for the append write case, and in fact it will
improve fsync() performance since we won't be gating writing the
commit block on the data blocks getting written out in the append
write case.

Cheers,

					- Ted
On 2026-02-03 21:14, Theodore Ts'o wrote:
> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>> [...]
>> For append writes, we can always allocate unwritten blocks to avoid
>> using the ordered journal mode.
>
> This is going to be a pretty severe performance regression, since it
> means that we will be doubling the journal load for append writes.
> What we really need to do here is to first write out the data blocks,
> and then only start the transaction handle to modify the metadata
> *after* the data blocks have been written (to heretofore unused
> blocks that were just allocated).
> [...]

I have some local demo patches doing something similar, and I think
this work could be decoupled from Yi's patch set.

Since inode preallocation (PA) maintains physical block occupancy with
a logical-to-physical mapping, and ensures on-disk data consistency
after power failure, it is an excellent location for recording
temporary occupancy. Furthermore, since inode PA often allocates more
blocks than requested, it can also help reduce file fragmentation.

The specific approach is as follows:

1. Allocate only the PA during block allocation without inserting it
   into the extent status tree. Return the PA to the caller and
   increment its refcount to prevent it from being discarded.

2. Issue IOs to the blocks within the inode PA. If the IO fails,
   release the refcount and return -EIO. If successful, proceed to the
   next step.

3. Start a handle upon successful IO completion to convert the inode
   PA to extents. Release the refcount and update the extent tree.

4. If a corresponding extent already exists, we'll need to punch holes
   to release the old extent before inserting the new one.

This ensures data atomicity, while jbd2, being a COW-like
implementation itself, ensures metadata atomicity. By leveraging this
"delay map" mechanism, we can achieve several benefits:

* Lightweight, high-performance COW.
* High-performance software atomic writes (hardware-independent).
* Replacing dioread_nolock, which might otherwise read unexpected
  zeros.
* Replacing the ordered data and data journal modes.
* Reduced handle hold time, as it is only held during extent tree
  updates.
* Paving the way for snapshot support.

Of course, COW itself can lead to severe file fragmentation, especially
in small-scale overwrite scenarios. Perhaps I've overlooked something.
What are your thoughts?

Regards,
Baokun
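
A minimal sketch of how the proposed flow above might look. Everything
here is an assumption for illustration: ext4_alloc_da_pa(),
ext4_submit_pa_io(), ext4_put_pa() and ext4_convert_pa_to_extents() are
hypothetical helpers that do not exist in the kernel today; only
ext4_journal_start()/ext4_journal_stop() and the EXT4_HT_MAP_BLOCKS /
EXT4_DATA_TRANS_BLOCKS definitions are the real ext4 APIs.

/*
 * Illustrative sketch of the proposed "delay map" write path; the
 * ext4_*_pa_* helpers are hypothetical placeholders, not real code.
 */
static int ext4_delay_map_write_sketch(struct inode *inode,
                                       ext4_lblk_t lblk, unsigned int len)
{
        struct ext4_prealloc_space *pa;
        handle_t *handle;
        int ret;

        /* 1. Allocate only a PA; don't touch the extent status tree yet.
         *    The returned PA holds a refcount so it can't be discarded. */
        pa = ext4_alloc_da_pa(inode, lblk, len);
        if (IS_ERR(pa))
                return PTR_ERR(pa);

        /* 2. Write the data into the blocks reserved by the PA. */
        ret = ext4_submit_pa_io(inode, pa, lblk, len);
        if (ret) {
                ext4_put_pa(pa);        /* drop the refcount on I/O failure */
                return ret;
        }

        /* 3. Only after the data is on disk, journal the metadata update
         *    that converts the PA into on-disk extents. */
        handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
                                    EXT4_DATA_TRANS_BLOCKS(inode->i_sb));
        if (IS_ERR(handle)) {
                ext4_put_pa(pa);
                return PTR_ERR(handle);
        }

        /* 4. For a COW-style overwrite, this step would also punch out
         *    the old extent before inserting the new one. */
        ret = ext4_convert_pa_to_extents(handle, inode, pa);

        ext4_put_pa(pa);
        ext4_journal_stop(handle);
        return ret;
}

The key property is that the journal handle is only started after the
data I/O has completed, so neither ordered data writeback nor unwritten
extent conversion is needed on this path.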
On Wed 04-02-26 09:59:36, Baokun Li wrote:
> [...]
> 4. If a corresponding extent already exists, we'll need to punch holes
>    to release the old extent before inserting the new one.

Sounds good. Just if I understand correctly case 4 would happen only if
you really try to do something like COW with this? Normally you'd just
use the already present blocks and write contents into them?

> This ensures data atomicity, while jbd2, being a COW-like
> implementation itself, ensures metadata atomicity. By leveraging this
> "delay map" mechanism, we can achieve several benefits:
> [...]
> Of course, COW itself can lead to severe file fragmentation, especially
> in small-scale overwrite scenarios.

I agree the feature can provide very interesting benefits and we were
pondering about something like that for a long time, just never got to
implementing it. I'd say the immediate benefits are you can completely
get rid of dioread_nolock as well as the legacy dioread_lock modes so
overall code complexity should not increase much. We could also mostly
get rid of data=ordered mode use (although not completely - see my
discussion with Zhang over patch 3) which would be also welcome
simplification. These benefits alone are IMO a good enough reason to
have the functionality :). Even without COW, atomic writes and other
fancy stuff.

I don't see how you want to get rid of data=journal mode - perhaps
that's related to the COW functionality?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2026-02-04 22:23, Jan Kara wrote:
> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>> [...]
>> 4. If a corresponding extent already exists, we'll need to punch holes
>>    to release the old extent before inserting the new one.
>
> Sounds good. Just if I understand correctly case 4 would happen only if
> you really try to do something like COW with this? Normally you'd just
> use the already present blocks and write contents into them?

Yes, case 4 only needs to be considered when implementing COW.

> I agree the feature can provide very interesting benefits and we were
> pondering about something like that for a long time, just never got to
> implementing it. [...] These benefits alone are IMO a good enough
> reason to have the functionality :). Even without COW, atomic writes
> and other fancy stuff.

Glad you liked the 'delay map' concept (naming suggestions are
welcome!).

With delay-map in place, implementing COW only requires handling
overwrite scenarios, and software atomic writes can be achieved by
enabling atomic delay-maps across multiple PAs. I expect to send out a
minimal RFC version for discussion in a few weeks.

I will share some additional thoughts regarding EOF blocks and
data=ordered mode in patch 3.

Thanks for your feedback!

> I don't see how you want to get rid of data=journal mode - perhaps
> that's related to the COW functionality?

Yes. The only real advantage of data=journal mode over data=ordered is
its guarantee of data atomicity for overwrites.

If we can achieve this through COW-based software atomic writes, we can
move away from the performance-heavy data=journal mode.

Cheers,
Baokun
On Thu 05-02-26 10:55:59, Baokun Li wrote:
> > I don't see how you want to get rid of data=journal mode - perhaps
> > that's related to the COW functionality?
>
> Yes. The only real advantage of data=journal mode over data=ordered is
> its guarantee of data atomicity for overwrites.
>
> If we can achieve this through COW-based software atomic writes, we can
> move away from the performance-heavy data=journal mode.

Hum, I don't think data=journal actually currently offers overwrite
atomicity - even in data=journal mode you can observe only part of the
write completed after a crash. But what it does offer and why people
tend to use it is that it offers strict linear ordering guarantees
between all data and metadata operations happening in the system. Thus
if you can prove that operation A completed before operation B started,
then you are guaranteed that even after crash you will not see B done
and A not done. This is a very strong consistency guarantee that makes
life simpler for the applications so people that can afford the
performance cost and care a lot about crash safety like it.

I think you are correct that with COW and a bit of care it could be
possible to achieve these guarantees even without journalling data. But
I'd need to think more about it.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2/4/2026 10:23 PM, Jan Kara wrote:
> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>> [...]
>
> I agree the feature can provide very interesting benefits and we were
> pondering about something like that for a long time, just never got to
> implementing it. [...] We could also mostly get rid of data=ordered
> mode use (although not completely - see my discussion with Zhang over
> patch 3) which would be also welcome simplification. These benefits
> alone are IMO a good enough reason to have the functionality :). Even
> without COW, atomic writes and other fancy stuff.

I suppose this feature can also be used to get rid of the data=ordered
mode use in online defragmentation. With this feature, perhaps we can
develop a new method of online defragmentation that eliminates the need
to pre-allocate a donor file. Instead, we can attempt to allocate as
many contiguous blocks as possible through the PA. If the allocated
length is longer than the original extent, we can perform the swap and
copy the data. Once the copy is complete, we can atomically construct a
new extent, then release the original blocks synchronously or
asynchronously, similar to a regular copy-on-write (COW) operation. How
does this sound?

Regards,
Yi.

> I don't see how you want to get rid of data=journal mode - perhaps
> that's related to the COW functionality?
>
> Honza
On Thu 05-02-26 10:06:11, Zhang Yi wrote:
> [...]
> I suppose this feature can also be used to get rid of the data=ordered
> mode use in online defragmentation. With this feature, perhaps we can
> develop a new method of online defragmentation that eliminates the need
> to pre-allocate a donor file. Instead, we can attempt to allocate as
> many contiguous blocks as possible through the PA. If the allocated
> length is longer than the original extent, we can perform the swap and
> copy the data. Once the copy is complete, we can atomically construct a
> new extent, then release the original blocks synchronously or
> asynchronously, similar to a regular copy-on-write (COW) operation. How
> does this sound?

Well, the reason why defragmentation uses the donor file is that there
can be a lot of policy in where and how the file is exactly placed
(e.g. you might want to place multiple files together). It was decided
it is too complex to implement these policies in the kernel so we've
offloaded the decision where the file is placed to userspace. Back at
those times we were also considering adding an interface to guide
allocation of blocks for a file so the userspace defragmenter could
prepare a donor file with the desired blocks. But then the interest in
defragmentation dropped (particularly due to advances in flash storage)
and so these ideas never materialized.

We might rethink the online defragmentation interface but at this point
I'm not sure we are ready to completely replace the idea of guiding the
block placement using a donor file...

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2/5/2026 8:58 PM, Jan Kara wrote:
> On Thu 05-02-26 10:06:11, Zhang Yi wrote:
>> [...]
>
> Well, the reason why defragmentation uses the donor file is that there
> can be a lot of policy in where and how the file is exactly placed
> (e.g. you might want to place multiple files together). It was decided
> it is too complex to implement these policies in the kernel so we've
> offloaded the decision where the file is placed to userspace. Back at
> those times we were also considering adding an interface to guide
> allocation of blocks for a file so the userspace defragmenter could
> prepare a donor file with the desired blocks.

Indeed, it is easier to implement different strategies through donor
files.

> But then the interest in defragmentation dropped (particularly due to
> advances in flash storage) and so these ideas never materialized.

As I understand it, defragmentation offers two primary benefits:
1. It improves the contiguity of file blocks, thereby enhancing
   read/write performance;
2. It reduces the overhead on the block allocator and the management of
   metadata.

As for the first point, indeed, this role has gradually diminished with
the development of flash memory devices. However, I believe the second
point is still very useful. For example, some of our customers have
scenarios involving large-capacity storage, where data is continuously
written in a cyclic manner. This keeps the disk space usage at a high
level for a long time, with a large number of both big and small files.
Over time, as fragmentation increases, the CPU usage of the mballoc
allocator rises significantly. Although this issue can be alleviated to
some extent through optimizations of the mballoc algorithm and the use
of other pre-allocation techniques, we still find online
defragmentation to be very necessary.

> We might rethink the online defragmentation interface but at this point
> I'm not sure we are ready to completely replace the idea of guiding the
> block placement using a donor file...

Yeah, we can rethink it when supporting online defragmentation for the
iomap path.

Cheers,
Yi.
On 2026-02-05 10:06, Zhang Yi wrote:
> [...]
> I suppose this feature can also be used to get rid of the data=ordered
> mode use in online defragmentation. With this feature, perhaps we can
> develop a new method of online defragmentation that eliminates the need
> to pre-allocate a donor file. Instead, we can attempt to allocate as
> many contiguous blocks as possible through the PA. If the allocated
> length is longer than the original extent, we can perform the swap and
> copy the data. Once the copy is complete, we can atomically construct a
> new extent, then release the original blocks synchronously or
> asynchronously, similar to a regular copy-on-write (COW) operation. How
> does this sound?

Good idea! This is much more efficient than allocating files first and
then swapping them.

While COW can exacerbate fragmentation, it can also be leveraged for
defragmentation. We could monitor the average extent length of files
within the kernel and add those that fall below a certain threshold to
a "pending defrag" list. Defragmentation could then be triggered at an
appropriate time. To ensure the effectiveness of the defrag process, we
could also set a minimum length requirement for inode PAs.

Cheers,
Baokun
Hi, Ted.

On 2/3/2026 9:14 PM, Theodore Ts'o wrote:
> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>> [...]
>> For append writes, we can always allocate unwritten blocks to avoid
>> using the ordered journal mode.
>
> This is going to be a pretty severe performance regression, since it
> means that we will be doubling the journal load for append writes.

Although this doubles the journal load compared to directly allocating
written blocks, I think it will not result in a significant performance
regression compared to the current append write process, as it is
consistent with the current default behavior now that dioread_nolock is
enabled by default.

> What we really need to do here is to first write out the data blocks,
> and then only start the transaction handle to modify the metadata
> *after* the data blocks have been written (to heretofore unused blocks
> that were just allocated).  It means inverting the order in which we
> write data blocks for the append write case, and in fact it will
> improve fsync() performance since we won't be gating writing the
> commit block on the data blocks getting written out in the append
> write case.

Yeah, thank you for the suggestion, I agree with you. We are planning
to implement this next, and Baokun is currently developing a POC. Our
current idea is to use the inode PA to pre-allocate blocks before doing
the writeback (the benefit of using the PA is that it avoids changes to
the on-disk metadata, and the pre-allocation operation can be kept
inside the mballoc allocator), and then map the extents as written
after the data has been written, which would avoid the journal overhead
of the unwritten allocations. At the same time, this could also lay the
foundation for future support of COW writes for reflinks.

Thanks,
Yi.