Currently, __ext4_block_zero_page_range() is called in the following
four cases to zero out the data in partial blocks:

1. Truncate down.
2. Truncate up.
3. Perform block allocation (e.g., fallocate) or append writes across a
   range extending beyond the end of the file (EOF).
4. Partial block punch hole.

If the default ordered data mode is used, __ext4_block_zero_page_range()
writes the zeroed data back to disk through the ordered data machinery
after zeroing out.

Among cases 1, 2 and 3 described above, only case 1 actually requires
this ordered write, assuming no one intentionally bypasses the file
system to write directly to the disk. When performing a truncate down
operation, ensuring that the data beyond the new EOF is zeroed out
before updating i_disksize is sufficient to prevent old data from being
exposed when the file is later extended. In other words, as long as the
on-disk data in case 1 is properly zeroed out, only the data in memory
needs to be zeroed out in cases 2 and 3, without requiring ordered data.

Case 4 does not require ordered data because the entire punch hole
operation does not provide atomicity guarantees. Therefore, it's safe to
move the ordered data operation from __ext4_block_zero_page_range() to
ext4_truncate().

It should be noted that after this change, we can only determine whether
to perform ordered data operations based on whether the target block has
been zeroed, rather than on the state of the buffer head. Consequently,
unnecessary ordered data operations may occur when truncating an
unwritten dirty block. However, this scenario is relatively rare, so the
overall impact is minimal.

This prepares for the conversion to the iomap infrastructure, since
iomap doesn't use ordered data mode and requires active writeback;
removing the ordered work here reduces the complexity of the conversion.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f856ea015263..20b60abcf777 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
folio_zero_range(folio, offset, length);
BUFFER_TRACE(bh, "zeroed end of block");
- if (ext4_should_journal_data(inode)) {
+ if (ext4_should_journal_data(inode))
err = ext4_dirty_journalled_data(handle, bh);
- } else {
+ else
mark_buffer_dirty(bh);
- /*
- * Only the written block requires ordered data to prevent
- * exposing stale data.
- */
- if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
- ext4_should_order_data(inode))
- err = ext4_jbd2_inode_add_write(handle, inode, from,
- length);
- }
if (!err && did_zero)
*did_zero = true;
@@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
goto out_trace;
}
- if (inode->i_size & (inode->i_sb->s_blocksize - 1))
- ext4_block_truncate_page(handle, mapping, inode->i_size);
+ if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
+ unsigned int zero_len;
+
+ zero_len = ext4_block_truncate_page(handle, mapping,
+ inode->i_size);
+ if (zero_len < 0) {
+ err = zero_len;
+ goto out_stop;
+ }
+ if (zero_len && !IS_DAX(inode) &&
+ ext4_should_order_data(inode)) {
+ err = ext4_jbd2_inode_add_write(handle, inode,
+ inode->i_size, zero_len);
+ if (err)
+ goto out_stop;
+ }
+ }
/*
* We add the inode to the orphan list, so that if this
--
2.52.0
Hi Zhang,
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20260202]
url: https://github.com/intel-lab-lkp/linux/commits/Zhang-Yi/ext4-make-ext4_block_zero_page_range-pass-out-did_zero/20260203-144244
base: next-20260202
patch link: https://lore.kernel.org/r/20260203062523.3869120-4-yi.zhang%40huawei.com
patch subject: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
config: riscv-randconfig-r071-20260204 (https://download.01.org/0day-ci/archive/20260204/202602041239.JFyNwVcg-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9b8addffa70cee5b2acc5454712d9cf78ce45710)
smatch version: v0.5.0-8994-gd50c5a4c
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602041239.JFyNwVcg-lkp@intel.com/
New smatch warnings:
fs/ext4/inode.c:4577 ext4_truncate() warn: unsigned 'zero_len' is never less than zero.
Old smatch warnings:
fs/ext4/inode.c:2651 mpage_prepare_extent_to_map() warn: missing error code 'err'
fs/ext4/inode.c:5129 check_igot_inode() warn: missing unwind goto?
vim +/zero_len +4577 fs/ext4/inode.c
4494
4495 /*
4496 * ext4_truncate()
4497 *
4498 * We block out ext4_get_block() block instantiations across the entire
4499 * transaction, and VFS/VM ensures that ext4_truncate() cannot run
4500 * simultaneously on behalf of the same inode.
4501 *
4502 * As we work through the truncate and commit bits of it to the journal there
4503 * is one core, guiding principle: the file's tree must always be consistent on
4504 * disk. We must be able to restart the truncate after a crash.
4505 *
4506 * The file's tree may be transiently inconsistent in memory (although it
4507 * probably isn't), but whenever we close off and commit a journal transaction,
4508 * the contents of (the filesystem + the journal) must be consistent and
4509 * restartable. It's pretty simple, really: bottom up, right to left (although
4510 * left-to-right works OK too).
4511 *
4512 * Note that at recovery time, journal replay occurs *before* the restart of
4513 * truncate against the orphan inode list.
4514 *
4515 * The committed inode has the new, desired i_size (which is the same as
4516 * i_disksize in this case). After a crash, ext4_orphan_cleanup() will see
4517 * that this inode's truncate did not complete and it will again call
4518 * ext4_truncate() to have another go. So there will be instantiated blocks
4519 * to the right of the truncation point in a crashed ext4 filesystem. But
4520 * that's fine - as long as they are linked from the inode, the post-crash
4521 * ext4_truncate() run will find them and release them.
4522 */
4523 int ext4_truncate(struct inode *inode)
4524 {
4525 struct ext4_inode_info *ei = EXT4_I(inode);
4526 unsigned int credits;
4527 int err = 0, err2;
4528 handle_t *handle;
4529 struct address_space *mapping = inode->i_mapping;
4530
4531 /*
4532 * There is a possibility that we're either freeing the inode
4533 * or it's a completely new inode. In those cases we might not
4534 * have i_rwsem locked because it's not necessary.
4535 */
4536 if (!(inode_state_read_once(inode) & (I_NEW | I_FREEING)))
4537 WARN_ON(!inode_is_locked(inode));
4538 trace_ext4_truncate_enter(inode);
4539
4540 if (!ext4_can_truncate(inode))
4541 goto out_trace;
4542
4543 if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC))
4544 ext4_set_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE);
4545
4546 if (ext4_has_inline_data(inode)) {
4547 int has_inline = 1;
4548
4549 err = ext4_inline_data_truncate(inode, &has_inline);
4550 if (err || has_inline)
4551 goto out_trace;
4552 }
4553
4554 /* If we zero-out tail of the page, we have to create jinode for jbd2 */
4555 if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
4556 err = ext4_inode_attach_jinode(inode);
4557 if (err)
4558 goto out_trace;
4559 }
4560
4561 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
4562 credits = ext4_chunk_trans_extent(inode, 1);
4563 else
4564 credits = ext4_blocks_for_truncate(inode);
4565
4566 handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
4567 if (IS_ERR(handle)) {
4568 err = PTR_ERR(handle);
4569 goto out_trace;
4570 }
4571
4572 if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
4573 unsigned int zero_len;
4574
4575 zero_len = ext4_block_truncate_page(handle, mapping,
4576 inode->i_size);
> 4577 if (zero_len < 0) {
4578 err = zero_len;
4579 goto out_stop;
4580 }
4581 if (zero_len && !IS_DAX(inode) &&
4582 ext4_should_order_data(inode)) {
4583 err = ext4_jbd2_inode_add_write(handle, inode,
4584 inode->i_size, zero_len);
4585 if (err)
4586 goto out_stop;
4587 }
4588 }
4589
4590 /*
4591 * We add the inode to the orphan list, so that if this
4592 * truncate spans multiple transactions, and we crash, we will
4593 * resume the truncate when the filesystem recovers. It also
4594 * marks the inode dirty, to catch the new size.
4595 *
4596 * Implication: the file must always be in a sane, consistent
4597 * truncatable state while each transaction commits.
4598 */
4599 err = ext4_orphan_add(handle, inode);
4600 if (err)
4601 goto out_stop;
4602
4603 ext4_fc_track_inode(handle, inode);
4604 ext4_check_map_extents_env(inode);
4605
4606 down_write(&EXT4_I(inode)->i_data_sem);
4607 ext4_discard_preallocations(inode);
4608
4609 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
4610 err = ext4_ext_truncate(handle, inode);
4611 else
4612 ext4_ind_truncate(handle, inode);
4613
4614 up_write(&ei->i_data_sem);
4615 if (err)
4616 goto out_stop;
4617
4618 if (IS_SYNC(inode))
4619 ext4_handle_sync(handle);
4620
4621 out_stop:
4622 /*
4623 * If this was a simple ftruncate() and the file will remain alive,
4624 * then we need to clear up the orphan record which we created above.
4625 * However, if this was a real unlink then we were called by
4626 * ext4_evict_inode(), and we allow that function to clean up the
4627 * orphan info for us.
4628 */
4629 if (inode->i_nlink)
4630 ext4_orphan_del(handle, inode);
4631
4632 inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
4633 err2 = ext4_mark_inode_dirty(handle, inode);
4634 if (unlikely(err2 && !err))
4635 err = err2;
4636 ext4_journal_stop(handle);
4637
4638 out_trace:
4639 trace_ext4_truncate_exit(inode);
4640 return err;
4641 }
4642
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Tue 03-02-26 14:25:03, Zhang Yi wrote:
> Currently, __ext4_block_zero_page_range() is called in the following
> four cases to zero out the data in partial blocks:
>
> 1. Truncate down.
> 2. Truncate up.
> 3. Perform block allocation (e.g., fallocate) or append writes across a
> range extending beyond the end of the file (EOF).
> 4. Partial block punch hole.
>
> If the default ordered data mode is used, __ext4_block_zero_page_range()
> will write back the zeroed data to the disk through the order mode after
> zeroing out.
>
> Among the cases 1,2 and 3 described above, only case 1 actually requires
> this ordered write. Assuming no one intentionally bypasses the file
> system to write directly to the disk. When performing a truncate down
> operation, ensuring that the data beyond the EOF is zeroed out before
> updating i_disksize is sufficient to prevent old data from being exposed
> when the file is later extended. In other words, as long as the on-disk
> data in case 1 can be properly zeroed out, only the data in memory needs
> to be zeroed out in cases 2 and 3, without requiring ordered data.
Hum, I'm not sure this is correct. The tail block of the file is not
necessarily zeroed out beyond EOF (as mmap writes can race with page
writeback and modify the tail block contents beyond EOF before we really
submit it to the device). Thus after this commit if you truncate up, just
zero out the newly exposed contents in the page cache and dirty it, then
the transaction with the i_disksize update commits (I see nothing
preventing it) and then you crash, you can observe file with the new size
but non-zero content in the newly exposed area. Am I missing something?
> Case 4 does not require ordered data because the entire punch hole
> operation does not provide atomicity guarantees. Therefore, it's safe to
> move the ordered data operation from __ext4_block_zero_page_range() to
> ext4_truncate().
I agree hole punching can already expose intermediate results in case of
crash so there removing the ordered mode handling is safe.
Honza
> It should be noted that after this change, we can only determine whether
> to perform ordered data operations based on whether the target block has
> been zeroed, rather than on the state of the buffer head. Consequently,
> unnecessary ordered data operations may occur when truncating an
> unwritten dirty block. However, this scenario is relatively rare, so the
> overall impact is minimal.
>
> This is prepared for the conversion to the iomap infrastructure since it
> doesn't use ordered data mode and requires active writeback, which
> reduces the complexity of the conversion.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/inode.c | 32 +++++++++++++++++++-------------
> 1 file changed, 19 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f856ea015263..20b60abcf777 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
> folio_zero_range(folio, offset, length);
> BUFFER_TRACE(bh, "zeroed end of block");
>
> - if (ext4_should_journal_data(inode)) {
> + if (ext4_should_journal_data(inode))
> err = ext4_dirty_journalled_data(handle, bh);
> - } else {
> + else
> mark_buffer_dirty(bh);
> - /*
> - * Only the written block requires ordered data to prevent
> - * exposing stale data.
> - */
> - if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
> - ext4_should_order_data(inode))
> - err = ext4_jbd2_inode_add_write(handle, inode, from,
> - length);
> - }
> if (!err && did_zero)
> *did_zero = true;
>
> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
> goto out_trace;
> }
>
> - if (inode->i_size & (inode->i_sb->s_blocksize - 1))
> - ext4_block_truncate_page(handle, mapping, inode->i_size);
> + if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
> + unsigned int zero_len;
> +
> + zero_len = ext4_block_truncate_page(handle, mapping,
> + inode->i_size);
> + if (zero_len < 0) {
> + err = zero_len;
> + goto out_stop;
> + }
> + if (zero_len && !IS_DAX(inode) &&
> + ext4_should_order_data(inode)) {
> + err = ext4_jbd2_inode_add_write(handle, inode,
> + inode->i_size, zero_len);
> + if (err)
> + goto out_stop;
> + }
> + }
>
> /*
> * We add the inode to the orphan list, so that if this
> --
> 2.52.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
Hi, Jan!
On 2/3/2026 5:59 PM, Jan Kara wrote:
> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>> Currently, __ext4_block_zero_page_range() is called in the following
>> four cases to zero out the data in partial blocks:
>>
>> 1. Truncate down.
>> 2. Truncate up.
>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>> range extending beyond the end of the file (EOF).
>> 4. Partial block punch hole.
>>
>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>> will write back the zeroed data to the disk through the order mode after
>> zeroing out.
>>
>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>> this ordered write. Assuming no one intentionally bypasses the file
>> system to write directly to the disk. When performing a truncate down
>> operation, ensuring that the data beyond the EOF is zeroed out before
>> updating i_disksize is sufficient to prevent old data from being exposed
>> when the file is later extended. In other words, as long as the on-disk
>> data in case 1 can be properly zeroed out, only the data in memory needs
>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>
> Hum, I'm not sure this is correct. The tail block of the file is not
> necessarily zeroed out beyond EOF (as mmap writes can race with page
> writeback and modify the tail block contents beyond EOF before we really
> submit it to the device). Thus after this commit if you truncate up, just
> zero out the newly exposed contents in the page cache and dirty it, then
> the transaction with the i_disksize update commits (I see nothing
> preventing it) and then you crash, you can observe file with the new size
> but non-zero content in the newly exposed area. Am I missing something?
>
Well, I think you are right! I missed the mmap write race condition that
can happen while writeback is submitting I/O. Thank you a lot for
pointing this out. I thought of two possible solutions:

1. We also add explicit writeback operations to the truncate-up and
   post-EOF append writes. This solution is the most straightforward but
   may cause some performance overhead. However, since at most only one
   block is written, the impact is likely limited. Additionally, I
   observed that the XFS file system adopts a similar approach in its
   truncate up and down operations. (But it is somewhat strange that XFS
   also appears to have the same issue with post-EOF append writes; it
   only zeroes out the partial block in xfs_file_write_checks(), and it
   neither explicitly writes back the zeroed data nor employs any other
   mechanism to ensure that the zeroed data reaches disk before the
   metadata is written.)

2. Resolve this race condition, ensuring that there is no non-zero data
   in the post-EOF partial blocks on the disk. I observed that after
   writeback holds the folio lock and calls folio_clear_dirty_for_io(),
   mmap writes will re-trigger the page fault. Perhaps we can filter out
   the EOF folio based on i_size in ext4_page_mkwrite(),
   block_page_mkwrite() and iomap_page_mkwrite(), and then call
   folio_wait_writeback() to wait for this partial folio's writeback to
   complete. This seems able to break the race condition without
   introducing too much overhead (no?).
What do you think? Any other suggestions are also welcome.
Thanks,
Yi.
>> Case 4 does not require ordered data because the entire punch hole
>> operation does not provide atomicity guarantees. Therefore, it's safe to
>> move the ordered data operation from __ext4_block_zero_page_range() to
>> ext4_truncate().
>
> I agree hole punching can already expose intermediate results in case of
> crash so there removing the ordered mode handling is safe.
>
> Honza
>
>> It should be noted that after this change, we can only determine whether
>> to perform ordered data operations based on whether the target block has
>> been zeroed, rather than on the state of the buffer head. Consequently,
>> unnecessary ordered data operations may occur when truncating an
>> unwritten dirty block. However, this scenario is relatively rare, so the
>> overall impact is minimal.
>>
>> This is prepared for the conversion to the iomap infrastructure since it
>> doesn't use ordered data mode and requires active writeback, which
>> reduces the complexity of the conversion.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> fs/ext4/inode.c | 32 +++++++++++++++++++-------------
>> 1 file changed, 19 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index f856ea015263..20b60abcf777 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>> folio_zero_range(folio, offset, length);
>> BUFFER_TRACE(bh, "zeroed end of block");
>>
>> - if (ext4_should_journal_data(inode)) {
>> + if (ext4_should_journal_data(inode))
>> err = ext4_dirty_journalled_data(handle, bh);
>> - } else {
>> + else
>> mark_buffer_dirty(bh);
>> - /*
>> - * Only the written block requires ordered data to prevent
>> - * exposing stale data.
>> - */
>> - if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
>> - ext4_should_order_data(inode))
>> - err = ext4_jbd2_inode_add_write(handle, inode, from,
>> - length);
>> - }
>> if (!err && did_zero)
>> *did_zero = true;
>>
>> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
>> goto out_trace;
>> }
>>
>> - if (inode->i_size & (inode->i_sb->s_blocksize - 1))
>> - ext4_block_truncate_page(handle, mapping, inode->i_size);
>> + if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
>> + unsigned int zero_len;
>> +
>> + zero_len = ext4_block_truncate_page(handle, mapping,
>> + inode->i_size);
>> + if (zero_len < 0) {
>> + err = zero_len;
>> + goto out_stop;
>> + }
>> + if (zero_len && !IS_DAX(inode) &&
>> + ext4_should_order_data(inode)) {
>> + err = ext4_jbd2_inode_add_write(handle, inode,
>> + inode->i_size, zero_len);
>> + if (err)
>> + goto out_stop;
>> + }
>> + }
>>
>> /*
>> * We add the inode to the orphan list, so that if this
>> --
>> 2.52.0
>>
Hi Zhang!

On Wed 04-02-26 14:42:46, Zhang Yi wrote:
> On 2/3/2026 5:59 PM, Jan Kara wrote:
> > On Tue 03-02-26 14:25:03, Zhang Yi wrote:
[...]
> Well, I think you are right! I missed the mmap write race condition that
> happens during the writeback submitting I/O. Thank you a lot for pointing
> this out. I thought of two possible solutions:
>
> 1. We also add explicit writeback operations to the truncate-up and
>    post-EOF append writes.
[...]
> 2. Resolve this race condition, ensure that there are no non-zero data
>    in the post-EOF partial blocks on the disk.
[...]
> What do you think? Any other suggestions are also welcome.

Hum, I like the option 2 because IMO non-zero data beyond EOF is a
corner-case quirk which unnecessarily complicates rather common paths. But
I'm not sure we can easily get rid of it. It can happen for example when
you do appending write inside a block. The page is written back but before
the transaction with i_disksize update commits we crash. Then again we have
a non-zero content inside the block beyond EOF.

So the only realistic option I see is to ensure tail of the block gets
zeroed on disk before the transaction with i_disksize update commits in the
cases of truncate up or write beyond EOF. data=ordered mode machinery is an
asynchronous way how to achieve this. We could also just synchronously
writeback the block where needed but the latency hit of such operation is
going to be significant so I'm quite sure some workload somewhere will
notice although the truncate up / write beyond EOF operations triggering
this are not too common.

So why do you need to get rid of these data=ordered mode usages? I guess
because with iomap keeping our transaction handle -> folio lock ordering is
complicated? Last time I looked it seemed still possible to keep it though.

Another possibility would be to just *submit* the write synchronously and
use data=ordered mode machinery only to wait for IO to complete before the
transaction commits. That way it should be safe to start a transaction
while holding folio lock and thus the iomap conversion would be easier.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2/4/2026 10:18 PM, Jan Kara wrote:
> On Wed 04-02-26 14:42:46, Zhang Yi wrote:
[...]
> Hum, I like the option 2 because IMO non-zero data beyond EOF is a
> corner-case quirk which unnecessarily complicates rather common paths. But
> I'm not sure we can easily get rid of it. It can happen for example when
> you do appending write inside a block. The page is written back but before
> the transaction with i_disksize update commits we crash. Then again we have
> a non-zero content inside the block beyond EOF.

Yes, indeed. From this perspective, it seems difficult to avoid non-zero
content within the block beyond the EOF.

> So the only realistic option I see is to ensure tail of the block gets
> zeroed on disk before the transaction with i_disksize update commits in the
> cases of truncate up or write beyond EOF. data=ordered mode machinery is an
> asynchronous way how to achieve this. We could also just synchronously
> writeback the block where needed but the latency hit of such operation is
> going to be significant so I'm quite sure some workload somewhere will
> notice although the truncate up / write beyond EOF operations triggering
> this are not too common.

Yes, I agree.

> So why do you need to get rid of these data=ordered mode usages? I guess
> because with iomap keeping our transaction handle -> folio lock ordering is
> complicated? Last time I looked it seemed still possible to keep it though.

Yes, that's one reason. Another reason is that we would also need to
implement partial folio submission for iomap.

When the journal process is waiting for a folio to be written back (which
contains an ordered block), and the folio also contains unmapped blocks
with a block size smaller than the folio size, a deadlock may occur while
mapping the remaining unmapped blocks if the regular writeback process has
already started committing this folio (and set the writeback flag). This
is because the writeback flag is cleared only after the entire folio is
processed and committed. If we want to support partial folio submission
for iomap, we need to be careful to prevent adding additional performance
overhead in the case of severe fragmentation. Therefore, this aspect of
the logic is complicated and subtle. As we discussed in patch 0, if we can
avoid using the data=ordered mode in append write and online defrag, then
this would be the only remaining corner case. I'm not sure if it is worth
implementing this and adjusting the lock ordering.

> Another possibility would be to just *submit* the write synchronously and
> use data=ordered mode machinery only to wait for IO to complete before the
> transaction commits. That way it should be safe to start a transaction
> while holding folio lock and thus the iomap conversion would be easier.

IIUC, this solution seems able to avoid adjusting the lock ordering, but
partial folio submission still needs to be implemented, is my
understanding right? This is because although we have already submitted
this zeroed partial EOF block, when the journal process is waiting for
this folio, the folio is being written back, and there are other blocks in
this folio that still need to be mapped.

Cheers,
Yi.
On Thu 05-02-26 15:50:38, Zhang Yi wrote:
> On 2/4/2026 10:18 PM, Jan Kara wrote:
> > So why do you need to get rid of these data=ordered
> > mode usages? I guess because with iomap keeping our transaction handle ->
> > folio lock ordering is complicated? Last time I looked it seemed still
> > possible to keep it though.
>
> Yes, that's one reason. There's another reason is that we also need to
> implement partial folio submits for iomap.
>
> When the journal process is waiting for a folio to be written back
> (which contains an ordered block), and the folio also contains unmapped
> blocks with a block size smaller than the folio size, if the regular
> writeback process has already started committing this folio (and set the
> writeback flag), then a deadlock may occur while mapping the remaining
> unmapped blocks. This is because the writeback flag is cleared only
> after the entire folio are processed and committed. If we want to support
> partial folio submit for iomap, we need to be careful to prevent adding
> additional performance overhead in the case of severe fragmentation.

Yeah, this logic is currently handled by ext4_bio_write_folio(). And the
deadlocks are currently resolved by grabbing transaction handle before we
go and lock any page for writeback. But I agree that with iomap it may be
tricky to keep this scheme.

> Therefore, this aspect of the logic is complicated and subtle. As we
> discussed in patch 0, if we can avoid using the data=ordered mode in
> append write and online defrag, then this would be the only remaining
> corner case. I'm not sure if it is worth implementing this and adjusting
> the lock ordering.
>
> > Another possibility would be to just *submit* the write synchronously and
> > use data=ordered mode machinery only to wait for IO to complete before the
> > transaction commits. That way it should be safe to start a transaction
>
> IIUC, this solution seems can avoid adjusting the lock ordering, but partial
> folio submission still needs to be implemented, is my understanding right?
> This is because although we have already submitted this zeroed partial EOF
> block, when the journal process is waiting for this folio, this folio is
> being written back, and there are other blocks in this folio that need to be
> mapped.

That's a good question. If we submit the tail folio from truncation code,
we could just submit the full folio write and there's no need to restrict
ourselves only to mapped blocks. But you are correct that if this IO
completes but the folio had holes in it and the hole gets filled in by
write before the transaction with i_disksize update commits, jbd2 commit
could still race with flush worker writing this folio again and the
deadlock could happen. Hrm...

So how about the following: We expand our io_end processing with the
ability to journal i_disksize updates after page writeback completes. Then
when doing truncate up or appending writes, we keep i_disksize at the old
value and just zero folio tails in the page cache, mark the folio dirty and
update i_size. When submitting writeback for a folio beyond current
i_disksize we make sure writepages submits IO for all the folios from
current i_disksize upwards. When io_end processing happens after completed
folio writeback, we update i_disksize to min(i_size, end of IO). This
should take care of non-zero data exposure issues and with "delay map"
processing Baokun works on all the inode metadata updates will happen after
IO completion anyway so it will be nicely batched up in one transaction.
It's a big change but so far I think it should work. What do you think?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
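To make the proposed completion rule concrete, here is a toy Python model (illustrative only, not kernel code; all names are invented) of advancing i_disksize to min(i_size, end of IO) on each completed writeback. It also shows why the naive form of the rule needs refinement, as debated in the replies: an out-of-order completion can advance i_disksize past a folio whose zeroed tail has not been written back yet.

```python
def complete_io(inode, io_start, io_end):
    """Naive io_end rule: advance i_disksize to min(i_size, end of IO)."""
    new_disksize = min(inode["i_size"], io_end)
    if new_disksize > inode["i_disksize"]:
        inode["i_disksize"] = new_disksize

# Append writes extend i_size from 2k to 12k; folios are 4k.
inode = {"i_size": 12 * 1024, "i_disksize": 2 * 1024}

# Folio 2 (8k-12k) completes before the tail folio (0k-4k): the naive
# rule advances i_disksize to 12k although the zeroed tail in 2k-4k has
# not been written back yet.
complete_io(inode, 8 * 1024, 12 * 1024)
print(inode["i_disksize"])  # 12288
```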
On 2/5/2026 11:05 PM, Jan Kara wrote:
> On Thu 05-02-26 15:50:38, Zhang Yi wrote:
>> On 2/4/2026 10:18 PM, Jan Kara wrote:
>>> So why do you need to get rid of these data=ordered
>>> mode usages? I guess because with iomap keeping our transaction handle ->
>>> folio lock ordering is complicated? Last time I looked it seemed still
>>> possible to keep it though.
>>
>> Yes, that's one reason. There's another reason is that we also need to
>> implement partial folio submits for iomap.
>>
>> When the journal process is waiting for a folio to be written back
>> (which contains an ordered block), and the folio also contains unmapped
>> blocks with a block size smaller than the folio size, if the regular
>> writeback process has already started committing this folio (and set the
>> writeback flag), then a deadlock may occur while mapping the remaining
>> unmapped blocks. This is because the writeback flag is cleared only
>> after the entire folio are processed and committed. If we want to support
>> partial folio submit for iomap, we need to be careful to prevent adding
>> additional performance overhead in the case of severe fragmentation.
>
> Yeah, this logic is currently handled by ext4_bio_write_folio(). And the
> deadlocks are currently resolved by grabbing transaction handle before we
> go and lock any page for writeback. But I agree that with iomap it may be
> tricky to keep this scheme.
>
>> Therefore, this aspect of the logic is complicated and subtle. As we
>> discussed in patch 0, if we can avoid using the data=ordered mode in
>> append write and online defrag, then this would be the only remaining
>> corner case. I'm not sure if it is worth implementing this and adjusting
>> the lock ordering.
>>
>>> Another possibility would be to just *submit* the write synchronously and
>>> use data=ordered mode machinery only to wait for IO to complete before the
>>> transaction commits. That way it should be safe to start a transaction
>>
>> IIUC, this solution seems can avoid adjusting the lock ordering, but partial
>> folio submission still needs to be implemented, is my understanding right?
>> This is because although we have already submitted this zeroed partial EOF
>> block, when the journal process is waiting for this folio, this folio is
>> being written back, and there are other blocks in this folio that need to be
>> mapped.
>
> That's a good question. If we submit the tail folio from truncation code,
> we could just submit the full folio write and there's no need to restrict
> ourselves only to mapped blocks. But you are correct that if this IO
> completes but the folio had holes in it and the hole gets filled in by
> write before the transaction with i_disksize update commits, jbd2 commit
> could still race with flush worker writing this folio again and the
> deadlock could happen. Hrm...
>
Yes!
> So how about the following:
Let me see, please correct me if my understanding is wrong, and there are
also some points I don't get.
> We expand our io_end processing with the
> ability to journal i_disksize updates after page writeback completes. Then
> when doing truncate up or appending writes, we keep i_disksize at the old
> value and just zero folio tails in the page cache, mark the folio dirty and
> update i_size.
I think we need to submit this zeroed folio here as well. Because,
1) In the case of truncate up, if we don't submit, the i_disksize may have to
wait a long time (until the folio writeback is complete, which takes about
30 seconds by default) before being updated, which is too long.
2) In the case of appending writes. Assume that the folio written beyond this
one is written back first, we have to wait this zeroed folio to be write
back and then update i_disksize, so we can't wait too long either.
Right?
> When submitting writeback for a folio beyond current
> i_disksize we make sure writepages submits IO for all the folios from
> current i_disksize upwards.
Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
> When io_end processing happens after completed
> folio writeback, we update i_disksize to min(i_size, end of IO).
Yeah, in the case of append write back. Assume we append write the folio 2
and folio 3,
old_idisksize  new_isize
     |             |
  [WWZZ][WWWW][WWWW]
   1 |    2     3
     A
Assume that folio 1 first completes the writeback, then we update i_disksize
to pos A when the writeback is complete. Assume that folio 2 or 3 completes
first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
something like) folio 1 to complete and then update i_disksize to new_isize.
But in the case of truncate up, We will only write back this zeroed folio. If
the new i_size exceeds the end of this folio, how should we update i_disksize
to the correct value?
For example, we truncate the file from old_idisksize to new_isize, but we
only zero and writeback folio 1, in the end_io processing of folio 1, we can
only update the i_disksize to A, but we can never update it to new_isize. Am
I missing something ?
old_idisksize  new_isize
     |             |
  [WWZZ]...hole ...
   1 |
     A
> This
> should take care of non-zero data exposure issues and with "delay map"
> processing Baokun works on all the inode metadata updates will happen after
> IO completion anyway so it will be nicely batched up in one transaction.
Currently, my iomap convert implementation always enables dioread_nolock,
so I feel that this solution can be achieved even without the "delay map"
feature. After we have the "delay map", we can extend this to the
buffer_head path.
Thanks,
Yi.
> It's a big change but so far I think it should work. What do you think?
>
> Honza
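The waiting scheme Yi describes above (a completion for folio 2 or 3 must not advance i_disksize until folio 1, the tail folio, finishes writeback) can be modelled in a few lines of Python. This is a sketch only, not kernel code, and the names are invented:

```python
def complete_io(inode, io_start, io_end):
    """Record a completed writeback IO [io_start, io_end) and advance
    i_disksize only through a contiguous run of completed IO, so the
    tail folio (the one holding the old i_disksize) gates the update."""
    inode["completed"].append((io_start, io_end))
    for start, end in sorted(inode["completed"]):
        if start <= inode["i_disksize"]:
            inode["i_disksize"] = min(inode["i_size"],
                                      max(inode["i_disksize"], end))

# Append writes to folios 2 and 3 raise i_size from 2k to 12k.
inode = {"i_size": 12 * 1024, "i_disksize": 2 * 1024, "completed": []}

complete_io(inode, 4 * 1024, 12 * 1024)   # folios 2+3 finish first
assert inode["i_disksize"] == 2 * 1024    # still waiting for folio 1

complete_io(inode, 0, 4 * 1024)           # tail folio 1 finishes
assert inode["i_disksize"] == 12 * 1024   # now advance to new i_size
```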
On Fri 06-02-26 19:09:53, Zhang Yi wrote:
> On 2/5/2026 11:05 PM, Jan Kara wrote:
> > So how about the following:
>
> Let me see, please correct me if my understanding is wrong, and there are
> also some points I don't get.
>
> > We expand our io_end processing with the
> > ability to journal i_disksize updates after page writeback completes. Then
> > when doing truncate up or appending writes, we keep i_disksize at the old
> > value and just zero folio tails in the page cache, mark the folio dirty and
> > update i_size.
>
> I think we need to submit this zeroed folio here as well. Because,
>
> 1) In the case of truncate up, if we don't submit, the i_disksize may have to
>    wait a long time (until the folio writeback is complete, which takes about
>    30 seconds by default) before being updated, which is too long.

Correct but I'm not sure it matters. Current delalloc writes behave in the
same way already. For simplicity I'd thus prefer to not treat truncate up
in a special way but if we decide this indeed desirable, we can either
submit the tail folio immediately, or schedule work with earlier writeback.

> 2) In the case of appending writes. Assume that the folio written beyond this
>    one is written back first, we have to wait this zeroed folio to be write
>    back and then update i_disksize, so we can't wait too long either.

Correct, update of i_disksize after writeback of folios beyond current
i_disksize is blocked by the writeback of the tail folio.

> > When submitting writeback for a folio beyond current
> > i_disksize we make sure writepages submits IO for all the folios from
> > current i_disksize upwards.
>
> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.

I was worried about a case like:

We have 4k blocksize, file is i_disksize 2k. Now you do:
pwrite(file, buf, 1, 6k);
pwrite(file, buf, 1, 10k);
pwrite(file, buf, 1, 14k);

The pwrite at offset 6k needs to zero the tail of the folio with index 0,
pwrite at 10k needs to zero the tail of the folio with index 1, etc. And
for us to safely advance i_disksize to 14k+1, I thought all the folios (and
zeroed out tails) need to be written out. But that's actually not the case.
We need to make sure the zeroed tail is written out only if the underlying
block is already allocated and marked as written at the time of zeroing.
And the blocks underlying intermediate i_size values will never be allocated
and written without advancing i_disksize to them. So I think you're
correct, we always have at most one tail folio - the one surrounding
current i_disksize - which needs to be written out to safely advance
i_disksize and we don't care about folios inbetween.

> > When io_end processing happens after completed
> > folio writeback, we update i_disksize to min(i_size, end of IO).
>
> Yeah, in the case of append write back. Assume we append write the folio 2
> and folio 3,
>
> old_idisksize  new_isize
>      |             |
>   [WWZZ][WWWW][WWWW]
>    1 |    2     3
>      A
>
> Assume that folio 1 first completes the writeback, then we update i_disksize
> to pos A when the writeback is complete. Assume that folio 2 or 3 completes
> first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
> something like) folio 1 to complete and then update i_disksize to new_isize.
>
> But in the case of truncate up, We will only write back this zeroed folio. If
> the new i_size exceeds the end of this folio, how should we update i_disksize
> to the correct value?
>
> For example, we truncate the file from old_idisksize to new_isize, but we
> only zero and writeback folio 1, in the end_io processing of folio 1, we can
> only update the i_disksize to A, but we can never update it to new_isize. Am
> I missing something ?
>
> old_idisksize  new_isize
>      |             |
>   [WWZZ]...hole ...
>    1 |
>      A

Good question. Based on the analysis above one option would be to setup
writeback of page straddling current i_disksize to update i_disksize to
current i_size on completion. That would be simple but would have an
unpleasant side effect that in case of a crash after append write we could
see increased i_disksize but zeros instead of written data. Another option
would be to update i_disksize on completion to the beginning of the first
dirty folio behind the written back range or i_size if there's no such
folio. This would still be relatively simple and mostly deal with "zeros
instead of data" problem.

> > This
> > should take care of non-zero data exposure issues and with "delay map"
> > processing Baokun works on all the inode metadata updates will happen after
> > IO completion anyway so it will be nicely batched up in one transaction.
>
> Currently, my iomap convert implementation always enables dioread_nolock,

Yes, BTW I think you could remove no-dioread_nolock paths before doing the
conversion to simplify matters a bit. I don't think it's seriously used
anywhere anymore.

> so I feel that this solution can be achieved even without the "delay map"
> feature. After we have the "delay map", we can extend this to the
> buffer_head path.

I agree, delay map is not necessary for this to work. But it will make
things likely faster.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
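The second option above reduces to a small pure function. The following Python sketch (illustrative only; names invented) picks the start of the first still-dirty folio at or beyond the written-back range, falling back to i_size for the truncate-up case where no later folio is dirty:

```python
def disksize_after_writeback(written_end, dirty_folio_starts, i_size):
    """On writeback completion, return the new i_disksize: the start of
    the first still-dirty folio at or beyond the written-back range, or
    i_size when there is no such folio (e.g. after truncate up)."""
    later = [s for s in dirty_folio_starts if s >= written_end]
    return min(min(later), i_size) if later else i_size

# Truncate up: only the zeroed tail folio (0-4k) was written back and
# nothing beyond it is dirty, so i_disksize can go straight to new i_size.
assert disksize_after_writeback(4096, [], 20480) == 20480

# Append write: folios at 4k and 8k are still dirty when the tail folio
# completes, so i_disksize only advances to the first dirty folio.
assert disksize_after_writeback(4096, [4096, 8192], 12288) == 4096
```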
On 2/6/2026 11:35 PM, Jan Kara wrote:
> On Fri 06-02-26 19:09:53, Zhang Yi wrote:
>> On 2/5/2026 11:05 PM, Jan Kara wrote:
>>> So how about the following:
>>
>> Let me see, please correct me if my understanding is wrong, and there are
>> also some points I don't get.
>>
>>> We expand our io_end processing with the
>>> ability to journal i_disksize updates after page writeback completes. Then
>>> when doing truncate up or appending writes, we keep i_disksize at the old
>>> value and just zero folio tails in the page cache, mark the folio dirty and
>>> update i_size.
>>
>> I think we need to submit this zeroed folio here as well. Because,
>>
>> 1) In the case of truncate up, if we don't submit, the i_disksize may have to
>>    wait a long time (until the folio writeback is complete, which takes about
>>    30 seconds by default) before being updated, which is too long.
>
> Correct but I'm not sure it matters. Current delalloc writes behave in the
> same way already. For simplicity I'd thus prefer to not treat truncate up
> in a special way but if we decide this indeed desirable, we can either
> submit the tail folio immediately, or schedule work with earlier writeback.
>
>> 2) In the case of appending writes. Assume that the folio written beyond this
>>    one is written back first, we have to wait this zeroed folio to be write
>>    back and then update i_disksize, so we can't wait too long either.
>
> Correct, update of i_disksize after writeback of folios beyond current
> i_disksize is blocked by the writeback of the tail folio.
>
>>> When submitting writeback for a folio beyond current
>>> i_disksize we make sure writepages submits IO for all the folios from
>>> current i_disksize upwards.
>>
>> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
>
> I was worried about a case like:
>
> We have 4k blocksize, file is i_disksize 2k. Now you do:
> pwrite(file, buf, 1, 6k);
> pwrite(file, buf, 1, 10k);
> pwrite(file, buf, 1, 14k);
>
> The pwrite at offset 6k needs to zero the tail of the folio with index 0,
> pwrite at 10k needs to zero the tail of the folio with index 1, etc. And
> for us to safely advance i_disksize to 14k+1, I thought all the folios (and
> zeroed out tails) need to be written out. But that's actually not the case.
> We need to make sure the zeroed tail is written out only if the underlying
> block is already allocated and marked as written at the time of zeroing.
> And the blocks underlying intermediate i_size values will never be allocated
> and written without advancing i_disksize to them. So I think you're
> correct, we always have at most one tail folio - the one surrounding
> current i_disksize - which needs to be written out to safely advance
> i_disksize and we don't care about folios inbetween.
>
>>> When io_end processing happens after completed
>>> folio writeback, we update i_disksize to min(i_size, end of IO).
>>
>> Yeah, in the case of append write back. Assume we append write the folio 2
>> and folio 3,
>>
>> old_idisksize  new_isize
>>      |             |
>>   [WWZZ][WWWW][WWWW]
>>    1 |    2     3
>>      A
>>
>> Assume that folio 1 first completes the writeback, then we update i_disksize
>> to pos A when the writeback is complete. Assume that folio 2 or 3 completes
>> first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
>> something like) folio 1 to complete and then update i_disksize to new_isize.
>>
>> But in the case of truncate up, We will only write back this zeroed folio. If
>> the new i_size exceeds the end of this folio, how should we update i_disksize
>> to the correct value?
>>
>> For example, we truncate the file from old_idisksize to new_isize, but we
>> only zero and writeback folio 1, in the end_io processing of folio 1, we can
>> only update the i_disksize to A, but we can never update it to new_isize. Am
>> I missing something ?
>>
>> old_idisksize  new_isize
>>      |             |
>>   [WWZZ]...hole ...
>>    1 |
>>      A
>
> Good question. Based on the analysis above one option would be to setup
> writeback of page straddling current i_disksize to update i_disksize to
> current i_size on completion. That would be simple but would have an
> unpleasant side effect that in case of a crash after append write we could
> see increased i_disksize but zeros instead of written data. Another option
> would be to update i_disksize on completion to the beginning of the first
> dirty folio behind the written back range or i_size if there's no such
> folio. This would still be relatively simple and mostly deal with "zeros
> instead of data" problem.

Ha, good idea! I think it should work. I will try the second option, thank
you a lot for this suggestion. :)

>
>>> This
>>> should take care of non-zero data exposure issues and with "delay map"
>>> processing Baokun works on all the inode metadata updates will happen after
>>> IO completion anyway so it will be nicely batched up in one transaction.
>>
>> Currently, my iomap convert implementation always enables dioread_nolock,
>
> Yes, BTW I think you could remove no-dioread_nolock paths before doing the
> conversion to simplify matters a bit. I don't think it's seriously used
> anywhere anymore.

Sure. After removing the no-dioread_nolock paths, the behavior of the
buffer_head path (extents-based and no-journal data mode) and the iomap
path in append write and truncate operations can be made consistent.

Cheers,
Yi.

>> so I feel that this solution can be achieved even without the "delay map"
>> feature. After we have the "delay map", we can extend this to the
>> buffer_head path.
>
> I agree, delay map is not necessary for this to work. But it will make
> things likely faster.
>
> Honza
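Jan's pwrite example earlier in the thread is easy to reproduce from user space. This throwaway Python script (illustrative only; it demonstrates the visible file semantics, not the kernel internals under discussion) shows that the gaps between the sparse single-byte writes read back as zeros, which is exactly the guarantee the i_disksize ordering has to preserve across a crash:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 2048)            # initial file size: 2k
    for off in (6 * 1024, 10 * 1024, 14 * 1024):
        os.pwrite(fd, b"y", off)         # each write extends the file

    size = os.fstat(fd).st_size
    assert size == 14 * 1024 + 1         # i_size ends at 14k + 1

    data = os.pread(fd, size, 0)
    assert data[2048:6144] == b"\0" * 4096   # gap reads back as zeros
    assert data[6 * 1024] == ord("y")        # the byte written at 6k
finally:
    os.close(fd)
    os.remove(path)
```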
On 2026-02-04 22:18, Jan Kara wrote:
> Hi Zhang!
>
> On Wed 04-02-26 14:42:46, Zhang Yi wrote:
>> On 2/3/2026 5:59 PM, Jan Kara wrote:
>>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>>>> Currently, __ext4_block_zero_page_range() is called in the following
>>>> four cases to zero out the data in partial blocks:
>>>>
>>>> 1. Truncate down.
>>>> 2. Truncate up.
>>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>>>> range extending beyond the end of the file (EOF).
>>>> 4. Partial block punch hole.
>>>>
>>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>>>> will write back the zeroed data to the disk through the order mode after
>>>> zeroing out.
>>>>
>>>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>>>> this ordered write. Assuming no one intentionally bypasses the file
>>>> system to write directly to the disk. When performing a truncate down
>>>> operation, ensuring that the data beyond the EOF is zeroed out before
>>>> updating i_disksize is sufficient to prevent old data from being exposed
>>>> when the file is later extended. In other words, as long as the on-disk
>>>> data in case 1 can be properly zeroed out, only the data in memory needs
>>>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>>> Hum, I'm not sure this is correct. The tail block of the file is not
>>> necessarily zeroed out beyond EOF (as mmap writes can race with page
>>> writeback and modify the tail block contents beyond EOF before we really
>>> submit it to the device). Thus after this commit if you truncate up, just
>>> zero out the newly exposed contents in the page cache and dirty it, then
>>> the transaction with the i_disksize update commits (I see nothing
>>> preventing it) and then you crash, you can observe file with the new size
>>> but non-zero content in the newly exposed area. Am I missing something?
>>>
>> Well, I think you are right! I missed the mmap write race condition that
>> happens during the writeback submitting I/O. Thank you a lot for pointing
>> this out. I thought of two possible solutions:
>>
>> 1. We also add explicit writeback operations to the truncate-up and
>> post-EOF append writes. This solution is the most straightforward but
>> may cause some performance overhead. However, since at most only one
>> block is written, the impact is likely limited. Additionally, I
>> observed that the implementation of the XFS file system also adopts a
>> similar approach in its truncate up and down operation. (But it is
>> somewhat strange that XFS also appears to have the same issue with
>> post-EOF append writes; it only zero out the partial block in
>> xfs_file_write_checks(), but it neither explicitly writeback zeroed
>> data nor employs any other mechanism to ensure that the zero data
>> writebacks before the metadata is written to disk.)
>>
>> 2. Resolve this race condition, ensure that there are no non-zero data
>> in the post-EOF partial blocks on the disk. I observed that after the
>> writeback holds the folio lock and calls folio_clear_dirty_for_io(),
>> mmap writes will re-trigger the page fault. Perhaps we can filter out
>> the EOF folio based on i_size in ext4_page_mkwrite(),
>> block_page_mkwrite() and iomap_page_mkwrite(), and then call
>> folio_wait_writeback() to wait for this partial folio writeback to
>> complete. This seems can break the race condition without introducing
>> too much overhead (no?).
>>
>> What do you think? Any other suggestions are also welcome.
> Hum, I like the option 2 because IMO non-zero data beyond EOF is a
> corner-case quirk which unnecessarily complicates rather common paths. But
> I'm not sure we can easily get rid of it. It can happen for example when
> you do appending write inside a block. The page is written back but before
> the transaction with i_disksize update commits we crash. Then again we have
> a non-zero content inside the block beyond EOF.
>
> So the only realistic option I see is to ensure tail of the block gets
> zeroed on disk before the transaction with i_disksize update commits in the
> cases of truncate up or write beyond EOF. data=ordered mode machinery is an
> asynchronous way how to achieve this. We could also just synchronously
> writeback the block where needed but the latency hit of such operation is
> going to be significant so I'm quite sure some workload somewhere will
> notice although the truncate up / write beyond EOF operations triggering this
> are not too common. So why do you need to get rid of these data=ordered
> mode usages? I guess because with iomap keeping our transaction handle ->
> folio lock ordering is complicated? Last time I looked it seemed still
> possible to keep it though.
>
> Another possibility would be to just *submit* the write synchronously and
> use data=ordered mode machinery only to wait for IO to complete before the
> transaction commits. That way it should be safe to start a transaction
> while holding folio lock and thus the iomap conversion would be easier.
>
> Honza

Can we treat EOF blocks as metadata and update them in the same
transaction as i_disksize? Although this would introduce some
management and journaling overhead, it could avoid the deadlock
of "writeback -> start handle -> trigger writeback".

Regards,
Baokun
On Thu 05-02-26 11:27:09, Baokun Li wrote:
> On 2026-02-04 22:18, Jan Kara wrote:
> > Hi Zhang!
> >
> > On Wed 04-02-26 14:42:46, Zhang Yi wrote:
> >> On 2/3/2026 5:59 PM, Jan Kara wrote:
> >>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
> >>>> Currently, __ext4_block_zero_page_range() is called in the following
> >>>> four cases to zero out the data in partial blocks:
> >>>>
> >>>> 1. Truncate down.
> >>>> 2. Truncate up.
> >>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
> >>>> range extending beyond the end of the file (EOF).
> >>>> 4. Partial block punch hole.
> >>>>
> >>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
> >>>> will write back the zeroed data to the disk through the order mode after
> >>>> zeroing out.
> >>>>
> >>>> Among the cases 1,2 and 3 described above, only case 1 actually requires
> >>>> this ordered write. Assuming no one intentionally bypasses the file
> >>>> system to write directly to the disk. When performing a truncate down
> >>>> operation, ensuring that the data beyond the EOF is zeroed out before
> >>>> updating i_disksize is sufficient to prevent old data from being exposed
> >>>> when the file is later extended. In other words, as long as the on-disk
> >>>> data in case 1 can be properly zeroed out, only the data in memory needs
> >>>> to be zeroed out in cases 2 and 3, without requiring ordered data.
> >>> Hum, I'm not sure this is correct. The tail block of the file is not
> >>> necessarily zeroed out beyond EOF (as mmap writes can race with page
> >>> writeback and modify the tail block contents beyond EOF before we really
> >>> submit it to the device). Thus after this commit if you truncate up, just
> >>> zero out the newly exposed contents in the page cache and dirty it, then
> >>> the transaction with the i_disksize update commits (I see nothing
> >>> preventing it) and then you crash, you can observe file with the new size
> >>> but non-zero content in the newly exposed area. Am I missing something?
> >>>
> >> Well, I think you are right! I missed the mmap write race condition that
> >> happens during the writeback submitting I/O. Thank you a lot for pointing
> >> this out. I thought of two possible solutions:
> >>
> >> 1. We also add explicit writeback operations to the truncate-up and
> >> post-EOF append writes. This solution is the most straightforward but
> >> may cause some performance overhead. However, since at most only one
> >> block is written, the impact is likely limited. Additionally, I
> >> observed that the implementation of the XFS file system also adopts a
> >> similar approach in its truncate up and down operation. (But it is
> >> somewhat strange that XFS also appears to have the same issue with
> >> post-EOF append writes; it only zero out the partial block in
> >> xfs_file_write_checks(), but it neither explicitly writeback zeroed
> >> data nor employs any other mechanism to ensure that the zero data
> >> writebacks before the metadata is written to disk.)
> >>
> >> 2. Resolve this race condition, ensure that there are no non-zero data
> >> in the post-EOF partial blocks on the disk. I observed that after the
> >> writeback holds the folio lock and calls folio_clear_dirty_for_io(),
> >> mmap writes will re-trigger the page fault. Perhaps we can filter out
> >> the EOF folio based on i_size in ext4_page_mkwrite(),
> >> block_page_mkwrite() and iomap_page_mkwrite(), and then call
> >> folio_wait_writeback() to wait for this partial folio writeback to
> >> complete. This seems can break the race condition without introducing
> >> too much overhead (no?).
> >>
> >> What do you think? Any other suggestions are also welcome.
> > Hum, I like the option 2 because IMO non-zero data beyond EOF is a
> > corner-case quirk which unnecessarily complicates rather common paths. But
> > I'm not sure we can easily get rid of it. It can happen for example when
> > you do appending write inside a block. The page is written back but before
> > the transaction with i_disksize update commits we crash. Then again we have
> > a non-zero content inside the block beyond EOF.
> >
> > So the only realistic option I see is to ensure tail of the block gets
> > zeroed on disk before the transaction with i_disksize update commits in the
> > cases of truncate up or write beyond EOF. data=ordered mode machinery is an
> > asynchronous way how to achieve this. We could also just synchronously
> > writeback the block where needed but the latency hit of such operation is
> > going to be significant so I'm quite sure some workload somewhere will
> > notice although the truncate up / write beyond EOF operations triggering this
> > are not too common. So why do you need to get rid of these data=ordered
> > mode usages? I guess because with iomap keeping our transaction handle ->
> > folio lock ordering is complicated? Last time I looked it seemed still
> > possible to keep it though.
> >
> > Another possibility would be to just *submit* the write synchronously and
> > use data=ordered mode machinery only to wait for IO to complete before the
> > transaction commits. That way it should be safe to start a transaction
> > while holding folio lock and thus the iomap conversion would be easier.
> >
> > Honza
>
> Can we treat EOF blocks as metadata and update them in the same
> transaction as i_disksize? Although this would introduce some
> management and journaling overhead, it could avoid the deadlock
> of "writeback -> start handle -> trigger writeback".

No, IMHO that would get too difficult. Just look at the hoops data=journal
mode has to jump through to make page cache handling work with the
journalling machinery. And you'd now have that for all the inodes. So I
think some form of data=ordered machinery is much simpler to reason about.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2026-02-05 22:07, Jan Kara wrote: > On Thu 05-02-26 11:27:09, Baokun Li wrote: >> On 2026-02-04 22:18, Jan Kara wrote: >>> Hi Zhang! >>> >>> On Wed 04-02-26 14:42:46, Zhang Yi wrote: >>>> On 2/3/2026 5:59 PM, Jan Kara wrote: >>>>> On Tue 03-02-26 14:25:03, Zhang Yi wrote: >>>>>> Currently, __ext4_block_zero_page_range() is called in the following >>>>>> four cases to zero out the data in partial blocks: >>>>>> >>>>>> 1. Truncate down. >>>>>> 2. Truncate up. >>>>>> 3. Perform block allocation (e.g., fallocate) or append writes across a >>>>>> range extending beyond the end of the file (EOF). >>>>>> 4. Partial block punch hole. >>>>>> >>>>>> If the default ordered data mode is used, __ext4_block_zero_page_range() >>>>>> will write back the zeroed data to the disk through the order mode after >>>>>> zeroing out. >>>>>> >>>>>> Among the cases 1,2 and 3 described above, only case 1 actually requires >>>>>> this ordered write. Assuming no one intentionally bypasses the file >>>>>> system to write directly to the disk. When performing a truncate down >>>>>> operation, ensuring that the data beyond the EOF is zeroed out before >>>>>> updating i_disksize is sufficient to prevent old data from being exposed >>>>>> when the file is later extended. In other words, as long as the on-disk >>>>>> data in case 1 can be properly zeroed out, only the data in memory needs >>>>>> to be zeroed out in cases 2 and 3, without requiring ordered data. >>>>> Hum, I'm not sure this is correct. The tail block of the file is not >>>>> necessarily zeroed out beyond EOF (as mmap writes can race with page >>>>> writeback and modify the tail block contents beyond EOF before we really >>>>> submit it to the device). 
Thus after this commit if you truncate up, just >>>>> zero out the newly exposed contents in the page cache and dirty it, then >>>>> the transaction with the i_disksize update commits (I see nothing >>>>> preventing it) and then you crash, you can observe file with the new size >>>>> but non-zero content in the newly exposed area. Am I missing something? >>>>> >>>> Well, I think you are right! I missed the mmap write race condition that >>>> happens during the writeback submitting I/O. Thank you a lot for pointing >>>> this out. I thought of two possible solutions: >>>> >>>> 1. We also add explicit writeback operations to the truncate-up and >>>> post-EOF append writes. This solution is the most straightforward but >>>> may cause some performance overhead. However, since at most only one >>>> block is written, the impact is likely limited. Additionally, I >>>> observed that the implementation of the XFS file system also adopts a >>>> similar approach in its truncate up and down operation. (But it is >>>> somewhat strange that XFS also appears to have the same issue with >>>> post-EOF append writes; it only zero out the partial block in >>>> xfs_file_write_checks(), but it neither explicitly writeback zeroed >>>> data nor employs any other mechanism to ensure that the zero data >>>> writebacks before the metadata is written to disk.) >>>> >>>> 2. Resolve this race condition, ensure that there are no non-zero data >>>> in the post-EOF partial blocks on the disk. I observed that after the >>>> writeback holds the folio lock and calls folio_clear_dirty_for_io(), >>>> mmap writes will re-trigger the page fault. Perhaps we can filter out >>>> the EOF folio based on i_size in ext4_page_mkwrite(), >>>> block_page_mkwrite() and iomap_page_mkwrite(), and then call >>>> folio_wait_writeback() to wait for this partial folio writeback to >>>> complete. This seems can break the race condition without introducing >>>> too much overhead (no?). >>>> >>>> What do you think? 
>>>> Any other suggestions are also welcome.
>>> Hum, I like option 2 because IMO non-zero data beyond EOF is a
>>> corner-case quirk which unnecessarily complicates rather common paths. But
>>> I'm not sure we can easily get rid of it. It can happen, for example, when
>>> you do an appending write inside a block. The page is written back, but
>>> before the transaction with the i_disksize update commits, we crash. Then
>>> again we have non-zero content inside the block beyond EOF.
>>>
>>> So the only realistic option I see is to ensure the tail of the block gets
>>> zeroed on disk before the transaction with the i_disksize update commits
>>> in the cases of truncate up or write beyond EOF. The data=ordered mode
>>> machinery is an asynchronous way to achieve this. We could also just
>>> synchronously write back the block where needed, but the latency hit of
>>> such an operation is going to be significant, so I'm quite sure some
>>> workload somewhere will notice, although the truncate up / write beyond
>>> EOF operations triggering this are not too common. So why do you need to
>>> get rid of these data=ordered mode usages? I guess because with iomap
>>> keeping our transaction handle -> folio lock ordering is complicated?
>>> Last time I looked it seemed still possible to keep it, though.
>>>
>>> Another possibility would be to just *submit* the write synchronously and
>>> use the data=ordered mode machinery only to wait for the IO to complete
>>> before the transaction commits. That way it should be safe to start a
>>> transaction while holding the folio lock, and thus the iomap conversion
>>> would be easier.
>>>
>>> 								Honza
>> Can we treat EOF blocks as metadata and update them in the same
>> transaction as i_disksize? Although this would introduce some
>> management and journaling overhead, it could avoid the deadlock
>> of "writeback -> start handle -> trigger writeback".
> No, IMHO that would get too difficult.
> Just look at the hoops data=journal
> mode has to jump through to make page cache handling work with the
> journalling machinery. And you'd now have that for all the inodes. So I
> think some form of data=ordered machinery is much simpler to reason about.
>
> 								Honza

Indeed, this is a bit tricky.

Regards,
Baokun