Currently, __ext4_block_zero_page_range() is called in the following
four cases to zero out the data in partial blocks:

1. Truncate down.
2. Truncate up.
3. Perform block allocation (e.g., fallocate) or append writes across a
   range extending beyond the end of the file (EOF).
4. Partial block punch hole.

If the default ordered data mode is used, __ext4_block_zero_page_range()
writes the zeroed data back to disk through the ordered data machinery
after zeroing out.

Among cases 1, 2 and 3 described above, only case 1 actually requires
this ordered write, assuming no one intentionally bypasses the file
system to write directly to the disk. When performing a truncate down
operation, ensuring that the data beyond the new EOF is zeroed out
before updating i_disksize is sufficient to prevent old data from being
exposed when the file is later extended. In other words, as long as the
on-disk data in case 1 is properly zeroed out, only the data in memory
needs to be zeroed out in cases 2 and 3, without requiring ordered data.

Case 4 does not require ordered data because the entire punch hole
operation does not provide atomicity guarantees. Therefore, it's safe to
move the ordered data operation from __ext4_block_zero_page_range() to
ext4_truncate().

It should be noted that after this change, we can only determine whether
to perform ordered data operations based on whether the target block has
been zeroed, rather than on the state of the buffer head. Consequently,
unnecessary ordered data operations may occur when truncating an
unwritten dirty block. However, this scenario is relatively rare, so the
overall impact is minimal.

This prepares for the conversion to the iomap infrastructure, since
iomap doesn't use ordered data mode and requires active writeback;
removing the ordered work here reduces the complexity of the conversion.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f856ea015263..20b60abcf777 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
folio_zero_range(folio, offset, length);
BUFFER_TRACE(bh, "zeroed end of block");
- if (ext4_should_journal_data(inode)) {
+ if (ext4_should_journal_data(inode))
err = ext4_dirty_journalled_data(handle, bh);
- } else {
+ else
mark_buffer_dirty(bh);
- /*
- * Only the written block requires ordered data to prevent
- * exposing stale data.
- */
- if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
- ext4_should_order_data(inode))
- err = ext4_jbd2_inode_add_write(handle, inode, from,
- length);
- }
if (!err && did_zero)
*did_zero = true;
@@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
goto out_trace;
}
- if (inode->i_size & (inode->i_sb->s_blocksize - 1))
- ext4_block_truncate_page(handle, mapping, inode->i_size);
+ if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
+ unsigned int zero_len;
+
+ zero_len = ext4_block_truncate_page(handle, mapping,
+ inode->i_size);
+ if (zero_len < 0) {
+ err = zero_len;
+ goto out_stop;
+ }
+ if (zero_len && !IS_DAX(inode) &&
+ ext4_should_order_data(inode)) {
+ err = ext4_jbd2_inode_add_write(handle, inode,
+ inode->i_size, zero_len);
+ if (err)
+ goto out_stop;
+ }
+ }
/*
* We add the inode to the orphan list, so that if this
--
2.52.0
Hi Zhang,
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20260202]
url: https://github.com/intel-lab-lkp/linux/commits/Zhang-Yi/ext4-make-ext4_block_zero_page_range-pass-out-did_zero/20260203-144244
base: next-20260202
patch link: https://lore.kernel.org/r/20260203062523.3869120-4-yi.zhang%40huawei.com
patch subject: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
config: riscv-randconfig-r071-20260204 (https://download.01.org/0day-ci/archive/20260204/202602041239.JFyNwVcg-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9b8addffa70cee5b2acc5454712d9cf78ce45710)
smatch version: v0.5.0-8994-gd50c5a4c
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602041239.JFyNwVcg-lkp@intel.com/
New smatch warnings:
fs/ext4/inode.c:4577 ext4_truncate() warn: unsigned 'zero_len' is never less than zero.
Old smatch warnings:
fs/ext4/inode.c:2651 mpage_prepare_extent_to_map() warn: missing error code 'err'
fs/ext4/inode.c:5129 check_igot_inode() warn: missing unwind goto?
vim +/zero_len +4577 fs/ext4/inode.c
4494
4495 /*
4496 * ext4_truncate()
4497 *
4498 * We block out ext4_get_block() block instantiations across the entire
4499 * transaction, and VFS/VM ensures that ext4_truncate() cannot run
4500 * simultaneously on behalf of the same inode.
4501 *
4502 * As we work through the truncate and commit bits of it to the journal there
4503 * is one core, guiding principle: the file's tree must always be consistent on
4504 * disk. We must be able to restart the truncate after a crash.
4505 *
4506 * The file's tree may be transiently inconsistent in memory (although it
4507 * probably isn't), but whenever we close off and commit a journal transaction,
4508 * the contents of (the filesystem + the journal) must be consistent and
4509 * restartable. It's pretty simple, really: bottom up, right to left (although
4510 * left-to-right works OK too).
4511 *
4512 * Note that at recovery time, journal replay occurs *before* the restart of
4513 * truncate against the orphan inode list.
4514 *
4515 * The committed inode has the new, desired i_size (which is the same as
4516 * i_disksize in this case). After a crash, ext4_orphan_cleanup() will see
4517 * that this inode's truncate did not complete and it will again call
4518 * ext4_truncate() to have another go. So there will be instantiated blocks
4519 * to the right of the truncation point in a crashed ext4 filesystem. But
4520 * that's fine - as long as they are linked from the inode, the post-crash
4521 * ext4_truncate() run will find them and release them.
4522 */
4523 int ext4_truncate(struct inode *inode)
4524 {
4525 struct ext4_inode_info *ei = EXT4_I(inode);
4526 unsigned int credits;
4527 int err = 0, err2;
4528 handle_t *handle;
4529 struct address_space *mapping = inode->i_mapping;
4530
4531 /*
4532 * There is a possibility that we're either freeing the inode
4533 * or it's a completely new inode. In those cases we might not
4534 * have i_rwsem locked because it's not necessary.
4535 */
4536 if (!(inode_state_read_once(inode) & (I_NEW | I_FREEING)))
4537 WARN_ON(!inode_is_locked(inode));
4538 trace_ext4_truncate_enter(inode);
4539
4540 if (!ext4_can_truncate(inode))
4541 goto out_trace;
4542
4543 if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC))
4544 ext4_set_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE);
4545
4546 if (ext4_has_inline_data(inode)) {
4547 int has_inline = 1;
4548
4549 err = ext4_inline_data_truncate(inode, &has_inline);
4550 if (err || has_inline)
4551 goto out_trace;
4552 }
4553
4554 /* If we zero-out tail of the page, we have to create jinode for jbd2 */
4555 if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
4556 err = ext4_inode_attach_jinode(inode);
4557 if (err)
4558 goto out_trace;
4559 }
4560
4561 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
4562 credits = ext4_chunk_trans_extent(inode, 1);
4563 else
4564 credits = ext4_blocks_for_truncate(inode);
4565
4566 handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
4567 if (IS_ERR(handle)) {
4568 err = PTR_ERR(handle);
4569 goto out_trace;
4570 }
4571
4572 if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
4573 unsigned int zero_len;
4574
4575 zero_len = ext4_block_truncate_page(handle, mapping,
4576 inode->i_size);
> 4577 if (zero_len < 0) {
4578 err = zero_len;
4579 goto out_stop;
4580 }
4581 if (zero_len && !IS_DAX(inode) &&
4582 ext4_should_order_data(inode)) {
4583 err = ext4_jbd2_inode_add_write(handle, inode,
4584 inode->i_size, zero_len);
4585 if (err)
4586 goto out_stop;
4587 }
4588 }
4589
4590 /*
4591 * We add the inode to the orphan list, so that if this
4592 * truncate spans multiple transactions, and we crash, we will
4593 * resume the truncate when the filesystem recovers. It also
4594 * marks the inode dirty, to catch the new size.
4595 *
4596 * Implication: the file must always be in a sane, consistent
4597 * truncatable state while each transaction commits.
4598 */
4599 err = ext4_orphan_add(handle, inode);
4600 if (err)
4601 goto out_stop;
4602
4603 ext4_fc_track_inode(handle, inode);
4604 ext4_check_map_extents_env(inode);
4605
4606 down_write(&EXT4_I(inode)->i_data_sem);
4607 ext4_discard_preallocations(inode);
4608
4609 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
4610 err = ext4_ext_truncate(handle, inode);
4611 else
4612 ext4_ind_truncate(handle, inode);
4613
4614 up_write(&ei->i_data_sem);
4615 if (err)
4616 goto out_stop;
4617
4618 if (IS_SYNC(inode))
4619 ext4_handle_sync(handle);
4620
4621 out_stop:
4622 /*
4623 * If this was a simple ftruncate() and the file will remain alive,
4624 * then we need to clear up the orphan record which we created above.
4625 * However, if this was a real unlink then we were called by
4626 * ext4_evict_inode(), and we allow that function to clean up the
4627 * orphan info for us.
4628 */
4629 if (inode->i_nlink)
4630 ext4_orphan_del(handle, inode);
4631
4632 inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
4633 err2 = ext4_mark_inode_dirty(handle, inode);
4634 if (unlikely(err2 && !err))
4635 err = err2;
4636 ext4_journal_stop(handle);
4637
4638 out_trace:
4639 trace_ext4_truncate_exit(inode);
4640 return err;
4641 }
4642
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Tue 03-02-26 14:25:03, Zhang Yi wrote:
> Currently, __ext4_block_zero_page_range() is called in the following
> four cases to zero out the data in partial blocks:
>
> 1. Truncate down.
> 2. Truncate up.
> 3. Perform block allocation (e.g., fallocate) or append writes across a
> range extending beyond the end of the file (EOF).
> 4. Partial block punch hole.
>
> If the default ordered data mode is used, __ext4_block_zero_page_range()
> will write back the zeroed data to the disk through the order mode after
> zeroing out.
>
> Among the cases 1,2 and 3 described above, only case 1 actually requires
> this ordered write. Assuming no one intentionally bypasses the file
> system to write directly to the disk. When performing a truncate down
> operation, ensuring that the data beyond the EOF is zeroed out before
> updating i_disksize is sufficient to prevent old data from being exposed
> when the file is later extended. In other words, as long as the on-disk
> data in case 1 can be properly zeroed out, only the data in memory needs
> to be zeroed out in cases 2 and 3, without requiring ordered data.
Hum, I'm not sure this is correct. The tail block of the file is not
necessarily zeroed out beyond EOF (as mmap writes can race with page
writeback and modify the tail block contents beyond EOF before we really
submit it to the device). Thus after this commit if you truncate up, just
zero out the newly exposed contents in the page cache and dirty it, then
the transaction with the i_disksize update commits (I see nothing
preventing it) and then you crash, you can observe file with the new size
but non-zero content in the newly exposed area. Am I missing something?
> Case 4 does not require ordered data because the entire punch hole
> operation does not provide atomicity guarantees. Therefore, it's safe to
> move the ordered data operation from __ext4_block_zero_page_range() to
> ext4_truncate().
I agree hole punching can already expose intermediate results in case of
crash so there removing the ordered mode handling is safe.
Honza
> It should be noted that after this change, we can only determine whether
> to perform ordered data operations based on whether the target block has
> been zeroed, rather than on the state of the buffer head. Consequently,
> unnecessary ordered data operations may occur when truncating an
> unwritten dirty block. However, this scenario is relatively rare, so the
> overall impact is minimal.
>
> This is prepared for the conversion to the iomap infrastructure since it
> doesn't use ordered data mode and requires active writeback, which
> reduces the complexity of the conversion.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/inode.c | 32 +++++++++++++++++++-------------
> 1 file changed, 19 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f856ea015263..20b60abcf777 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
> folio_zero_range(folio, offset, length);
> BUFFER_TRACE(bh, "zeroed end of block");
>
> - if (ext4_should_journal_data(inode)) {
> + if (ext4_should_journal_data(inode))
> err = ext4_dirty_journalled_data(handle, bh);
> - } else {
> + else
> mark_buffer_dirty(bh);
> - /*
> - * Only the written block requires ordered data to prevent
> - * exposing stale data.
> - */
> - if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
> - ext4_should_order_data(inode))
> - err = ext4_jbd2_inode_add_write(handle, inode, from,
> - length);
> - }
> if (!err && did_zero)
> *did_zero = true;
>
> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
> goto out_trace;
> }
>
> - if (inode->i_size & (inode->i_sb->s_blocksize - 1))
> - ext4_block_truncate_page(handle, mapping, inode->i_size);
> + if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
> + unsigned int zero_len;
> +
> + zero_len = ext4_block_truncate_page(handle, mapping,
> + inode->i_size);
> + if (zero_len < 0) {
> + err = zero_len;
> + goto out_stop;
> + }
> + if (zero_len && !IS_DAX(inode) &&
> + ext4_should_order_data(inode)) {
> + err = ext4_jbd2_inode_add_write(handle, inode,
> + inode->i_size, zero_len);
> + if (err)
> + goto out_stop;
> + }
> + }
>
> /*
> * We add the inode to the orphan list, so that if this
> --
> 2.52.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
Hi, Jan!
On 2/3/2026 5:59 PM, Jan Kara wrote:
> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>> Currently, __ext4_block_zero_page_range() is called in the following
>> four cases to zero out the data in partial blocks:
>>
>> 1. Truncate down.
>> 2. Truncate up.
>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>> range extending beyond the end of the file (EOF).
>> 4. Partial block punch hole.
>>
>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>> will write back the zeroed data to the disk through the order mode after
>> zeroing out.
>>
>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>> this ordered write. Assuming no one intentionally bypasses the file
>> system to write directly to the disk. When performing a truncate down
>> operation, ensuring that the data beyond the EOF is zeroed out before
>> updating i_disksize is sufficient to prevent old data from being exposed
>> when the file is later extended. In other words, as long as the on-disk
>> data in case 1 can be properly zeroed out, only the data in memory needs
>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>
> Hum, I'm not sure this is correct. The tail block of the file is not
> necessarily zeroed out beyond EOF (as mmap writes can race with page
> writeback and modify the tail block contents beyond EOF before we really
> submit it to the device). Thus after this commit if you truncate up, just
> zero out the newly exposed contents in the page cache and dirty it, then
> the transaction with the i_disksize update commits (I see nothing
> preventing it) and then you crash, you can observe file with the new size
> but non-zero content in the newly exposed area. Am I missing something?
>
Well, I think you are right! I missed the mmap write race condition that
can happen while writeback is submitting I/O. Thank you a lot for
pointing this out. I thought of two possible solutions:

1. We also add explicit writeback operations to the truncate-up and
   post-EOF append writes. This solution is the most straightforward but
   may cause some performance overhead. However, since at most only one
   block is written, the impact is likely limited. Additionally, I
   observed that the XFS file system adopts a similar approach in its
   truncate up and down operations. (But it is somewhat strange that XFS
   also appears to have the same issue with post-EOF append writes; it
   only zeroes out the partial block in xfs_file_write_checks(), and it
   neither explicitly writes back the zeroed data nor employs any other
   mechanism to ensure that the zeroed data reaches disk before the
   metadata is written.)

2. Resolve this race condition, ensuring that there is no non-zero data
   in the post-EOF partial blocks on the disk. I observed that after
   writeback holds the folio lock and calls folio_clear_dirty_for_io(),
   mmap writes will re-trigger the page fault. Perhaps we can filter out
   the EOF folio based on i_size in ext4_page_mkwrite(),
   block_page_mkwrite() and iomap_page_mkwrite(), and then call
   folio_wait_writeback() to wait for this partial folio's writeback to
   complete. This seems able to break the race condition without
   introducing too much overhead (no?).
What do you think? Any other suggestions are also welcome.
Thanks,
Yi.
>> Case 4 does not require ordered data because the entire punch hole
>> operation does not provide atomicity guarantees. Therefore, it's safe to
>> move the ordered data operation from __ext4_block_zero_page_range() to
>> ext4_truncate().
>
> I agree hole punching can already expose intermediate results in case of
> crash so there removing the ordered mode handling is safe.
>
> Honza
>
>> It should be noted that after this change, we can only determine whether
>> to perform ordered data operations based on whether the target block has
>> been zeroed, rather than on the state of the buffer head. Consequently,
>> unnecessary ordered data operations may occur when truncating an
>> unwritten dirty block. However, this scenario is relatively rare, so the
>> overall impact is minimal.
>>
>> This is prepared for the conversion to the iomap infrastructure since it
>> doesn't use ordered data mode and requires active writeback, which
>> reduces the complexity of the conversion.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> fs/ext4/inode.c | 32 +++++++++++++++++++-------------
>> 1 file changed, 19 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index f856ea015263..20b60abcf777 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>> folio_zero_range(folio, offset, length);
>> BUFFER_TRACE(bh, "zeroed end of block");
>>
>> - if (ext4_should_journal_data(inode)) {
>> + if (ext4_should_journal_data(inode))
>> err = ext4_dirty_journalled_data(handle, bh);
>> - } else {
>> + else
>> mark_buffer_dirty(bh);
>> - /*
>> - * Only the written block requires ordered data to prevent
>> - * exposing stale data.
>> - */
>> - if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
>> - ext4_should_order_data(inode))
>> - err = ext4_jbd2_inode_add_write(handle, inode, from,
>> - length);
>> - }
>> if (!err && did_zero)
>> *did_zero = true;
>>
>> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
>> goto out_trace;
>> }
>>
>> - if (inode->i_size & (inode->i_sb->s_blocksize - 1))
>> - ext4_block_truncate_page(handle, mapping, inode->i_size);
>> + if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
>> + unsigned int zero_len;
>> +
>> + zero_len = ext4_block_truncate_page(handle, mapping,
>> + inode->i_size);
>> + if (zero_len < 0) {
>> + err = zero_len;
>> + goto out_stop;
>> + }
>> + if (zero_len && !IS_DAX(inode) &&
>> + ext4_should_order_data(inode)) {
>> + err = ext4_jbd2_inode_add_write(handle, inode,
>> + inode->i_size, zero_len);
>> + if (err)
>> + goto out_stop;
>> + }
>> + }
>>
>> /*
>> * We add the inode to the orphan list, so that if this
>> --
>> 2.52.0
>>
Hi Zhang!

On Wed 04-02-26 14:42:46, Zhang Yi wrote:
> On 2/3/2026 5:59 PM, Jan Kara wrote:
> > On Tue 03-02-26 14:25:03, Zhang Yi wrote:
[...]
> Well, I think you are right! I missed the mmap write race condition that
> happens during the writeback submitting I/O. Thank you a lot for pointing
> this out. I thought of two possible solutions:
>
> 1. We also add explicit writeback operations to the truncate-up and
>    post-EOF append writes.
[...]
> 2. Resolve this race condition, ensure that there are no non-zero data
>    in the post-EOF partial blocks on the disk.
[...]
> What do you think? Any other suggestions are also welcome.

Hum, I like the option 2 because IMO non-zero data beyond EOF is a
corner-case quirk which unnecessarily complicates rather common paths. But
I'm not sure we can easily get rid of it. It can happen for example when
you do appending write inside a block. The page is written back but before
the transaction with i_disksize update commits we crash. Then again we have
a non-zero content inside the block beyond EOF.

So the only realistic option I see is to ensure tail of the block gets
zeroed on disk before the transaction with i_disksize update commits in the
cases of truncate up or write beyond EOF. data=ordered mode machinery is an
asynchronous way how to achieve this. We could also just synchronously
writeback the block where needed but the latency hit of such operation is
going to be significant so I'm quite sure some workload somewhere will
notice although the truncate up / write beyond EOF operations triggering
this are not too common.

So why do you need to get rid of these data=ordered mode usages? I guess
because with iomap keeping our transaction handle -> folio lock ordering is
complicated? Last time I looked it seemed still possible to keep it though.

Another possibility would be to just *submit* the write synchronously and
use data=ordered mode machinery only to wait for IO to complete before the
transaction commits. That way it should be safe to start a transaction
while holding folio lock and thus the iomap conversion would be easier.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2/4/2026 10:18 PM, Jan Kara wrote:
> On Wed 04-02-26 14:42:46, Zhang Yi wrote:
[...]
> Hum, I like the option 2 because IMO non-zero data beyond EOF is a
> corner-case quirk which unnecessarily complicates rather common paths. But
> I'm not sure we can easily get rid of it. It can happen for example when
> you do appending write inside a block. The page is written back but before
> the transaction with i_disksize update commits we crash. Then again we have
> a non-zero content inside the block beyond EOF.

Yes, indeed. From this perspective, it seems difficult to avoid non-zero
content within the block beyond the EOF.

> So the only realistic option I see is to ensure tail of the block gets
> zeroed on disk before the transaction with i_disksize update commits in the
> cases of truncate up or write beyond EOF. data=ordered mode machinery is an
> asynchronous way how to achieve this. We could also just synchronously
> writeback the block where needed but the latency hit of such operation is
> going to be significant so I'm quite sure some workload somewhere will
> notice although the truncate up / write beyond EOF operations triggering
> this are not too common.

Yes, I agree.

> So why do you need to get rid of these data=ordered mode usages? I guess
> because with iomap keeping our transaction handle -> folio lock ordering is
> complicated? Last time I looked it seemed still possible to keep it though.

Yes, that's one reason. Another reason is that we would also need to
implement partial folio submission for iomap.

When the journal process is waiting for a folio to be written back (which
contains an ordered block), and the folio also contains unmapped blocks
with a block size smaller than the folio size, a deadlock may occur while
mapping the remaining unmapped blocks if the regular writeback process has
already started committing this folio (and set the writeback flag). This
is because the writeback flag is cleared only after the entire folio is
processed and committed. If we want to support partial folio submission
for iomap, we need to be careful to prevent adding additional performance
overhead in the case of severe fragmentation. Therefore, this aspect of
the logic is complicated and subtle. As we discussed in patch 0, if we can
avoid using the data=ordered mode in append write and online defrag, then
this would be the only remaining corner case. I'm not sure if it is worth
implementing this and adjusting the lock ordering.

> Another possibility would be to just *submit* the write synchronously and
> use data=ordered mode machinery only to wait for IO to complete before the
> transaction commits. That way it should be safe to start a transaction
> while holding folio lock and thus the iomap conversion would be easier.

IIUC, this solution seems able to avoid adjusting the lock ordering, but
partial folio submission still needs to be implemented, is my
understanding right? This is because although we have already submitted
this zeroed partial EOF block, when the journal process is waiting for
this folio, the folio is being written back, and there are other blocks in
this folio that still need to be mapped.

Cheers,
Yi.
On Thu 05-02-26 15:50:38, Zhang Yi wrote:
> On 2/4/2026 10:18 PM, Jan Kara wrote:
> > So why do you need to get rid of these data=ordered
> > mode usages? I guess because with iomap keeping our transaction handle ->
> > folio lock ordering is complicated? Last time I looked it seemed still
> > possible to keep it though.
>
> Yes, that's one reason. There's another reason is that we also need to
> implement partial folio submits for iomap.
>
> When the journal process is waiting for a folio to be written back
> (which contains an ordered block), and the folio also contains unmapped
> blocks with a block size smaller than the folio size, if the regular
> writeback process has already started committing this folio (and set the
> writeback flag), then a deadlock may occur while mapping the remaining
> unmapped blocks. This is because the writeback flag is cleared only
> after the entire folio are processed and committed. If we want to support
> partial folio submit for iomap, we need to be careful to prevent adding
> additional performance overhead in the case of severe fragmentation.

Yeah, this logic is currently handled by ext4_bio_write_folio(). And the
deadlocks are currently resolved by grabbing transaction handle before we
go and lock any page for writeback. But I agree that with iomap it may be
tricky to keep this scheme.

> Therefore, this aspect of the logic is complicated and subtle. As we
> discussed in patch 0, if we can avoid using the data=ordered mode in
> append write and online defrag, then this would be the only remaining
> corner case. I'm not sure if it is worth implementing this and adjusting
> the lock ordering.
>
> > Another possibility would be to just *submit* the write synchronously and
> > use data=ordered mode machinery only to wait for IO to complete before the
> > transaction commits. That way it should be safe to start a transaction
>
> IIUC, this solution seems can avoid adjusting the lock ordering, but partial
> folio submission still needs to be implemented, is my understanding right?
> This is because although we have already submitted this zeroed partial EOF
> block, when the journal process is waiting for this folio, this folio is
> being written back, and there are other blocks in this folio that need to be
> mapped.

That's a good question. If we submit the tail folio from truncation code,
we could just submit the full folio write and there's no need to restrict
ourselves only to mapped blocks. But you are correct that if this IO
completes but the folio had holes in it and the hole gets filled in by
write before the transaction with i_disksize update commits, jbd2 commit
could still race with flush worker writing this folio again and the
deadlock could happen. Hrm...

So how about the following: We expand our io_end processing with the
ability to journal i_disksize updates after page writeback completes. Then
when doing truncate up or appending writes, we keep i_disksize at the old
value and just zero folio tails in the page cache, mark the folio dirty and
update i_size. When submitting writeback for a folio beyond current
i_disksize we make sure writepages submits IO for all the folios from
current i_disksize upwards. When io_end processing happens after completed
folio writeback, we update i_disksize to min(i_size, end of IO). This
should take care of non-zero data exposure issues and with "delay map"
processing Baokun works on all the inode metadata updates will happen after
IO completion anyway so it will be nicely batched up in one transaction.
It's a big change but so far I think it should work. What do you think?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
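To make the proposed completion rule concrete, here is a toy Python model (illustrative only, not kernel code; all names are invented) of advancing i_disksize to min(i_size, end of IO) on each completed writeback. It also shows why the naive form of the rule needs refinement, as debated in the replies: an out-of-order completion can advance i_disksize past a folio whose zeroed tail has not been written back yet.

```python
def complete_io(inode, io_start, io_end):
    """Naive io_end rule: advance i_disksize to min(i_size, end of IO)."""
    new_disksize = min(inode["i_size"], io_end)
    if new_disksize > inode["i_disksize"]:
        inode["i_disksize"] = new_disksize

# Append writes extend i_size from 2k to 12k; folios are 4k.
inode = {"i_size": 12 * 1024, "i_disksize": 2 * 1024}

# Folio 2 (8k-12k) completes before the tail folio (0k-4k): the naive
# rule advances i_disksize to 12k although the zeroed tail in 2k-4k has
# not been written back yet.
complete_io(inode, 8 * 1024, 12 * 1024)
print(inode["i_disksize"])  # 12288
```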
On 2/5/2026 11:05 PM, Jan Kara wrote:
> On Thu 05-02-26 15:50:38, Zhang Yi wrote:
>> On 2/4/2026 10:18 PM, Jan Kara wrote:
>>> So why do you need to get rid of these data=ordered
>>> mode usages? I guess because with iomap keeping our transaction handle ->
>>> folio lock ordering is complicated? Last time I looked it seemed still
>>> possible to keep it though.
>>
>> Yes, that's one reason. There's another reason is that we also need to
>> implement partial folio submits for iomap.
>>
>> When the journal process is waiting for a folio to be written back
>> (which contains an ordered block), and the folio also contains unmapped
>> blocks with a block size smaller than the folio size, if the regular
>> writeback process has already started committing this folio (and set the
>> writeback flag), then a deadlock may occur while mapping the remaining
>> unmapped blocks. This is because the writeback flag is cleared only
>> after the entire folio are processed and committed. If we want to support
>> partial folio submit for iomap, we need to be careful to prevent adding
>> additional performance overhead in the case of severe fragmentation.
>
> Yeah, this logic is currently handled by ext4_bio_write_folio(). And the
> deadlocks are currently resolved by grabbing transaction handle before we
> go and lock any page for writeback. But I agree that with iomap it may be
> tricky to keep this scheme.
>
>> Therefore, this aspect of the logic is complicated and subtle. As we
>> discussed in patch 0, if we can avoid using the data=ordered mode in
>> append write and online defrag, then this would be the only remaining
>> corner case. I'm not sure if it is worth implementing this and adjusting
>> the lock ordering.
>>
>>> Another possibility would be to just *submit* the write synchronously and
>>> use data=ordered mode machinery only to wait for IO to complete before the
>>> transaction commits. That way it should be safe to start a transaction
>>
>> IIUC, this solution seems can avoid adjusting the lock ordering, but partial
>> folio submission still needs to be implemented, is my understanding right?
>> This is because although we have already submitted this zeroed partial EOF
>> block, when the journal process is waiting for this folio, this folio is
>> being written back, and there are other blocks in this folio that need to be
>> mapped.
>
> That's a good question. If we submit the tail folio from truncation code,
> we could just submit the full folio write and there's no need to restrict
> ourselves only to mapped blocks. But you are correct that if this IO
> completes but the folio had holes in it and the hole gets filled in by
> write before the transaction with i_disksize update commits, jbd2 commit
> could still race with flush worker writing this folio again and the
> deadlock could happen. Hrm...
>
Yes!
> So how about the following:
Let me see, please correct me if my understanding is wrong, and there are
also some points I don't get.
> We expand our io_end processing with the
> ability to journal i_disksize updates after page writeback completes. Then
> when doing truncate up or appending writes, we keep i_disksize at the old
> value and just zero folio tails in the page cache, mark the folio dirty and
> update i_size.
I think we need to submit this zeroed folio here as well. Because,
1) In the case of truncate up, if we don't submit, the i_disksize may have to
wait a long time (until the folio writeback is complete, which takes about
30 seconds by default) before being updated, which is too long.
2) In the case of appending writes. Assume that the folio written beyond this
one is written back first, we have to wait this zeroed folio to be write
back and then update i_disksize, so we can't wait too long either.
Right?
> When submitting writeback for a folio beyond current
> i_disksize we make sure writepages submits IO for all the folios from
> current i_disksize upwards.
Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
> When io_end processing happens after completed
> folio writeback, we update i_disksize to min(i_size, end of IO).
Yeah, in the case of append write back. Assume we append write the folio 2
and folio 3,
old_idisksize  new_isize
     |             |
  [WWZZ][WWWW][WWWW]
   1 |    2     3
     A
Assume that folio 1 first completes the writeback, then we update i_disksize
to pos A when the writeback is complete. Assume that folio 2 or 3 completes
first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
something like) folio 1 to complete and then update i_disksize to new_isize.
But in the case of truncate up, We will only write back this zeroed folio. If
the new i_size exceeds the end of this folio, how should we update i_disksize
to the correct value?
For example, we truncate the file from old_idisksize to new_isize, but we
only zero and writeback folio 1, in the end_io processing of folio 1, we can
only update the i_disksize to A, but we can never update it to new_isize. Am
I missing something ?
old_idisksize  new_isize
     |             |
  [WWZZ]...hole ...
   1 |
     A
> This
> should take care of non-zero data exposure issues and with "delay map"
> processing Baokun works on all the inode metadata updates will happen after
> IO completion anyway so it will be nicely batched up in one transaction.
Currently, my iomap convert implementation always enables dioread_nolock,
so I feel that this solution can be achieved even without the "delay map"
feature. After we have the "delay map", we can extend this to the
buffer_head path.
Thanks,
Yi.
> It's a big change but so far I think it should work. What do you think?
>
> Honza
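The waiting scheme Yi describes above (a completion for folio 2 or 3 must not advance i_disksize until folio 1, the tail folio, finishes writeback) can be modelled in a few lines of Python. This is a sketch only, not kernel code, and the names are invented:

```python
def complete_io(inode, io_start, io_end):
    """Record a completed writeback IO [io_start, io_end) and advance
    i_disksize only through a contiguous run of completed IO, so the
    tail folio (the one holding the old i_disksize) gates the update."""
    inode["completed"].append((io_start, io_end))
    for start, end in sorted(inode["completed"]):
        if start <= inode["i_disksize"]:
            inode["i_disksize"] = min(inode["i_size"],
                                      max(inode["i_disksize"], end))

# Append writes to folios 2 and 3 raise i_size from 2k to 12k.
inode = {"i_size": 12 * 1024, "i_disksize": 2 * 1024, "completed": []}

complete_io(inode, 4 * 1024, 12 * 1024)   # folios 2+3 finish first
assert inode["i_disksize"] == 2 * 1024    # still waiting for folio 1

complete_io(inode, 0, 4 * 1024)           # tail folio 1 finishes
assert inode["i_disksize"] == 12 * 1024   # now advance to new i_size
```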
On Fri 06-02-26 19:09:53, Zhang Yi wrote:
> On 2/5/2026 11:05 PM, Jan Kara wrote:
> > So how about the following:
>
> Let me see, please correct me if my understanding is wrong, and there are
> also some points I don't get.
>
> > We expand our io_end processing with the
> > ability to journal i_disksize updates after page writeback completes. Then
> > when doing truncate up or appending writes, we keep i_disksize at the old
> > value and just zero folio tails in the page cache, mark the folio dirty and
> > update i_size.
>
> I think we need to submit this zeroed folio here as well. Because,
>
> 1) In the case of truncate up, if we don't submit, the i_disksize may have to
>    wait a long time (until the folio writeback is complete, which takes about
>    30 seconds by default) before being updated, which is too long.

Correct but I'm not sure it matters. Current delalloc writes behave in the
same way already. For simplicity I'd thus prefer to not treat truncate up
in a special way but if we decide this indeed desirable, we can either
submit the tail folio immediately, or schedule work with earlier writeback.

> 2) In the case of appending writes. Assume that the folio written beyond this
>    one is written back first, we have to wait this zeroed folio to be write
>    back and then update i_disksize, so we can't wait too long either.

Correct, update of i_disksize after writeback of folios beyond current
i_disksize is blocked by the writeback of the tail folio.

> > When submitting writeback for a folio beyond current
> > i_disksize we make sure writepages submits IO for all the folios from
> > current i_disksize upwards.
>
> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.

I was worried about a case like:

We have 4k blocksize, file is i_disksize 2k. Now you do:
pwrite(file, buf, 1, 6k);
pwrite(file, buf, 1, 10k);
pwrite(file, buf, 1, 14k);

The pwrite at offset 6k needs to zero the tail of the folio with index 0,
pwrite at 10k needs to zero the tail of the folio with index 1, etc. And
for us to safely advance i_disksize to 14k+1, I thought all the folios (and
zeroed out tails) need to be written out. But that's actually not the case.
We need to make sure the zeroed tail is written out only if the underlying
block is already allocated and marked as written at the time of zeroing.
And the blocks underlying intermediate i_size values will never be allocated
and written without advancing i_disksize to them. So I think you're
correct, we always have at most one tail folio - the one surrounding
current i_disksize - which needs to be written out to safely advance
i_disksize and we don't care about folios inbetween.

> > When io_end processing happens after completed
> > folio writeback, we update i_disksize to min(i_size, end of IO).
>
> Yeah, in the case of append write back. Assume we append write the folio 2
> and folio 3,
>
> old_idisksize  new_isize
>      |             |
>   [WWZZ][WWWW][WWWW]
>    1 |    2     3
>      A
>
> Assume that folio 1 first completes the writeback, then we update i_disksize
> to pos A when the writeback is complete. Assume that folio 2 or 3 completes
> first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
> something like) folio 1 to complete and then update i_disksize to new_isize.
>
> But in the case of truncate up, We will only write back this zeroed folio. If
> the new i_size exceeds the end of this folio, how should we update i_disksize
> to the correct value?
>
> For example, we truncate the file from old_idisksize to new_isize, but we
> only zero and writeback folio 1, in the end_io processing of folio 1, we can
> only update the i_disksize to A, but we can never update it to new_isize. Am
> I missing something ?
>
> old_idisksize  new_isize
>      |             |
>   [WWZZ]...hole ...
>    1 |
>      A

Good question. Based on the analysis above one option would be to setup
writeback of page straddling current i_disksize to update i_disksize to
current i_size on completion. That would be simple but would have an
unpleasant side effect that in case of a crash after append write we could
see increased i_disksize but zeros instead of written data. Another option
would be to update i_disksize on completion to the beginning of the first
dirty folio behind the written back range or i_size if there's no such
folio. This would still be relatively simple and mostly deal with "zeros
instead of data" problem.

> > This
> > should take care of non-zero data exposure issues and with "delay map"
> > processing Baokun works on all the inode metadata updates will happen after
> > IO completion anyway so it will be nicely batched up in one transaction.
>
> Currently, my iomap convert implementation always enables dioread_nolock,

Yes, BTW I think you could remove no-dioread_nolock paths before doing the
conversion to simplify matters a bit. I don't think it's seriously used
anywhere anymore.

> so I feel that this solution can be achieved even without the "delay map"
> feature. After we have the "delay map", we can extend this to the
> buffer_head path.

I agree, delay map is not necessary for this to work. But it will make
things likely faster.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
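The second option above reduces to a small pure function. The following Python sketch (illustrative only; names invented) picks the start of the first still-dirty folio at or beyond the written-back range, falling back to i_size for the truncate-up case where no later folio is dirty:

```python
def disksize_after_writeback(written_end, dirty_folio_starts, i_size):
    """On writeback completion, return the new i_disksize: the start of
    the first still-dirty folio at or beyond the written-back range, or
    i_size when there is no such folio (e.g. after truncate up)."""
    later = [s for s in dirty_folio_starts if s >= written_end]
    return min(min(later), i_size) if later else i_size

# Truncate up: only the zeroed tail folio (0-4k) was written back and
# nothing beyond it is dirty, so i_disksize can go straight to new i_size.
assert disksize_after_writeback(4096, [], 20480) == 20480

# Append write: folios at 4k and 8k are still dirty when the tail folio
# completes, so i_disksize only advances to the first dirty folio.
assert disksize_after_writeback(4096, [4096, 8192], 12288) == 4096
```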
On 2/6/2026 11:35 PM, Jan Kara wrote:
> On Fri 06-02-26 19:09:53, Zhang Yi wrote:
>> On 2/5/2026 11:05 PM, Jan Kara wrote:
>>> So how about the following:
>>
>> Let me see, please correct me if my understanding is wrong, and there are
>> also some points I don't get.
>>
>>> We expand our io_end processing with the
>>> ability to journal i_disksize updates after page writeback completes. Then
>>> when doing truncate up or appending writes, we keep i_disksize at the old
>>> value and just zero folio tails in the page cache, mark the folio dirty and
>>> update i_size.
>>
>> I think we need to submit this zeroed folio here as well. Because,
>>
>> 1) In the case of truncate up, if we don't submit, the i_disksize may have to
>>    wait a long time (until the folio writeback is complete, which takes about
>>    30 seconds by default) before being updated, which is too long.
>
> Correct but I'm not sure it matters. Current delalloc writes behave in the
> same way already. For simplicity I'd thus prefer to not treat truncate up
> in a special way but if we decide this indeed desirable, we can either
> submit the tail folio immediately, or schedule work with earlier writeback.
>
>> 2) In the case of appending writes. Assume that the folio written beyond this
>>    one is written back first, we have to wait this zeroed folio to be write
>>    back and then update i_disksize, so we can't wait too long either.
>
> Correct, update of i_disksize after writeback of folios beyond current
> i_disksize is blocked by the writeback of the tail folio.
>
>>> When submitting writeback for a folio beyond current
>>> i_disksize we make sure writepages submits IO for all the folios from
>>> current i_disksize upwards.
>>
>> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
>
> I was worried about a case like:
>
> We have 4k blocksize, file is i_disksize 2k. Now you do:
> pwrite(file, buf, 1, 6k);
> pwrite(file, buf, 1, 10k);
> pwrite(file, buf, 1, 14k);
>
> The pwrite at offset 6k needs to zero the tail of the folio with index 0,
> pwrite at 10k needs to zero the tail of the folio with index 1, etc. And
> for us to safely advance i_disksize to 14k+1, I thought all the folios (and
> zeroed out tails) need to be written out. But that's actually not the case.
> We need to make sure the zeroed tail is written out only if the underlying
> block is already allocated and marked as written at the time of zeroing.
> And the blocks underlying intermediate i_size values will never be allocated
> and written without advancing i_disksize to them. So I think you're
> correct, we always have at most one tail folio - the one surrounding
> current i_disksize - which needs to be written out to safely advance
> i_disksize and we don't care about folios inbetween.
>
>>> When io_end processing happens after completed
>>> folio writeback, we update i_disksize to min(i_size, end of IO).
>>
>> Yeah, in the case of append write back. Assume we append write the folio 2
>> and folio 3,
>>
>> old_idisksize  new_isize
>>      |             |
>>   [WWZZ][WWWW][WWWW]
>>    1 |    2     3
>>      A
>>
>> Assume that folio 1 first completes the writeback, then we update i_disksize
>> to pos A when the writeback is complete. Assume that folio 2 or 3 completes
>> first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
>> something like) folio 1 to complete and then update i_disksize to new_isize.
>>
>> But in the case of truncate up, We will only write back this zeroed folio. If
>> the new i_size exceeds the end of this folio, how should we update i_disksize
>> to the correct value?
>>
>> For example, we truncate the file from old_idisksize to new_isize, but we
>> only zero and writeback folio 1, in the end_io processing of folio 1, we can
>> only update the i_disksize to A, but we can never update it to new_isize. Am
>> I missing something ?
>>
>> old_idisksize  new_isize
>>      |             |
>>   [WWZZ]...hole ...
>>    1 |
>>      A
>
> Good question. Based on the analysis above one option would be to setup
> writeback of page straddling current i_disksize to update i_disksize to
> current i_size on completion. That would be simple but would have an
> unpleasant side effect that in case of a crash after append write we could
> see increased i_disksize but zeros instead of written data. Another option
> would be to update i_disksize on completion to the beginning of the first
> dirty folio behind the written back range or i_size if there's no such
> folio. This would still be relatively simple and mostly deal with "zeros
> instead of data" problem.

Ha, good idea! I think it should work. I will try the second option, thank
you a lot for this suggestion. :)

>
>>> This
>>> should take care of non-zero data exposure issues and with "delay map"
>>> processing Baokun works on all the inode metadata updates will happen after
>>> IO completion anyway so it will be nicely batched up in one transaction.
>>
>> Currently, my iomap convert implementation always enables dioread_nolock,
>
> Yes, BTW I think you could remove no-dioread_nolock paths before doing the
> conversion to simplify matters a bit. I don't think it's seriously used
> anywhere anymore.

Sure. After removing the no-dioread_nolock paths, the behavior of the
buffer_head path (extents-based and no-journal data mode) and the iomap
path in append write and truncate operations can be made consistent.

Cheers,
Yi.

>> so I feel that this solution can be achieved even without the "delay map"
>> feature. After we have the "delay map", we can extend this to the
>> buffer_head path.
>
> I agree, delay map is not necessary for this to work. But it will make
> things likely faster.
>
> Honza
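Jan's pwrite example earlier in the thread is easy to reproduce from user space. This throwaway Python script (illustrative only; it demonstrates the visible file semantics, not the kernel internals under discussion) shows that the gaps between the sparse single-byte writes read back as zeros, which is exactly the guarantee the i_disksize ordering has to preserve across a crash:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 2048)            # initial file size: 2k
    for off in (6 * 1024, 10 * 1024, 14 * 1024):
        os.pwrite(fd, b"y", off)         # each write extends the file

    size = os.fstat(fd).st_size
    assert size == 14 * 1024 + 1         # i_size ends at 14k + 1

    data = os.pread(fd, size, 0)
    assert data[2048:6144] == b"\0" * 4096   # gap reads back as zeros
    assert data[6 * 1024] == ord("y")        # the byte written at 6k
finally:
    os.close(fd)
    os.remove(path)
```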
On 2026-02-04 22:18, Jan Kara wrote:
> Hi Zhang!
>
> On Wed 04-02-26 14:42:46, Zhang Yi wrote:
>> On 2/3/2026 5:59 PM, Jan Kara wrote:
>>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>>>> Currently, __ext4_block_zero_page_range() is called in the following
>>>> four cases to zero out the data in partial blocks:
>>>>
>>>> 1. Truncate down.
>>>> 2. Truncate up.
>>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>>>> range extending beyond the end of the file (EOF).
>>>> 4. Partial block punch hole.
>>>>
>>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>>>> will write back the zeroed data to the disk through the order mode after
>>>> zeroing out.
>>>>
>>>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>>>> this ordered write. Assuming no one intentionally bypasses the file
>>>> system to write directly to the disk. When performing a truncate down
>>>> operation, ensuring that the data beyond the EOF is zeroed out before
>>>> updating i_disksize is sufficient to prevent old data from being exposed
>>>> when the file is later extended. In other words, as long as the on-disk
>>>> data in case 1 can be properly zeroed out, only the data in memory needs
>>>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>>> Hum, I'm not sure this is correct. The tail block of the file is not
>>> necessarily zeroed out beyond EOF (as mmap writes can race with page
>>> writeback and modify the tail block contents beyond EOF before we really
>>> submit it to the device). Thus after this commit if you truncate up, just
>>> zero out the newly exposed contents in the page cache and dirty it, then
>>> the transaction with the i_disksize update commits (I see nothing
>>> preventing it) and then you crash, you can observe file with the new size
>>> but non-zero content in the newly exposed area. Am I missing something?
>>>
>> Well, I think you are right! I missed the mmap write race condition that
>> happens during the writeback submitting I/O. Thank you a lot for pointing
>> this out. I thought of two possible solutions:
>>
>> 1. We also add explicit writeback operations to the truncate-up and
>> post-EOF append writes. This solution is the most straightforward but
>> may cause some performance overhead. However, since at most only one
>> block is written, the impact is likely limited. Additionally, I
>> observed that the implementation of the XFS file system also adopts a
>> similar approach in its truncate up and down operation. (But it is
>> somewhat strange that XFS also appears to have the same issue with
>> post-EOF append writes; it only zero out the partial block in
>> xfs_file_write_checks(), but it neither explicitly writeback zeroed
>> data nor employs any other mechanism to ensure that the zero data
>> writebacks before the metadata is written to disk.)
>>
>> 2. Resolve this race condition, ensure that there are no non-zero data
>> in the post-EOF partial blocks on the disk. I observed that after the
>> writeback holds the folio lock and calls folio_clear_dirty_for_io(),
>> mmap writes will re-trigger the page fault. Perhaps we can filter out
>> the EOF folio based on i_size in ext4_page_mkwrite(),
>> block_page_mkwrite() and iomap_page_mkwrite(), and then call
>> folio_wait_writeback() to wait for this partial folio writeback to
>> complete. This seems can break the race condition without introducing
>> too much overhead (no?).
>>
>> What do you think? Any other suggestions are also welcome.
> Hum, I like the option 2 because IMO non-zero data beyond EOF is a
> corner-case quirk which unnecessarily complicates rather common paths. But
> I'm not sure we can easily get rid of it. It can happen for example when
> you do appending write inside a block. The page is written back but before
> the transaction with i_disksize update commits we crash. Then again we have
> a non-zero content inside the block beyond EOF.
>
> So the only realistic option I see is to ensure tail of the block gets
> zeroed on disk before the transaction with i_disksize update commits in the
> cases of truncate up or write beyond EOF. data=ordered mode machinery is an
> asynchronous way how to achieve this. We could also just synchronously
> writeback the block where needed but the latency hit of such operation is
> going to be significant so I'm quite sure some workload somewhere will
> notice although the truncate up / write beyond EOF operations triggering this
> are not too common. So why do you need to get rid of these data=ordered
> mode usages? I guess because with iomap keeping our transaction handle ->
> folio lock ordering is complicated? Last time I looked it seemed still
> possible to keep it though.
>
> Another possibility would be to just *submit* the write synchronously and
> use data=ordered mode machinery only to wait for IO to complete before the
> transaction commits. That way it should be safe to start a transaction
> while holding folio lock and thus the iomap conversion would be easier.
>
> Honza

Can we treat EOF blocks as metadata and update them in the same
transaction as i_disksize? Although this would introduce some
management and journaling overhead, it could avoid the deadlock
of "writeback -> start handle -> trigger writeback".

Regards,
Baokun
On Thu 05-02-26 11:27:09, Baokun Li wrote:
> On 2026-02-04 22:18, Jan Kara wrote:
> > Hi Zhang!
> >
> > On Wed 04-02-26 14:42:46, Zhang Yi wrote:
> >> On 2/3/2026 5:59 PM, Jan Kara wrote:
> >>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
> >>>> Currently, __ext4_block_zero_page_range() is called in the following
> >>>> four cases to zero out the data in partial blocks:
> >>>>
> >>>> 1. Truncate down.
> >>>> 2. Truncate up.
> >>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
> >>>> range extending beyond the end of the file (EOF).
> >>>> 4. Partial block punch hole.
> >>>>
> >>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
> >>>> will write back the zeroed data to the disk through the order mode after
> >>>> zeroing out.
> >>>>
> >>>> Among the cases 1,2 and 3 described above, only case 1 actually requires
> >>>> this ordered write. Assuming no one intentionally bypasses the file
> >>>> system to write directly to the disk. When performing a truncate down
> >>>> operation, ensuring that the data beyond the EOF is zeroed out before
> >>>> updating i_disksize is sufficient to prevent old data from being exposed
> >>>> when the file is later extended. In other words, as long as the on-disk
> >>>> data in case 1 can be properly zeroed out, only the data in memory needs
> >>>> to be zeroed out in cases 2 and 3, without requiring ordered data.
> >>> Hum, I'm not sure this is correct. The tail block of the file is not
> >>> necessarily zeroed out beyond EOF (as mmap writes can race with page
> >>> writeback and modify the tail block contents beyond EOF before we really
> >>> submit it to the device). Thus after this commit if you truncate up, just
> >>> zero out the newly exposed contents in the page cache and dirty it, then
> >>> the transaction with the i_disksize update commits (I see nothing
> >>> preventing it) and then you crash, you can observe file with the new size
> >>> but non-zero content in the newly exposed area. Am I missing something?
> >>>
> >> Well, I think you are right! I missed the mmap write race condition that
> >> happens during the writeback submitting I/O. Thank you a lot for pointing
> >> this out. I thought of two possible solutions:
> >>
> >> 1. We also add explicit writeback operations to the truncate-up and
> >> post-EOF append writes. This solution is the most straightforward but
> >> may cause some performance overhead. However, since at most only one
> >> block is written, the impact is likely limited. Additionally, I
> >> observed that the implementation of the XFS file system also adopts a
> >> similar approach in its truncate up and down operation. (But it is
> >> somewhat strange that XFS also appears to have the same issue with
> >> post-EOF append writes; it only zero out the partial block in
> >> xfs_file_write_checks(), but it neither explicitly writeback zeroed
> >> data nor employs any other mechanism to ensure that the zero data
> >> writebacks before the metadata is written to disk.)
> >>
> >> 2. Resolve this race condition, ensure that there are no non-zero data
> >> in the post-EOF partial blocks on the disk. I observed that after the
> >> writeback holds the folio lock and calls folio_clear_dirty_for_io(),
> >> mmap writes will re-trigger the page fault. Perhaps we can filter out
> >> the EOF folio based on i_size in ext4_page_mkwrite(),
> >> block_page_mkwrite() and iomap_page_mkwrite(), and then call
> >> folio_wait_writeback() to wait for this partial folio writeback to
> >> complete. This seems can break the race condition without introducing
> >> too much overhead (no?).
> >>
> >> What do you think? Any other suggestions are also welcome.
> > Hum, I like the option 2 because IMO non-zero data beyond EOF is a
> > corner-case quirk which unnecessarily complicates rather common paths. But
> > I'm not sure we can easily get rid of it. It can happen for example when
> > you do appending write inside a block. The page is written back but before
> > the transaction with i_disksize update commits we crash. Then again we have
> > a non-zero content inside the block beyond EOF.
> >
> > So the only realistic option I see is to ensure tail of the block gets
> > zeroed on disk before the transaction with i_disksize update commits in the
> > cases of truncate up or write beyond EOF. data=ordered mode machinery is an
> > asynchronous way how to achieve this. We could also just synchronously
> > writeback the block where needed but the latency hit of such operation is
> > going to be significant so I'm quite sure some workload somewhere will
> > notice although the truncate up / write beyond EOF operations triggering this
> > are not too common. So why do you need to get rid of these data=ordered
> > mode usages? I guess because with iomap keeping our transaction handle ->
> > folio lock ordering is complicated? Last time I looked it seemed still
> > possible to keep it though.
> >
> > Another possibility would be to just *submit* the write synchronously and
> > use data=ordered mode machinery only to wait for IO to complete before the
> > transaction commits. That way it should be safe to start a transaction
> > while holding folio lock and thus the iomap conversion would be easier.
> >
> > Honza
>
> Can we treat EOF blocks as metadata and update them in the same
> transaction as i_disksize? Although this would introduce some
> management and journaling overhead, it could avoid the deadlock
> of "writeback -> start handle -> trigger writeback".

No, IMHO that would get too difficult. Just look at the hoops data=journal
mode has to jump through to make page cache handling work with the
journalling machinery. And you'd now have that for all the inodes. So I
think some form of data=ordered machinery is much simpler to reason about.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2026-02-05 22:07, Jan Kara wrote: > On Thu 05-02-26 11:27:09, Baokun Li wrote: >> On 2026-02-04 22:18, Jan Kara wrote: >>> Hi Zhang! >>> >>> On Wed 04-02-26 14:42:46, Zhang Yi wrote: >>>> On 2/3/2026 5:59 PM, Jan Kara wrote: >>>>> On Tue 03-02-26 14:25:03, Zhang Yi wrote: >>>>>> Currently, __ext4_block_zero_page_range() is called in the following >>>>>> four cases to zero out the data in partial blocks: >>>>>> >>>>>> 1. Truncate down. >>>>>> 2. Truncate up. >>>>>> 3. Perform block allocation (e.g., fallocate) or append writes across a >>>>>> range extending beyond the end of the file (EOF). >>>>>> 4. Partial block punch hole. >>>>>> >>>>>> If the default ordered data mode is used, __ext4_block_zero_page_range() >>>>>> will write back the zeroed data to the disk through the order mode after >>>>>> zeroing out. >>>>>> >>>>>> Among the cases 1,2 and 3 described above, only case 1 actually requires >>>>>> this ordered write. Assuming no one intentionally bypasses the file >>>>>> system to write directly to the disk. When performing a truncate down >>>>>> operation, ensuring that the data beyond the EOF is zeroed out before >>>>>> updating i_disksize is sufficient to prevent old data from being exposed >>>>>> when the file is later extended. In other words, as long as the on-disk >>>>>> data in case 1 can be properly zeroed out, only the data in memory needs >>>>>> to be zeroed out in cases 2 and 3, without requiring ordered data. >>>>> Hum, I'm not sure this is correct. The tail block of the file is not >>>>> necessarily zeroed out beyond EOF (as mmap writes can race with page >>>>> writeback and modify the tail block contents beyond EOF before we really >>>>> submit it to the device). 
Thus after this commit if you truncate up, just >>>>> zero out the newly exposed contents in the page cache and dirty it, then >>>>> the transaction with the i_disksize update commits (I see nothing >>>>> preventing it) and then you crash, you can observe file with the new size >>>>> but non-zero content in the newly exposed area. Am I missing something? >>>>> >>>> Well, I think you are right! I missed the mmap write race condition that >>>> happens during the writeback submitting I/O. Thank you a lot for pointing >>>> this out. I thought of two possible solutions: >>>> >>>> 1. We also add explicit writeback operations to the truncate-up and >>>> post-EOF append writes. This solution is the most straightforward but >>>> may cause some performance overhead. However, since at most only one >>>> block is written, the impact is likely limited. Additionally, I >>>> observed that the implementation of the XFS file system also adopts a >>>> similar approach in its truncate up and down operation. (But it is >>>> somewhat strange that XFS also appears to have the same issue with >>>> post-EOF append writes; it only zero out the partial block in >>>> xfs_file_write_checks(), but it neither explicitly writeback zeroed >>>> data nor employs any other mechanism to ensure that the zero data >>>> writebacks before the metadata is written to disk.) >>>> >>>> 2. Resolve this race condition, ensure that there are no non-zero data >>>> in the post-EOF partial blocks on the disk. I observed that after the >>>> writeback holds the folio lock and calls folio_clear_dirty_for_io(), >>>> mmap writes will re-trigger the page fault. Perhaps we can filter out >>>> the EOF folio based on i_size in ext4_page_mkwrite(), >>>> block_page_mkwrite() and iomap_page_mkwrite(), and then call >>>> folio_wait_writeback() to wait for this partial folio writeback to >>>> complete. This seems can break the race condition without introducing >>>> too much overhead (no?). >>>> >>>> What do you think? 
>>>> Any other suggestions are also welcome.
>>> Hum, I like option 2 because IMO non-zero data beyond EOF is a
>>> corner-case quirk which unnecessarily complicates rather common paths. But
>>> I'm not sure we can easily get rid of it. It can happen, for example, when
>>> you do an appending write inside a block. The page is written back, but
>>> before the transaction with the i_disksize update commits, we crash. Then
>>> again we have non-zero content inside the block beyond EOF.
>>>
>>> So the only realistic option I see is to ensure the tail of the block gets
>>> zeroed on disk before the transaction with the i_disksize update commits
>>> in the cases of truncate up or write beyond EOF. The data=ordered mode
>>> machinery is an asynchronous way to achieve this. We could also just
>>> synchronously write back the block where needed, but the latency hit of
>>> such an operation is going to be significant, so I'm quite sure some
>>> workload somewhere will notice, although the truncate up / write beyond
>>> EOF operations triggering this are not too common. So why do you need to
>>> get rid of these data=ordered mode usages? I guess because with iomap
>>> keeping our transaction handle -> folio lock ordering is complicated?
>>> Last time I looked it seemed still possible to keep it, though.
>>>
>>> Another possibility would be to just *submit* the write synchronously and
>>> use the data=ordered mode machinery only to wait for the IO to complete
>>> before the transaction commits. That way it should be safe to start a
>>> transaction while holding the folio lock, and thus the iomap conversion
>>> would be easier.
>>>
>>> 								Honza
>> Can we treat EOF blocks as metadata and update them in the same
>> transaction as i_disksize? Although this would introduce some
>> management and journaling overhead, it could avoid the deadlock
>> of "writeback -> start handle -> trigger writeback".
> No, IMHO that would get too difficult.
> Just look at the hoops data=journal
> mode has to jump through to make page cache handling work with the
> journalling machinery. And you'd now have that for all the inodes. So I
> think some form of data=ordered machinery is much simpler to reason about.
>
> 								Honza

Indeed, this is a bit tricky.

Regards,
Baokun