ext4: use iomap for regular file's buffered I/O path and enable large folio

[PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode

Posted by Zhang Yi 1 year, 3 months ago

From: Zhang Yi <yi.zhang@huawei.com>

There is no need to write back all data before punching a hole in
data=ordered|writeback mode since it will be dropped soon after removing
space, so just remove the filemap_write_and_wait_range() in these modes.
However, in data=journal mode, we need to write dirty pages out before
discarding page cache in case of crash before committing the freeing
data transaction, which could expose old, stale data.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f8796f7b0f94..94b923afcd9c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
 
 	trace_ext4_punch_hole(inode, offset, length, 0);
 
-	/*
-	 * Write out all dirty pages to avoid race conditions
-	 * Then release them.
-	 */
-	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
-		ret = filemap_write_and_wait_range(mapping, offset,
-						   offset + length - 1);
-		if (ret)
-			return ret;
-	}
-
 	inode_lock(inode);
 
 	/* No need to punch hole beyond i_size */
@@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
 		ret = ext4_update_disksize_before_punch(inode, offset, length);
 		if (ret)
 			goto out_dio;
+
+		/*
+		 * For journalled data we need to write (and checkpoint) pages
+		 * before discarding page cache to avoid inconsitent data on
+		 * disk in case of crash before punching trans is committed.
+		 */
+		if (ext4_should_journal_data(inode)) {
+			ret = filemap_write_and_wait_range(mapping,
+					first_block_offset, last_block_offset);
+			if (ret)
+				goto out_dio;
+		}
+
+		ext4_truncate_folios_range(inode, first_block_offset,
+					   last_block_offset + 1);
 		truncate_pagecache_range(inode, first_block_offset,
 					 last_block_offset);
 	}
-- 
2.46.1

Re: [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode

Posted by Jan Kara 1 year, 2 months ago

On Tue 22-10-24 19:10:34, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> There is no need to write back all data before punching a hole in
> data=ordered|writeback mode since it will be dropped soon after removing
> space, so just remove the filemap_write_and_wait_range() in these modes.
> However, in data=journal mode, we need to write dirty pages out before
> discarding page cache in case of crash before committing the freeing
> data transaction, which could expose old, stale data.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

With the ext4_truncate_page_cache_block_range() function I propose, this
will get slightly simpler. But overall the patch looks good.

								Honza

> ---
>  fs/ext4/inode.c | 26 +++++++++++++++-----------
>  1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f8796f7b0f94..94b923afcd9c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  
>  	trace_ext4_punch_hole(inode, offset, length, 0);
>  
> -	/*
> -	 * Write out all dirty pages to avoid race conditions
> -	 * Then release them.
> -	 */
> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> -		ret = filemap_write_and_wait_range(mapping, offset,
> -						   offset + length - 1);
> -		if (ret)
> -			return ret;
> -	}
> -
>  	inode_lock(inode);
>  
>  	/* No need to punch hole beyond i_size */
> @@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
>  		if (ret)
>  			goto out_dio;
> +
> +		/*
> +		 * For journalled data we need to write (and checkpoint) pages
> +		 * before discarding page cache to avoid inconsitent data on
> +		 * disk in case of crash before punching trans is committed.
> +		 */
> +		if (ext4_should_journal_data(inode)) {
> +			ret = filemap_write_and_wait_range(mapping,
> +					first_block_offset, last_block_offset);
> +			if (ret)
> +				goto out_dio;
> +		}
> +
> +		ext4_truncate_folios_range(inode, first_block_offset,
> +					   last_block_offset + 1);
>  		truncate_pagecache_range(inode, first_block_offset,
>  					 last_block_offset);
>  	}
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode

Posted by Darrick J. Wong 1 year, 2 months ago

On Tue, Oct 22, 2024 at 07:10:34PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> There is no need to write back all data before punching a hole in
> data=ordered|writeback mode since it will be dropped soon after removing
> space, so just remove the filemap_write_and_wait_range() in these modes.
> However, in data=journal mode, we need to write dirty pages out before
> discarding page cache in case of crash before committing the freeing
> data transaction, which could expose old, stale data.

Can't the same thing happen with non-journaled data writes?

Say you write 1GB of "A"s to a file and fsync.  Then you write "B"s to
the same 1GB of file and immediately start punching it.  If the system
reboots before the mapping updates all get written to disk, won't you
risk seeing some of those "A" because we no longer flush the "B"s?

Also, since the program didn't explicitly fsync the Bs, why bother
flushing the dirty data at all?  Are data=journal writes supposed to be
synchronous flushing writes nowadays?

--D

> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/inode.c | 26 +++++++++++++++-----------
>  1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f8796f7b0f94..94b923afcd9c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  
>  	trace_ext4_punch_hole(inode, offset, length, 0);
>  
> -	/*
> -	 * Write out all dirty pages to avoid race conditions
> -	 * Then release them.
> -	 */
> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> -		ret = filemap_write_and_wait_range(mapping, offset,
> -						   offset + length - 1);
> -		if (ret)
> -			return ret;
> -	}
> -
>  	inode_lock(inode);
>  
>  	/* No need to punch hole beyond i_size */
> @@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
>  		if (ret)
>  			goto out_dio;
> +
> +		/*
> +		 * For journalled data we need to write (and checkpoint) pages
> +		 * before discarding page cache to avoid inconsitent data on
> +		 * disk in case of crash before punching trans is committed.
> +		 */
> +		if (ext4_should_journal_data(inode)) {
> +			ret = filemap_write_and_wait_range(mapping,
> +					first_block_offset, last_block_offset);
> +			if (ret)
> +				goto out_dio;
> +		}
> +
> +		ext4_truncate_folios_range(inode, first_block_offset,
> +					   last_block_offset + 1);
>  		truncate_pagecache_range(inode, first_block_offset,
>  					 last_block_offset);
>  	}
> -- 
> 2.46.1
> 
>

Re: [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode

Posted by Zhang Yi 1 year, 2 months ago

On 2024/11/19 7:15, Darrick J. Wong wrote:
> On Tue, Oct 22, 2024 at 07:10:34PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> There is no need to write back all data before punching a hole in
>> data=ordered|writeback mode since it will be dropped soon after removing
>> space, so just remove the filemap_write_and_wait_range() in these modes.
>> However, in data=journal mode, we need to write dirty pages out before
>> discarding page cache in case of crash before committing the freeing
>> data transaction, which could expose old, stale data.
> 
> Can't the same thing happen with non-journaled data writes?
> 
> Say you write 1GB of "A"s to a file and fsync.  Then you write "B"s to
> the same 1GB of file and immediately start punching it.  If the system
> reboots before the mapping updates all get written to disk, won't you
> risk seeing some of those "A" because we no longer flush the "B"s?
> 
> Also, since the program didn't explicitly fsync the Bs, why bother
> flushing the dirty data at all?  Are data=journal writes supposed to be
> synchronous flushing writes nowadays?

Thanks you for your replay.

This case is not exactly the problem that can occur in data=journal
mode, the problem is even if we fsync "B"s before punching the hole, we
may still encounter old data ("A"s or even order) if the system reboots
before the hole-punching process is completed.

The details of this problem is the ext4_punch_hole()->
truncate_pagecache_range()-> ..->journal_unmap_buffer() will drop the
checkpoint transaction, which may contain B's journaled data. Consequently,
the journal tail could move advance beyond this point. If we do not flush
the data before dropping the cache and a crash occurs before the punching
transaction is committed, B's transaction will never recover, resulting
in the loss of B's data. Therefore, this cannot happen in non-journaled
data writes.

This flush logic is copied from ext4_zero_range() since it has the same
problem, Jan added it in commit 783ae448b7a2 ("ext4: Fix special handling
of journalled data from extent zeroing"), please see it for more details.
Jan, please correct me if my understanding is incorrect.

Thanks,
Yi.

> 
> --D
> 
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/ext4/inode.c | 26 +++++++++++++++-----------
>>  1 file changed, 15 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index f8796f7b0f94..94b923afcd9c 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>>  
>>  	trace_ext4_punch_hole(inode, offset, length, 0);
>>  
>> -	/*
>> -	 * Write out all dirty pages to avoid race conditions
>> -	 * Then release them.
>> -	 */
>> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>> -		ret = filemap_write_and_wait_range(mapping, offset,
>> -						   offset + length - 1);
>> -		if (ret)
>> -			return ret;
>> -	}
>> -
>>  	inode_lock(inode);
>>  
>>  	/* No need to punch hole beyond i_size */
>> @@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
>>  		if (ret)
>>  			goto out_dio;
>> +
>> +		/*
>> +		 * For journalled data we need to write (and checkpoint) pages
>> +		 * before discarding page cache to avoid inconsitent data on
>> +		 * disk in case of crash before punching trans is committed.
>> +		 */
>> +		if (ext4_should_journal_data(inode)) {
>> +			ret = filemap_write_and_wait_range(mapping,
>> +					first_block_offset, last_block_offset);
>> +			if (ret)
>> +				goto out_dio;
>> +		}
>> +
>> +		ext4_truncate_folios_range(inode, first_block_offset,
>> +					   last_block_offset + 1);
>>  		truncate_pagecache_range(inode, first_block_offset,
>>  					 last_block_offset);
>>  	}
>> -- 
>> 2.46.1
>>
>>

Re: [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode

Posted by Jan Kara 1 year, 2 months ago

On Wed 20-11-24 10:56:53, Zhang Yi wrote:
> On 2024/11/19 7:15, Darrick J. Wong wrote:
> > On Tue, Oct 22, 2024 at 07:10:34PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> There is no need to write back all data before punching a hole in
> >> data=ordered|writeback mode since it will be dropped soon after removing
> >> space, so just remove the filemap_write_and_wait_range() in these modes.
> >> However, in data=journal mode, we need to write dirty pages out before
> >> discarding page cache in case of crash before committing the freeing
> >> data transaction, which could expose old, stale data.
> > 
> > Can't the same thing happen with non-journaled data writes?
> > 
> > Say you write 1GB of "A"s to a file and fsync.  Then you write "B"s to
> > the same 1GB of file and immediately start punching it.  If the system
> > reboots before the mapping updates all get written to disk, won't you
> > risk seeing some of those "A" because we no longer flush the "B"s?
> > 
> > Also, since the program didn't explicitly fsync the Bs, why bother
> > flushing the dirty data at all?  Are data=journal writes supposed to be
> > synchronous flushing writes nowadays?
> 
> Thanks you for your replay.
> 
> This case is not exactly the problem that can occur in data=journal
> mode, the problem is even if we fsync "B"s before punching the hole, we
> may still encounter old data ("A"s or even order) if the system reboots
> before the hole-punching process is completed.
> 
> The details of this problem is the ext4_punch_hole()->
> truncate_pagecache_range()-> ..->journal_unmap_buffer() will drop the
> checkpoint transaction, which may contain B's journaled data. Consequently,
> the journal tail could move advance beyond this point. If we do not flush
> the data before dropping the cache and a crash occurs before the punching
> transaction is committed, B's transaction will never recover, resulting
> in the loss of B's data. Therefore, this cannot happen in non-journaled
> data writes.

Yes, you're correct. The logic in journal_unmap_buffer() (used when freeing
journaled data blocks) assumes that if there's no running / committing
transaction, then orphan replay is going to fixup possible partial
operations and thus it simply discards the block that's being freed. That
works for truncate but not for hole punch or range zeroing. 

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR