[v1] ext4: fix insufficient credits when writing back large folios

[PATCH 1/5] ext4: restart handle if credits are insufficient during writepages

Posted by Zhang Yi 8 months, 2 weeks ago

From: Zhang Yi <yi.zhang@huawei.com>

After large folios are supported on ext4, writing back a sufficiently
large and discontinuous folio may consume a significant number of
journal credits, placing considerable strain on the journal. For
example, in a 20GB filesystem with 1K block size and 1MB journal size,
writing back a 2MB folio could require thousands of credits in the
worst-case scenario (when each block is discontinuous and distributed
across different block groups), potentially exceeding the journal size.

Fix this by making the write-back process first reserves credits for one
page and attempts to extend the transaction if the credits are
insufficient. In particular, if the credits for a transaction reach
their upper limit, stop the handle and initiate a new transaction.

Note that since we do not support partial folio writeouts, some blocks
within this folio may have been allocated. These allocated extents are
submitted through the current transaction, but the folio itself is not
submitted. To prevent stale data and potential deadlocks in ordered
mode, only the dioread_nolock mode supports this solution, as it always
allocate unwritten extents.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 57 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 51 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index be9a4cba35fd..5ef34c0c5633 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1680,6 +1680,7 @@ struct mpage_da_data {
 	unsigned int do_map:1;
 	unsigned int scanned_until_end:1;
 	unsigned int journalled_more_data:1;
+	unsigned int continue_map:1;
 };
 
 static void mpage_release_unused_pages(struct mpage_da_data *mpd,
@@ -2367,6 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
  *
  * @handle - handle for journal operations
  * @mpd - extent to map
+ * @needed_blocks - journal credits needed for one writepages iteration
+ * @check_blocks - journal credits needed for map one extent
  * @give_up_on_write - we set this to true iff there is a fatal error and there
  *                     is no hope of writing the data. The caller should discard
  *                     dirty pages to avoid infinite loops.
@@ -2383,6 +2386,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
  */
 static int mpage_map_and_submit_extent(handle_t *handle,
 				       struct mpage_da_data *mpd,
+				       int needed_blocks, int check_blocks,
 				       bool *give_up_on_write)
 {
 	struct inode *inode = mpd->inode;
@@ -2393,6 +2397,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 	ext4_io_end_t *io_end = mpd->io_submit.io_end;
 	struct ext4_io_end_vec *io_end_vec;
 
+	mpd->continue_map = 0;
+
 	io_end_vec = ext4_alloc_io_end_vec(io_end);
 	if (IS_ERR(io_end_vec))
 		return PTR_ERR(io_end_vec);
@@ -2439,6 +2445,34 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 		err = mpage_map_and_submit_buffers(mpd);
 		if (err < 0)
 			goto update_disksize;
+		if (!map->m_len)
+			goto update_disksize;
+
+		/*
+		 * For mapping a folio that is sufficiently large and
+		 * discontinuous, the current handle credits may be
+		 * insufficient, try to extend the handle.
+		 */
+		err = __ext4_journal_ensure_credits(handle, check_blocks,
+				needed_blocks, 0);
+		if (err < 0)
+			goto update_disksize;
+		/*
+		 * The credits for the current handle and transaction have
+		 * reached their upper limit, stop the handle and initiate a
+		 * new transaction. Note that some blocks in this folio may
+		 * have been allocated, and these allocated extents are
+		 * submitted through the current transaction, but the folio
+		 * itself is not submitted. To prevent stale data and
+		 * potential deadlock in ordered mode, only the
+		 * dioread_nolock mode supports this.
+		 */
+		if (err > 0) {
+			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
+			mpd->continue_map = 1;
+			err = 0;
+			goto update_disksize;
+		}
 	} while (map->m_len);
 
 update_disksize:
@@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 		if (!err)
 			err = err2;
 	}
+	if (!err && mpd->continue_map)
+		ext4_get_io_end(io_end);
+
 	return err;
 }
 
@@ -2703,7 +2740,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 	handle_t *handle = NULL;
 	struct inode *inode = mpd->inode;
 	struct address_space *mapping = inode->i_mapping;
-	int needed_blocks, rsv_blocks = 0, ret = 0;
+	int needed_blocks, check_blocks, rsv_blocks = 0, ret = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
 	struct blk_plug plug;
 	bool give_up_on_write = false;
@@ -2825,10 +2862,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 
 	while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
 		/* For each extent of pages we use new io_end */
-		mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
 		if (!mpd->io_submit.io_end) {
-			ret = -ENOMEM;
-			break;
+			mpd->io_submit.io_end =
+					ext4_init_io_end(inode, GFP_KERNEL);
+			if (!mpd->io_submit.io_end) {
+				ret = -ENOMEM;
+				break;
+			}
 		}
 
 		WARN_ON_ONCE(!mpd->can_map);
@@ -2841,10 +2881,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 		 */
 		BUG_ON(ext4_should_journal_data(inode));
 		needed_blocks = ext4_da_writepages_trans_blocks(inode);
+		check_blocks = ext4_chunk_trans_blocks(inode,
+				MAX_WRITEPAGES_EXTENT_LEN);
 
 		/* start a new transaction */
 		handle = ext4_journal_start_with_reserve(inode,
-				EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
+				EXT4_HT_WRITE_PAGE, needed_blocks,
+				mpd->continue_map ? 0 : rsv_blocks);
 		if (IS_ERR(handle)) {
 			ret = PTR_ERR(handle);
 			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
@@ -2861,6 +2904,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 		ret = mpage_prepare_extent_to_map(mpd);
 		if (!ret && mpd->map.m_len)
 			ret = mpage_map_and_submit_extent(handle, mpd,
+					needed_blocks, check_blocks,
 					&give_up_on_write);
 		/*
 		 * Caution: If the handle is synchronous,
@@ -2894,7 +2938,8 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 			ext4_journal_stop(handle);
 		} else
 			ext4_put_io_end(mpd->io_submit.io_end);
-		mpd->io_submit.io_end = NULL;
+		if (ret || !mpd->continue_map)
+			mpd->io_submit.io_end = NULL;
 
 		if (ret == -ENOSPC && sbi->s_journal) {
 			/*
-- 
2.46.1

Re: [PATCH 1/5] ext4: restart handle if credits are insufficient during writepages

Posted by Jan Kara 8 months, 1 week ago

On Fri 30-05-25 14:28:54, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> After large folios are supported on ext4, writing back a sufficiently
> large and discontinuous folio may consume a significant number of
> journal credits, placing considerable strain on the journal. For
> example, in a 20GB filesystem with 1K block size and 1MB journal size,
> writing back a 2MB folio could require thousands of credits in the
> worst-case scenario (when each block is discontinuous and distributed
> across different block groups), potentially exceeding the journal size.
> 
> Fix this by making the write-back process first reserves credits for one
> page and attempts to extend the transaction if the credits are
> insufficient. In particular, if the credits for a transaction reach
> their upper limit, stop the handle and initiate a new transaction.
> 
> Note that since we do not support partial folio writeouts, some blocks
> within this folio may have been allocated. These allocated extents are
> submitted through the current transaction, but the folio itself is not
> submitted. To prevent stale data and potential deadlocks in ordered
> mode, only the dioread_nolock mode supports this solution, as it always
> allocate unwritten extents.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Couple of simplification suggestions below and one bigger issue we need to
deal with.

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index be9a4cba35fd..5ef34c0c5633 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1680,6 +1680,7 @@ struct mpage_da_data {
>  	unsigned int do_map:1;
>  	unsigned int scanned_until_end:1;
>  	unsigned int journalled_more_data:1;
> +	unsigned int continue_map:1;
>  };
>  
>  static void mpage_release_unused_pages(struct mpage_da_data *mpd,
> @@ -2367,6 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>   *
>   * @handle - handle for journal operations
>   * @mpd - extent to map
> + * @needed_blocks - journal credits needed for one writepages iteration
> + * @check_blocks - journal credits needed for map one extent
>   * @give_up_on_write - we set this to true iff there is a fatal error and there
>   *                     is no hope of writing the data. The caller should discard
>   *                     dirty pages to avoid infinite loops.
> @@ -2383,6 +2386,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>   */
>  static int mpage_map_and_submit_extent(handle_t *handle,
>  				       struct mpage_da_data *mpd,
> +				       int needed_blocks, int check_blocks,
>  				       bool *give_up_on_write)
>  {
>  	struct inode *inode = mpd->inode;
> @@ -2393,6 +2397,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  	ext4_io_end_t *io_end = mpd->io_submit.io_end;
>  	struct ext4_io_end_vec *io_end_vec;
>  
> +	mpd->continue_map = 0;
> +
>  	io_end_vec = ext4_alloc_io_end_vec(io_end);
>  	if (IS_ERR(io_end_vec))
>  		return PTR_ERR(io_end_vec);
> @@ -2439,6 +2445,34 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  		err = mpage_map_and_submit_buffers(mpd);
>  		if (err < 0)
>  			goto update_disksize;
> +		if (!map->m_len)
> +			goto update_disksize;
> +
> +		/*
> +		 * For mapping a folio that is sufficiently large and
> +		 * discontinuous, the current handle credits may be
> +		 * insufficient, try to extend the handle.
> +		 */
> +		err = __ext4_journal_ensure_credits(handle, check_blocks,
> +				needed_blocks, 0);
> +		if (err < 0)
> +			goto update_disksize;

IMO it would be more logical to have __ext4_journal_ensure_credits() in
mpage_map_one_extent() where the handle is actually used. Also there it
would be pretty logical to do:

		/* Make sure transaction has enough credits for this extent */
		needed_credits = ext4_chunk_trans_blocks(inode, 1);
		err = ext4_journal_ensure_credits(handle, needed_credits, 0);

No need to extend the transaction by more than we need for this current
extent and also no need to propagate needed credits here.

If __ext4_journal_ensure_credits() cannot extend the transaction, we can
just return -EAGAIN (or something like that) and make sure the retry logic
on ENOSPC or similar transient errors in mpage_map_and_submit_extent()
applies properly.

> +		/*
> +		 * The credits for the current handle and transaction have
> +		 * reached their upper limit, stop the handle and initiate a
> +		 * new transaction. Note that some blocks in this folio may
> +		 * have been allocated, and these allocated extents are
> +		 * submitted through the current transaction, but the folio
> +		 * itself is not submitted. To prevent stale data and
> +		 * potential deadlock in ordered mode, only the
> +		 * dioread_nolock mode supports this.
> +		 */
> +		if (err > 0) {
> +			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
> +			mpd->continue_map = 1;
> +			err = 0;
> +			goto update_disksize;
> +		}
>  	} while (map->m_len);
>  
>  update_disksize:
> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  		if (!err)
>  			err = err2;
>  	}
> +	if (!err && mpd->continue_map)
> +		ext4_get_io_end(io_end);
> +

IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in
ext4_do_writepages() if we see we need to continue doing mapping for the
current io_end.

That way it would be also more obvious that you've just reintroduced
deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
writeback"). This is actually a fundamental thing because for
ext4_journal_stop() to complete, we may need IO on the folio to finish
which means we need io_end to be processed. Even if we avoided the awkward
case with sync handle described in 646caa9c8e196, to be able to start a new
handle we may need to complete a previous transaction commit to be able to
make space in the journal.

Thinking some more about this holding ioend for a folio with partially
submitted IO is also deadlock prone because mpage_prepare_extent_to_map()
can call folio_wait_writeback() which will effectively wait for the last
reference to ioend to be dropped so that underlying extents can be
converted and folio_writeback bit cleared.

So what I think we need to do is that if we submit part of the folio and
cannot submit it all, we just redirty the folio and bail out of the mapping
loop (similarly as in ENOSPC case). Then once IO completes
mpage_prepare_extent_to_map() is able to start working on the folio again.
Since we cleared dirty bits in the buffers we should not be repeating the
work we already did...

								Honza

>  	return err;
>  }
>  
> @@ -2703,7 +2740,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  	handle_t *handle = NULL;
>  	struct inode *inode = mpd->inode;
>  	struct address_space *mapping = inode->i_mapping;
> -	int needed_blocks, rsv_blocks = 0, ret = 0;
> +	int needed_blocks, check_blocks, rsv_blocks = 0, ret = 0;
>  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>  	struct blk_plug plug;
>  	bool give_up_on_write = false;
> @@ -2825,10 +2862,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  
>  	while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
>  		/* For each extent of pages we use new io_end */
> -		mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
>  		if (!mpd->io_submit.io_end) {
> -			ret = -ENOMEM;
> -			break;
> +			mpd->io_submit.io_end =
> +					ext4_init_io_end(inode, GFP_KERNEL);
> +			if (!mpd->io_submit.io_end) {
> +				ret = -ENOMEM;
> +				break;
> +			}
>  		}
>  
>  		WARN_ON_ONCE(!mpd->can_map);
> @@ -2841,10 +2881,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  		 */
>  		BUG_ON(ext4_should_journal_data(inode));
>  		needed_blocks = ext4_da_writepages_trans_blocks(inode);
> +		check_blocks = ext4_chunk_trans_blocks(inode,
> +				MAX_WRITEPAGES_EXTENT_LEN);
>  
>  		/* start a new transaction */
>  		handle = ext4_journal_start_with_reserve(inode,
> -				EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
> +				EXT4_HT_WRITE_PAGE, needed_blocks,
> +				mpd->continue_map ? 0 : rsv_blocks);
>  		if (IS_ERR(handle)) {
>  			ret = PTR_ERR(handle);
>  			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
> @@ -2861,6 +2904,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  		ret = mpage_prepare_extent_to_map(mpd);
>  		if (!ret && mpd->map.m_len)
>  			ret = mpage_map_and_submit_extent(handle, mpd,
> +					needed_blocks, check_blocks,
>  					&give_up_on_write);
>  		/*
>  		 * Caution: If the handle is synchronous,
> @@ -2894,7 +2938,8 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  			ext4_journal_stop(handle);
>  		} else
>  			ext4_put_io_end(mpd->io_submit.io_end);
> -		mpd->io_submit.io_end = NULL;
> +		if (ret || !mpd->continue_map)
> +			mpd->io_submit.io_end = NULL;
>  
>  		if (ret == -ENOSPC && sbi->s_journal) {
>  			/*
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH 1/5] ext4: restart handle if credits are insufficient during writepages

Posted by Zhang Yi 8 months, 1 week ago

On 2025/6/5 22:04, Jan Kara wrote:
> On Fri 30-05-25 14:28:54, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> After large folios are supported on ext4, writing back a sufficiently
>> large and discontinuous folio may consume a significant number of
>> journal credits, placing considerable strain on the journal. For
>> example, in a 20GB filesystem with 1K block size and 1MB journal size,
>> writing back a 2MB folio could require thousands of credits in the
>> worst-case scenario (when each block is discontinuous and distributed
>> across different block groups), potentially exceeding the journal size.
>>
>> Fix this by making the write-back process first reserves credits for one
>> page and attempts to extend the transaction if the credits are
>> insufficient. In particular, if the credits for a transaction reach
>> their upper limit, stop the handle and initiate a new transaction.
>>
>> Note that since we do not support partial folio writeouts, some blocks
>> within this folio may have been allocated. These allocated extents are
>> submitted through the current transaction, but the folio itself is not
>> submitted. To prevent stale data and potential deadlocks in ordered
>> mode, only the dioread_nolock mode supports this solution, as it always
>> allocate unwritten extents.
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Couple of simplification suggestions below and one bigger issue we need to
> deal with.
> 
Thanks a lot for your comments and suggestions.


>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index be9a4cba35fd..5ef34c0c5633 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1680,6 +1680,7 @@ struct mpage_da_data {
>>  	unsigned int do_map:1;
>>  	unsigned int scanned_until_end:1;
>>  	unsigned int journalled_more_data:1;
>> +	unsigned int continue_map:1;
>>  };
>>  
>>  static void mpage_release_unused_pages(struct mpage_da_data *mpd,
>> @@ -2367,6 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>   *
>>   * @handle - handle for journal operations
>>   * @mpd - extent to map
>> + * @needed_blocks - journal credits needed for one writepages iteration
>> + * @check_blocks - journal credits needed for map one extent
>>   * @give_up_on_write - we set this to true iff there is a fatal error and there
>>   *                     is no hope of writing the data. The caller should discard
>>   *                     dirty pages to avoid infinite loops.
>> @@ -2383,6 +2386,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>   */
>>  static int mpage_map_and_submit_extent(handle_t *handle,
>>  				       struct mpage_da_data *mpd,
>> +				       int needed_blocks, int check_blocks,
>>  				       bool *give_up_on_write)
>>  {
>>  	struct inode *inode = mpd->inode;
>> @@ -2393,6 +2397,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  	ext4_io_end_t *io_end = mpd->io_submit.io_end;
>>  	struct ext4_io_end_vec *io_end_vec;
>>  
>> +	mpd->continue_map = 0;
>> +
>>  	io_end_vec = ext4_alloc_io_end_vec(io_end);
>>  	if (IS_ERR(io_end_vec))
>>  		return PTR_ERR(io_end_vec);
>> @@ -2439,6 +2445,34 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  		err = mpage_map_and_submit_buffers(mpd);
>>  		if (err < 0)
>>  			goto update_disksize;
>> +		if (!map->m_len)
>> +			goto update_disksize;
>> +
>> +		/*
>> +		 * For mapping a folio that is sufficiently large and
>> +		 * discontinuous, the current handle credits may be
>> +		 * insufficient, try to extend the handle.
>> +		 */
>> +		err = __ext4_journal_ensure_credits(handle, check_blocks,
>> +				needed_blocks, 0);
>> +		if (err < 0)
>> +			goto update_disksize;
> 
> IMO it would be more logical to have __ext4_journal_ensure_credits() in
> mpage_map_one_extent() where the handle is actually used. Also there it
> would be pretty logical to do:
> 
> 		/* Make sure transaction has enough credits for this extent */
> 		needed_credits = ext4_chunk_trans_blocks(inode, 1);
> 		err = ext4_journal_ensure_credits(handle, needed_credits, 0);
> 
> No need to extend the transaction by more than we need for this current
> extent and also no need to propagate needed credits here.
> 
> If __ext4_journal_ensure_credits() cannot extend the transaction, we can
> just return -EAGAIN (or something like that) and make sure the retry logic
> on ENOSPC or similar transient errors in mpage_map_and_submit_extent()
> applies properly.

Yeah, that would be simpler.

> 
>> +		/*
>> +		 * The credits for the current handle and transaction have
>> +		 * reached their upper limit, stop the handle and initiate a
>> +		 * new transaction. Note that some blocks in this folio may
>> +		 * have been allocated, and these allocated extents are
>> +		 * submitted through the current transaction, but the folio
>> +		 * itself is not submitted. To prevent stale data and
>> +		 * potential deadlock in ordered mode, only the
>> +		 * dioread_nolock mode supports this.
>> +		 */
>> +		if (err > 0) {
>> +			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
>> +			mpd->continue_map = 1;
>> +			err = 0;
>> +			goto update_disksize;
>> +		}
>>  	} while (map->m_len);
>>  
>>  update_disksize:
>> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  		if (!err)
>>  			err = err2;
>>  	}
>> +	if (!err && mpd->continue_map)
>> +		ext4_get_io_end(io_end);
>> +
> 
> IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in
> ext4_do_writepages() if we see we need to continue doing mapping for the
> current io_end.
> 
> That way it would be also more obvious that you've just reintroduced
> deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
> writeback"). This is actually a fundamental thing because for
> ext4_journal_stop() to complete, we may need IO on the folio to finish
> which means we need io_end to be processed. Even if we avoided the awkward
> case with sync handle described in 646caa9c8e196, to be able to start a new
> handle we may need to complete a previous transaction commit to be able to
> make space in the journal.

Yeah, you are right, I missed the full folios that were attached to the
same io_end in the previous rounds. If we continue to use this solution,
I think we should split the io_end and submit the previous one which
includes those full folios before the previous transaction is
committed.

> 
> Thinking some more about this holding ioend for a folio with partially
> submitted IO is also deadlock prone because mpage_prepare_extent_to_map()
> can call folio_wait_writeback() which will effectively wait for the last
> reference to ioend to be dropped so that underlying extents can be
> converted and folio_writeback bit cleared.

I don't understand this one. The mpage_prepare_extent_to_map() should
call folio_wait_writeback() for the current processing partial folio,
not the previous full folios that were attached to the io_end. This is
because mpd->first_page should be moved forward in mpage_folio_done()
once we complete the previous full folio. Besides, in my current
solution, the current partial folio will not be submitted, the
folio_writeback flag will not be set, so how does this deadlock happen?

> 
> So what I think we need to do is that if we submit part of the folio and
> cannot submit it all, we just redirty the folio and bail out of the mapping
> loop (similarly as in ENOSPC case).

After looking at the ENOSPC case again, I found that the handling of
ENOSPC before we enabling large folio is also wrong, it may case stale
data on 1K block size. Suppose we are processing four bhs on a dirty
page. We map the first bh, and the corresponding io_vec is added to the
io_end, with the unwritten flag set. However, when we attempt to map
the second bh, we bail out of the loop due to ENOSPC. At this point,
ext4_do_writepages()->ext4_put_io_end() will convert the extent of the
first bh to written. However, since the folio has not been committed
(mpage_submit_folio() submit a full folio), it will trigger stale data
issue. Is that right? I suppose we also have to write partial folio out
in this case.

> Then once IO completes
> mpage_prepare_extent_to_map() is able to start working on the folio again.
> Since we cleared dirty bits in the buffers we should not be repeating the
> work we already did...
> 

Hmm, it looks like this solution should work. We should introduce a
partial folio version of mpage_submit_folio(), call it and redirty
the folio once we need to bail out of the loop since insufficient
space or journal credits. But ext4_bio_write_folio() will handle the
the logic of fscrypt case, I'm not familiar with fscrypt, so I'm not
sure it could handle the partial page properly. I'll give it a try.

Thanks,
Yi.

> 
>>  	return err;
>>  }
>>  
>> @@ -2703,7 +2740,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  	handle_t *handle = NULL;
>>  	struct inode *inode = mpd->inode;
>>  	struct address_space *mapping = inode->i_mapping;
>> -	int needed_blocks, rsv_blocks = 0, ret = 0;
>> +	int needed_blocks, check_blocks, rsv_blocks = 0, ret = 0;
>>  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>>  	struct blk_plug plug;
>>  	bool give_up_on_write = false;
>> @@ -2825,10 +2862,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  
>>  	while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
>>  		/* For each extent of pages we use new io_end */
>> -		mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
>>  		if (!mpd->io_submit.io_end) {
>> -			ret = -ENOMEM;
>> -			break;
>> +			mpd->io_submit.io_end =
>> +					ext4_init_io_end(inode, GFP_KERNEL);
>> +			if (!mpd->io_submit.io_end) {
>> +				ret = -ENOMEM;
>> +				break;
>> +			}
>>  		}
>>  
>>  		WARN_ON_ONCE(!mpd->can_map);
>> @@ -2841,10 +2881,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  		 */
>>  		BUG_ON(ext4_should_journal_data(inode));
>>  		needed_blocks = ext4_da_writepages_trans_blocks(inode);
>> +		check_blocks = ext4_chunk_trans_blocks(inode,
>> +				MAX_WRITEPAGES_EXTENT_LEN);
>>  
>>  		/* start a new transaction */
>>  		handle = ext4_journal_start_with_reserve(inode,
>> -				EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
>> +				EXT4_HT_WRITE_PAGE, needed_blocks,
>> +				mpd->continue_map ? 0 : rsv_blocks);
>>  		if (IS_ERR(handle)) {
>>  			ret = PTR_ERR(handle);
>>  			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
>> @@ -2861,6 +2904,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  		ret = mpage_prepare_extent_to_map(mpd);
>>  		if (!ret && mpd->map.m_len)
>>  			ret = mpage_map_and_submit_extent(handle, mpd,
>> +					needed_blocks, check_blocks,
>>  					&give_up_on_write);
>>  		/*
>>  		 * Caution: If the handle is synchronous,
>> @@ -2894,7 +2938,8 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  			ext4_journal_stop(handle);
>>  		} else
>>  			ext4_put_io_end(mpd->io_submit.io_end);
>> -		mpd->io_submit.io_end = NULL;
>> +		if (ret || !mpd->continue_map)
>> +			mpd->io_submit.io_end = NULL;
>>  
>>  		if (ret == -ENOSPC && sbi->s_journal) {
>>  			/*
>> -- 
>> 2.46.1
>>

Re: [PATCH 1/5] ext4: restart handle if credits are insufficient during writepages

Posted by Jan Kara 8 months, 1 week ago

On Fri 06-06-25 14:54:21, Zhang Yi wrote:
> On 2025/6/5 22:04, Jan Kara wrote:
> >> +		/*
> >> +		 * The credits for the current handle and transaction have
> >> +		 * reached their upper limit, stop the handle and initiate a
> >> +		 * new transaction. Note that some blocks in this folio may
> >> +		 * have been allocated, and these allocated extents are
> >> +		 * submitted through the current transaction, but the folio
> >> +		 * itself is not submitted. To prevent stale data and
> >> +		 * potential deadlock in ordered mode, only the
> >> +		 * dioread_nolock mode supports this.
> >> +		 */
> >> +		if (err > 0) {
> >> +			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
> >> +			mpd->continue_map = 1;
> >> +			err = 0;
> >> +			goto update_disksize;
> >> +		}
> >>  	} while (map->m_len);
> >>  
> >>  update_disksize:
> >> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
> >>  		if (!err)
> >>  			err = err2;
> >>  	}
> >> +	if (!err && mpd->continue_map)
> >> +		ext4_get_io_end(io_end);
> >> +
> > 
> > IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in
> > ext4_do_writepages() if we see we need to continue doing mapping for the
> > current io_end.
> > 
> > That way it would be also more obvious that you've just reintroduced
> > deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
> > writeback"). This is actually a fundamental thing because for
> > ext4_journal_stop() to complete, we may need IO on the folio to finish
> > which means we need io_end to be processed. Even if we avoided the awkward
> > case with sync handle described in 646caa9c8e196, to be able to start a new
> > handle we may need to complete a previous transaction commit to be able to
> > make space in the journal.
> 
> Yeah, you are right, I missed the full folios that were attached to the
> same io_end in the previous rounds. If we continue to use this solution,
> I think we should split the io_end and submit the previous one which
> includes those full folios before the previous transaction is
> committed.

Yes, fully mapped folios definitely need to be submitted. But I think that
should be handled by ext4_io_submit() call in ext4_do_writepages() loop?

> > Thinking some more about this holding ioend for a folio with partially
> > submitted IO is also deadlock prone because mpage_prepare_extent_to_map()
> > can call folio_wait_writeback() which will effectively wait for the last
> > reference to ioend to be dropped so that underlying extents can be
> > converted and folio_writeback bit cleared.
> 
> I don't understand this one. The mpage_prepare_extent_to_map() should
> call folio_wait_writeback() for the current processing partial folio,
> not the previous full folios that were attached to the io_end. This is
> because mpd->first_page should be moved forward in mpage_folio_done()
> once we complete the previous full folio. Besides, in my current
> solution, the current partial folio will not be submitted, the
> folio_writeback flag will not be set, so how does this deadlock happen?

Sorry, this was me being confused. I went through the path again and indeed
if we cannot map all buffers underlying the folio, we don't clear buffer
(and folio) dirty bits and don't set folio writeback bit so there's no
deadlock there.

> > So what I think we need to do is that if we submit part of the folio and
> > cannot submit it all, we just redirty the folio and bail out of the mapping
> > loop (similarly as in ENOSPC case).
> 
> After looking at the ENOSPC case again, I found that the handling of
> ENOSPC before we enabling large folio is also wrong, it may case stale
> data on 1K block size. Suppose we are processing four bhs on a dirty
> page. We map the first bh, and the corresponding io_vec is added to the
> io_end, with the unwritten flag set. However, when we attempt to map
> the second bh, we bail out of the loop due to ENOSPC. At this point,
> ext4_do_writepages()->ext4_put_io_end() will convert the extent of the
> first bh to written. However, since the folio has not been committed
> (mpage_submit_folio() submit a full folio), it will trigger stale data
> issue. Is that right? I suppose we also have to write partial folio out
> in this case.

Yes, this case will be problematic actually both with dioread_nolock but
also without it (as in this case we create written extents from the start
and we tell JBD2 to only wait for data IO to complete but not to submit
it). We really need to make sure partially mapped folio is submitted for IO
as well in this case.

> > Then once IO completes
> > mpage_prepare_extent_to_map() is able to start working on the folio again.
> > Since we cleared dirty bits in the buffers we should not be repeating the
> > work we already did...
> > 
> 
> Hmm, it looks like this solution should work. We should introduce a
> partial folio version of mpage_submit_folio(), call it and redirty
> the folio once we need to bail out of the loop since insufficient
> space or journal credits. But ext4_bio_write_folio() will handle the
> the logic of fscrypt case, I'm not familiar with fscrypt, so I'm not
> sure it could handle the partial page properly. I'll give it a try.

As far as I can tell it should work fine. The logic in
ext4_bio_write_folio() is already prepared for handling partial folio
writeouts, redirtying of the page etc. (because it needs to handle writeout
from transaction commit where we can writeout only parts of folios with
underlying blocks allocated). We just need to teach mpage_submit_folio() to
substract only written-out number of pages from nr_to_write.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH 1/5] ext4: restart handle if credits are insufficient during writepages

Posted by Zhang Yi 8 months, 1 week ago

On 2025/6/6 21:16, Jan Kara wrote:
> On Fri 06-06-25 14:54:21, Zhang Yi wrote:
>> On 2025/6/5 22:04, Jan Kara wrote:
>>>> +		/*
>>>> +		 * The credits for the current handle and transaction have
>>>> +		 * reached their upper limit, stop the handle and initiate a
>>>> +		 * new transaction. Note that some blocks in this folio may
>>>> +		 * have been allocated, and these allocated extents are
>>>> +		 * submitted through the current transaction, but the folio
>>>> +		 * itself is not submitted. To prevent stale data and
>>>> +		 * potential deadlock in ordered mode, only the
>>>> +		 * dioread_nolock mode supports this.
>>>> +		 */
>>>> +		if (err > 0) {
>>>> +			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
>>>> +			mpd->continue_map = 1;
>>>> +			err = 0;
>>>> +			goto update_disksize;
>>>> +		}
>>>>  	} while (map->m_len);
>>>>  
>>>>  update_disksize:
>>>> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>>>  		if (!err)
>>>>  			err = err2;
>>>>  	}
>>>> +	if (!err && mpd->continue_map)
>>>> +		ext4_get_io_end(io_end);
>>>> +
>>>
>>> IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in
>>> ext4_do_writepages() if we see we need to continue doing mapping for the
>>> current io_end.
>>>
>>> That way it would be also more obvious that you've just reintroduced
>>> deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
>>> writeback"). This is actually a fundamental thing because for
>>> ext4_journal_stop() to complete, we may need IO on the folio to finish
>>> which means we need io_end to be processed. Even if we avoided the awkward
>>> case with sync handle described in 646caa9c8e196, to be able to start a new
>>> handle we may need to complete a previous transaction commit to be able to
>>> make space in the journal.
>>
>> Yeah, you are right, I missed the full folios that were attached to the
>> same io_end in the previous rounds. If we continue to use this solution,
>> I think we should split the io_end and submit the previous one which
>> includes those full folios before the previous transaction is
>> committed.
> 
> Yes, fully mapped folios definitely need to be submitted. But I think that
> should be handled by ext4_io_submit() call in ext4_do_writepages() loop?

Sorry, my previous description may not have been clear enough. The
deadlock issue in this solution should be:

1. In the latest round of ext4_do_writepages(),
   mpage_prepare_extent_to_map() may add some contiguous fully mapped
   folios and an unmapped folio(A) to the current processing io_end.
2. mpage_map_and_submit_extent() mapped some bhs in folio A and bail out
   due to the insufficient journal credits, it acquires one more
   refcount of io_end to prevent ext4_put_io_end() convert the extent
   written status of the newly allcoated extents, since we don't submit
   this partial folio now.
3. ext4_io_submit() in ext4_do_writepages() submits the fully mapped
   folios, but the endio process cannot be invoked properly since the
   above extra refcount, so the writeback state of the folio cannot be
   cleared.
4. Finally, it stops the current handle and waits for the transaction to
   be committed. However, the commit process also waits for those fully
   mapped folios to complete, which can lead to a deadlock.

Therefore, if the io_end contains both fully mapped folios and a partial
folio, we need to split the io_end, the first one contains those mapped
folios, and the second one only contain the partial folio. We only hold
the refcount of the second io_end, the first one can be finished properly,
this can break the deadlock.

However, the solution of submitting the partial folio sounds more simple
to me.

[..]

>>> Then once IO completes
>>> mpage_prepare_extent_to_map() is able to start working on the folio again.
>>> Since we cleared dirty bits in the buffers we should not be repeating the
>>> work we already did...
>>>
>>
>> Hmm, it looks like this solution should work. We should introduce a
>> partial folio version of mpage_submit_folio(), call it and redirty
>> the folio once we need to bail out of the loop since insufficient
>> space or journal credits. But ext4_bio_write_folio() will handle the
>> the logic of fscrypt case, I'm not familiar with fscrypt, so I'm not
>> sure it could handle the partial page properly. I'll give it a try.
> 
> As far as I can tell it should work fine. The logic in
> ext4_bio_write_folio() is already prepared for handling partial folio
> writeouts, redirtying of the page etc. (because it needs to handle writeout
> from transaction commit where we can writeout only parts of folios with
> underlying blocks allocated). We just need to teach mpage_submit_folio() to
> substract only written-out number of pages from nr_to_write.
> 

Yes, indeed. I will try this solution.

Thanks,
Yi.

[PATCH 1/5] ext4: restart handle if credits are insufficient during writepages
[PATCH 2/5] ext4: correct the reserved credits for extent conversion
[PATCH 3/5] ext4/jbd2: reintroduce jbd2_journal_blocks_per_page()
[PATCH 4/5] ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()
[PATCH 5/5] ext4: disable large folios if dioread_nolock is not enabled