From: Zhang Yi <yi.zhang@huawei.com>
During the process of writing back folios, if
mpage_map_and_submit_extent() exits the extent mapping loop due to an
ENOSPC or ENOMEM error, it may result in stale data or filesystem
inconsistency in environments where the block size is smaller than the
folio size.

When mapping a discontinuous folio in mpage_map_and_submit_extent(),
some buffers may have already been mapped. If we exit the mapping loop
prematurely, the folio data within the mapped range will not be written
back, and the file's disk size will not be updated. Once the transaction
that includes this range of extents is committed, this can lead to stale
data or filesystem inconsistency.

Fix this by submitting the partially mapped folio that is currently
being processed and updating the disk size to the end of the mapped
range.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 48 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3a086fee7989..d0db6e3bf158 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
return 0;
}
+/*
+ * This is used to submit mapped buffers in a single folio that is not fully
+ * mapped for various reasons, such as insufficient space or journal credits.
+ */
+static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
+{
+ struct inode *inode = mpd->inode;
+ struct folio *folio;
+ int ret;
+
+ folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
+ if (IS_ERR(folio))
+ return PTR_ERR(folio);
+
+ ret = mpage_submit_folio(mpd, folio);
+ if (ret)
+ goto out;
+ /*
+ * Update first_page to prevent this folio from being released in
+ * mpage_release_unused_pages(), it should not equal to the folio
+ * index.
+ *
+ * The first_page will be reset to the aligned folio index when this
+ * folio is written again in the next round. Additionally, do not
+ * update wbc->nr_to_write here, as it will be updated once the
+ * entire folio has finished processing.
+ */
+ mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
+ WARN_ON_ONCE((folio->index == mpd->first_page) ||
+ !folio_contains(folio, pos >> PAGE_SHIFT));
+out:
+ folio_unlock(folio);
+ folio_put(folio);
+ return ret;
+}
+
/*
* mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
* mpd->len and submit pages underlying it for IO
@@ -2412,8 +2448,16 @@ static int mpage_map_and_submit_extent(handle_t *handle,
*/
if ((err == -ENOMEM) ||
(err == -ENOSPC && ext4_count_free_clusters(sb))) {
- if (progress)
+ /*
+ * We may have already allocated extents for
+ * some bhs inside the folio, issue the
+ * corresponding data to prevent stale data.
+ */
+ if (progress) {
+ if (mpage_submit_buffers(mpd, disksize))
+ goto invalidate_dirty_pages;
goto update_disksize;
+ }
return err;
}
ext4_msg(sb, KERN_CRIT,
@@ -2432,6 +2476,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
*give_up_on_write = true;
return err;
}
+ disksize = ((loff_t)(map->m_lblk + map->m_len)) <<
+ inode->i_blkbits;
progress = 1;
/*
* Update buffer state, submit mapped pages, and get us new
@@ -2442,12 +2488,12 @@ static int mpage_map_and_submit_extent(handle_t *handle,
goto update_disksize;
} while (map->m_len);
+ disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
update_disksize:
/*
* Update on-disk size after IO is submitted. Races with
* truncate are avoided by checking i_size under i_data_sem.
*/
- disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
int err2;
loff_t i_size;
--
2.46.1
On Wed 11-06-25 19:16:21, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> During the process of writing back folios, if
> mpage_map_and_submit_extent() exits the extent mapping loop due to an
> ENOSPC or ENOMEM error, it may result in stale data or filesystem
> inconsistency in environments where the block size is smaller than the
> folio size.
>
> [...]
>
> +	/*
> +	 * Update first_page to prevent this folio from being released in
> +	 * mpage_release_unused_pages(), it should not equal to the folio
> +	 * index.
> +	 *
> +	 * The first_page will be reset to the aligned folio index when this
> +	 * folio is written again in the next round. Additionally, do not
> +	 * update wbc->nr_to_write here, as it will be updated once the
> +	 * entire folio has finished processing.
> +	 */
> +	mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;

Well, but there can be many folios between mpd->first_page and pos. And
this way you avoid cleaning them up (unlocking them and dropping the
elevated refcount) before we restart the next loop. How is this going to
work?

Also I don't see in this patch where mpd->first_page would get set back to
retry writing this folio. What am I missing?

> [...]
>
> +		disksize = ((loff_t)(map->m_lblk + map->m_len)) <<
> +						inode->i_blkbits;

I don't think setting disksize like this is correct in case
mpage_map_and_submit_buffers() below fails (when the extent covers many
folios and we don't succeed in writing them all). In that case we may need
to keep disksize somewhere in the middle of the extent.

Overall I don't think we need to modify disksize handling here. It is fine
to leave (part of) the extent dangling beyond disksize until we retry the
writeback in these rare cases.

> [...]

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2025/6/20 0:21, Jan Kara wrote:
> On Wed 11-06-25 19:16:21, Zhang Yi wrote:
>> [...]
>> +	mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
>
> Well, but there can be many folios between mpd->first_page and pos. And
> this way you avoid cleaning them up (unlocking them and dropping the
> elevated refcount) before we restart the next loop. How is this going to
> work?

Hmm, I don't think there can be many folios between mpd->first_page and
pos. All of the fully mapped folios should be unlocked by
mpage_folio_done(), and there is no elevated refcount since we always
call folio_batch_release() once we finish processing the folios.
mpage_release_unused_pages() is used to clean up unsubmitted folios.

For example, suppose we have a 4kB block size filesystem and we find 4
order-2 folios that need to be mapped in mpage_prepare_extent_to_map().

 first_page              next_page
  |                       |
 [HHHH][HHHH][HHHH][HHHH]        H: hole    L: locked
  LLLL  LLLL  LLLL  LLLL

In the first round in mpage_map_and_submit_extent(), we map the first
two folios along with the first two pages of the third folio;
mpage_map_and_submit_buffers() should then submit and unlock the first
two folios, while also updating mpd->first_page to the beginning of the
third folio.

             first_page  next_page
              |           |
 [WWWW][WWWW][WWHH][HHHH]        H: hole    L: locked
  UUUU  UUUU  LLLL  LLLL         W: mapped  U: unlocked

In the second round in mpage_map_and_submit_extent(), we fail to map the
remaining blocks and call mpage_submit_buffers() to submit and unlock
this partially mapped folio. Additionally, we advance mpd->first_page.

               first_page
                 |       next_page
                 |        /
 [WWWW][WWWW][WWHH][HHHH]        H: hole    L: locked
  UUUU  UUUU  UUUU  LLLL         W: mapped  U: unlocked

Then, we break out to ext4_do_writepages(), which calls
mpage_release_unused_pages() to unlock the last folio.

               first_page
                 |       next_page
                 |        /
 [WWWW][WWWW][WWHH][HHHH]        H: hole    L: locked
  UUUU  UUUU  UUUU  UUUU         W: mapped  U: unlocked

In the next round in ext4_do_writepages(), mpage_prepare_extent_to_map()
restarts processing the third folio and resets mpd->first_page to the
beginning of it.

             first_page  next_page
              |           |
 [WWWW][WWWW][WWHH][HHHH]        H: hole    L: locked
  UUUU  UUUU  LLLL  LLLL         W: mapped  U: unlocked

> Also I don't see in this patch where mpd->first_page would get set back to
> retry writing this folio. What am I missing?

We already have this setting, please refer to the following section in
mpage_prepare_extent_to_map().

static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
{
	pgoff_t index = mpd->first_page;
	...
	mpd->map.m_len = 0;
	mpd->next_page = index;
	...
	while (index <= end) {
		...
		if (mpd->map.m_len == 0)
			mpd->first_page = folio->index;
		...
}

> [...]
>
> I don't think setting disksize like this is correct in case
> mpage_map_and_submit_buffers() below fails (when the extent covers many
> folios and we don't succeed in writing them all). In that case we may
> need to keep disksize somewhere in the middle of the extent.
>
> Overall I don't think we need to modify disksize handling here. It is
> fine to leave (part of) the extent dangling beyond disksize until we
> retry the writeback in these rare cases.

OK, this is indeed a rare case. Let's keep it as it is.

Thanks,
Yi.

> [...]
>
> 								Honza
On Fri 20-06-25 12:57:18, Zhang Yi wrote:
> On 2025/6/20 0:21, Jan Kara wrote:
>> Well, but there can be many folios between mpd->first_page and pos. And
>> this way you avoid cleaning them up (unlocking them and dropping the
>> elevated refcount) before we restart the next loop. How is this going
>> to work?
>
> Hmm, I don't think there can be many folios between mpd->first_page and
> pos. All of the fully mapped folios should be unlocked by
> mpage_folio_done(), and there is no elevated refcount since we always
> call folio_batch_release() once we finish processing the folios.

Indeed. I forgot that mpage_map_one_extent() will shorten mpd->map->m_len
to the number of currently mapped blocks.

> [...]
>
> In the second round in mpage_map_and_submit_extent(), we fail to map the
> remaining blocks and call mpage_submit_buffers() to submit and unlock
> this partially mapped folio. Additionally, we advance mpd->first_page.
>
>                first_page
>                  |       next_page
>                  |        /
>  [WWWW][WWWW][WWHH][HHHH]        H: hole    L: locked
>   UUUU  UUUU  UUUU  LLLL         W: mapped  U: unlocked

Good. But what if we have a filesystem with 1k blocksize and order 0
folios? I mean a situation like:

 first_page              next_page
  |                       |
 [HHHH][HHHH][HHHH][HHHH]        H: hole   L: locked
  L     L     L     L

Now we map the first two folios.

 first_page              next_page
  |                       |
 [MMMM][MMMM][HHHH][HHHH]        H: hole   L: locked
              L     L            M: mapped

Now mpage_map_one_extent() maps half of the third folio and fails to
extend the transaction further:

 first_page              next_page
  |                       |
 [MMMM][MMMM][MMHH][HHHH]        H: hole   L: locked
              L     L            M: mapped

and mpage_submit_folio() now shifts mpd->first_page like:

                   first_page
                    |    next_page
                    |     |
 [MMMM][MMMM][MMHH][HHHH]        H: hole   L: locked
              L     L            M: mapped

and it never gets reset back?

I suspect you thought that a failure to extend the transaction in the
middle of an order 0 folio should not happen because you reserve credits
for a full page worth of writeback? But those credits could be exhausted
by the time we get to mapping the third folio because
mpage_map_one_extent() only ensures there are credits for mapping one
extent.

And I think reserving credits for just one extent is fine even from the
beginning (as I wrote in my comment to patch 4). We just need to handle
this partial case - which should be possible by just leaving
mpd->first_page untouched and leaving the unlocking to
mpage_release_unused_pages(). But I can be missing some effects, the
writeback code is really complex...

BTW long-term the code may be easier to follow if we replace
mpd->first_page and mpd->next_page with logical block based or byte based
indexing. Now that we have large order folios, the page is not that
important a concept for writeback anymore.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2025/6/20 23:21, Jan Kara wrote:
> On Fri 20-06-25 12:57:18, Zhang Yi wrote:
>> [...]
>
> I suspect you thought that a failure to extend the transaction in the
> middle of an order 0 folio should not happen because you reserve credits
> for a full page worth of writeback? But those credits could be exhausted
> by the time we get to mapping the third folio because
> mpage_map_one_extent() only ensures there are credits for mapping one
> extent.

Ooops, you are right. Sorry, it was my mistake.

> And I think reserving credits for just one extent is fine even from the
> beginning (as I wrote in my comment to patch 4). We just need to handle
> this partial case

Yeah.

> which should be possible by just leaving
> mpd->first_page untouched and leaving the unlocking to
> mpage_release_unused_pages().

I was going to use this solution, but it breaks the semantics of
mpage_release_unused_pages() and triggers
BUG_ON(folio_test_writeback(folio)) in this function. I don't want to
drop this BUG_ON since I think it's somewhat useful.

> But I can be missing some effects, the writeback code is really
> complex...

Indeed, I was confused by this code for a long time. Thank you a lot for
patiently correcting my mistakes in my patch.

> BTW long-term the code may be easier to follow if we replace
> mpd->first_page and mpd->next_page with logical block based or byte
> based indexing. Now that we have large order folios, the page is not
> that important a concept for writeback anymore.

I suppose we should do this conversion now.

Thanks,
Yi.
On Sat 21-06-25 12:42:11, Zhang Yi wrote:
> On 2025/6/20 23:21, Jan Kara wrote:
>> which should be possible by just leaving
>> mpd->first_page untouched and leaving the unlocking to
>> mpage_release_unused_pages().
>
> I was going to use this solution, but it breaks the semantics of
> mpage_release_unused_pages() and triggers
> BUG_ON(folio_test_writeback(folio)) in this function. I don't want to
> drop this BUG_ON since I think it's somewhat useful.

Oh, right. Well, we could modify the BUG_ON to:

	/*
	 * first_page folio can be under writeback if we need to restart
	 * the transaction to map more
	 */
	BUG_ON((invalidate || folio->index > mpd->first_page) &&
	       folio_test_writeback(folio));

>> But I can be missing some effects, the writeback code is really
>> complex...
>
> Indeed, I was confused by this code for a long time. Thank you a lot for
> patiently correcting my mistakes in my patch.

Thank you for taking time to properly fix these issues :)

>> BTW long-term the code may be easier to follow if we replace
>> mpd->first_page and mpd->next_page with logical block based or byte
>> based indexing. Now that we have large order folios, the page is not
>> that important a concept for writeback anymore.
>
> I suppose we should do this conversion now.

Yes or this... I guess it would be more obvious what's going on this way.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR